.. _intro-overview:

==================
Scrapy at a glance
==================

Scrapy is an application framework for crawling web sites and extracting
structured data which can be used for a wide range of useful applications, like
data mining, information processing or historical archival.

Even though Scrapy was originally designed for `screen scraping`_ (more
precisely, `web scraping`_), it can also be used to extract data using APIs
(such as `Amazon Associates Web Services`_) or as a general purpose web
crawler.

The purpose of this document is to introduce you to the concepts behind Scrapy
so you can get an idea of how it works and decide if Scrapy is what you need.

When you're ready to start a project, you can :ref:`start with the tutorial
<intro-tutorial>`.

Pick a website
==============

So you need to extract some information from a website, but the website doesn't
provide any API or mechanism to access that info programmatically. Scrapy can
help you extract that information.

Let's say we want to extract the URL, name, description and size of all torrent
files added today on the `Mininova`_ site.

The list of all torrents added today can be found on this page:

http://www.mininova.org/today

.. _intro-overview-item:

Define the data you want to scrape
==================================

The first thing is to define the data we want to scrape. In Scrapy, this is
done through :ref:`Scrapy Items <topics-items>` (Torrent files, in this case).

This would be our Item::

    from scrapy.item import Item, Field

    class TorrentItem(Item):
        url = Field()
        name = Field()
        description = Field()
        size = Field()
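
Items behave like dictionaries, so a quick way to check that the definition
works is to instantiate one and assign a couple of fields. A minimal sketch
(the values below are made up for illustration)::

    torrent = TorrentItem()
    torrent['name'] = 'Darwin - The Evolution Of An Exhibition'
    torrent['size'] = '150.62 megabyte'

Assigning a field that was not declared with ``Field()`` raises a ``KeyError``,
which helps catch typos early.
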

Write a Spider to extract the data
==================================

The next thing is to write a Spider which defines the start URL
(http://www.mininova.org/today), the rules for following links and the rules
for extracting the data from pages.

If we take a look at that page content we'll see that all torrent URLs are like
``http://www.mininova.org/tor/NUMBER`` where ``NUMBER`` is an integer. We'll use
that to construct the regular expression for the links to follow: ``/tor/\d+``.
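
As a quick sanity check, the pattern can be tried against a couple of example
links with Python's ``re`` module (the URLs below are illustrative)::

    import re

    pattern = re.compile(r'/tor/\d+')
    links = [
        'http://www.mininova.org/tor/2676093',  # torrent detail page
        'http://www.mininova.org/today',        # listing page
    ]
    matching = [link for link in links if pattern.search(link)]
    # matching == ['http://www.mininova.org/tor/2676093']

Only the ``/tor/NUMBER`` links match, which is exactly what we want the crawler
to follow.
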
2009-01-07 03:58:15 +00:00
2010-09-05 23:38:37 -03:00
We'll use `XPath`_ for selecting the data to extract from the web page HTML
source. Let's take one of those torrent pages:
2009-01-07 03:58:15 +00:00
2014-01-12 23:46:04 -06:00
http://www.mininova.org/tor/2676093
2009-01-07 03:58:15 +00:00

And look at the page HTML source to construct the XPath to select the data we
want, which is: torrent name, description and size.

.. highlight:: html

By looking at the page HTML source we can see that the file name is contained
inside a ``<h1>`` tag::

    <h1>Darwin - The Evolution Of An Exhibition</h1>

.. highlight:: none

An XPath expression to extract the name could be::

    //h1/text()

.. highlight:: html

And the description is contained inside a ``<div>`` tag with ``id="description"``::

    <h2>Description:</h2>

    <div id="description">
    Short documentary made for Plymouth City Museum and Art Gallery regarding the setup of an exhibit about Charles Darwin in conjunction with the 200th anniversary of his birth.

    ...

.. highlight:: none

An XPath expression to select the description could be::

    //div[@id='description']

.. highlight:: html

Finally, the file size is contained in the second ``<p>`` tag inside the ``<div>``
tag with ``id="specifications"``::

    <div id="specifications">

    <p>
    <strong>Category:</strong>
    <a href="/cat/4">Movies</a> > <a href="/sub/35">Documentary</a>
    </p>

    <p>
    <strong>Total size:</strong>
    150.62 megabyte</p>

.. highlight:: none

An XPath expression to select the file size could be::

    //div[@id='specifications']/p[2]/text()[2]

.. highlight:: python

For more information about XPath see the `XPath reference`_.
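
These expressions can be tried out before writing any spider code. Here's a
minimal sketch, assuming the interesting bits of the page are inlined as a
trimmed-down HTML sample, using Scrapy's ``Selector``::

    from scrapy.selector import Selector

    sample_html = """
    <html><body>
    <h1>Darwin - The Evolution Of An Exhibition</h1>
    <div id="description">Short documentary made for Plymouth City Museum ...</div>
    <div id="specifications">
    <p>
    <strong>Category:</strong>
    <a href="/cat/4">Movies</a>
    </p>
    <p>
    <strong>Total size:</strong>
    150.62 megabyte</p>
    </div>
    </body></html>
    """

    sel = Selector(text=sample_html)
    name = sel.xpath("//h1/text()").extract()  # ['Darwin - The Evolution Of An Exhibition']
    description = sel.xpath("//div[@id='description']").extract()
    size = sel.xpath("//div[@id='specifications']/p[2]/text()[2]").extract()
    # size[0] contains '150.62 megabyte' (plus surrounding whitespace)

The :ref:`interactive shell <topics-shell>` offers the same kind of quick
testing against a live page.
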

Finally, here's the spider code::

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector

    class MininovaSpider(CrawlSpider):

        name = 'mininova'
        allowed_domains = ['mininova.org']
        start_urls = ['http://www.mininova.org/today']

        # follow links matching /tor/NUMBER and pass them to parse_torrent
        rules = [Rule(SgmlLinkExtractor(allow=[r'/tor/\d+']), 'parse_torrent')]

        def parse_torrent(self, response):
            sel = Selector(response)
            torrent = TorrentItem()
            torrent['url'] = response.url
            torrent['name'] = sel.xpath("//h1/text()").extract()
            torrent['description'] = sel.xpath("//div[@id='description']").extract()
            torrent['size'] = sel.xpath("//div[@id='specifications']/p[2]/text()[2]").extract()
            return torrent

The ``TorrentItem`` class is :ref:`defined above <intro-overview-item>`.

Run the spider to extract the data
==================================

Finally, we'll run the spider to crawl the site and output a file
``scraped_data.json`` with the scraped data in JSON format::

    scrapy crawl mininova -o scraped_data.json -t json

This uses :ref:`feed exports <topics-feed-exports>` to generate the JSON file.
You can easily change the export format (XML or CSV, for example) or the
storage backend (FTP or `Amazon S3`_, for example).
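
The export format and location can also be configured project-wide through the
feed export settings rather than command-line options. A minimal sketch of the
relevant lines in the project's ``settings.py`` (the values are illustrative)::

    # settings.py -- write scraped items as CSV to a local file
    FEED_FORMAT = 'csv'
    FEED_URI = 'file:///tmp/scraped_data.csv'

See :ref:`feed exports <topics-feed-exports>` for the supported formats and
storage backends.
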

You can also write an :ref:`item pipeline <topics-item-pipeline>` to store the
items in a database very easily.
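
An item pipeline is just a class with a ``process_item`` method that every
scraped item passes through. A minimal sketch (``store_in_my_database`` is a
hypothetical helper standing in for whatever persistence layer you use)::

    class TorrentStoragePipeline(object):

        def process_item(self, item, spider):
            # called once for every item returned by the spider
            store_in_my_database(item)  # hypothetical helper
            return item

Pipelines are enabled through the ``ITEM_PIPELINES`` setting; see
:ref:`item pipeline <topics-item-pipeline>` for details.
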

Review scraped data
===================

If you check the ``scraped_data.json`` file after the process finishes, you'll
see the scraped items there::

    [{"url": "http://www.mininova.org/tor/2676093", "name": ["Darwin - The Evolution Of An Exhibition"], "description": ["Short documentary made for Plymouth ..."], "size": ["150.62 megabyte"]},
    # ... other items ...
    ]

You'll notice that all field values (except for the ``url`` which was assigned
directly) are actually lists. This is because the :ref:`selectors
<topics-selectors>` return lists. You may want to store single values, or
perform some additional parsing/cleansing to the values. That's what
:ref:`Item Loaders <topics-loaders>` are for.
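
For example, an Item Loader with a ``TakeFirst`` output processor turns those
one-element lists back into single values. A minimal sketch of how
``parse_torrent`` could be rewritten (using the ``scrapy.contrib`` paths shown
elsewhere in this document)::

    from scrapy.contrib.loader import ItemLoader
    from scrapy.contrib.loader.processor import TakeFirst

    class TorrentLoader(ItemLoader):
        default_item_class = TorrentItem
        default_output_processor = TakeFirst()  # keep only the first extracted value

    # drop-in replacement for MininovaSpider.parse_torrent
    def parse_torrent(self, response):
        loader = TorrentLoader(response=response)
        loader.add_value('url', response.url)
        loader.add_xpath('name', "//h1/text()")
        loader.add_xpath('description', "//div[@id='description']")
        loader.add_xpath('size', "//div[@id='specifications']/p[2]/text()[2]")
        return loader.load_item()
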

.. _topics-whatelse:

What else?
==========

You've seen how to extract and store items from a website using Scrapy, but
this is just the surface. Scrapy provides a lot of powerful features for making
scraping easy and efficient, such as:

* Built-in support for :ref:`selecting and extracting <topics-selectors>` data
  from HTML and XML sources

* Built-in support for cleaning and sanitizing the scraped data using a
  collection of reusable filters (called :ref:`Item Loaders <topics-loaders>`)
  shared between all the spiders.

* Built-in support for :ref:`generating feed exports <topics-feed-exports>` in
  multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP,
  S3, local filesystem)

* A media pipeline for :ref:`automatically downloading images <topics-images>`
  (or any other media) associated with the scraped items

* Support for :ref:`extending Scrapy <extending-scrapy>` by plugging
  your own functionality using :ref:`signals <topics-signals>` and a
  well-defined API (middlewares, :ref:`extensions <topics-extensions>`, and
  :ref:`pipelines <topics-item-pipeline>`).

* Wide range of built-in middlewares and extensions for:

  * cookies and session handling
  * HTTP compression
  * HTTP authentication
  * HTTP cache
  * user-agent spoofing
  * robots.txt
  * crawl depth restriction
  * and more

* Robust encoding support and auto-detection, for dealing with foreign,
  non-standard and broken encoding declarations.

* Support for creating spiders based on pre-defined templates, to speed up
  spider creation and make their code more consistent on large projects. See
  the :command:`genspider` command for more details.

* Extensible :ref:`stats collection <topics-stats>` for multiple spider
  metrics, useful for monitoring the performance of your spiders and detecting
  when they get broken

* An :ref:`Interactive shell console <topics-shell>` for trying XPaths, very
  useful for writing and debugging your spiders

* A :ref:`System service <topics-scrapyd>` designed to ease the deployment and
  running of your spiders in production.

* A built-in :ref:`Web service <topics-webservice>` for monitoring and
  controlling your bot

* A :ref:`Telnet console <topics-telnetconsole>` for hooking into a Python
  console running inside your Scrapy process, to introspect and debug your
  crawler

* :ref:`Logging <topics-logging>` facility that you can hook on to for catching
  errors during the scraping process.

* Support for crawling based on URLs discovered through `Sitemaps`_

* A caching DNS resolver

What's next?
============

The next obvious steps are for you to `download Scrapy`_, read :ref:`the
tutorial <intro-tutorial>` and join `the community`_. Thanks for your
interest!

.. _download Scrapy: http://scrapy.org/download/
.. _the community: http://scrapy.org/community/
.. _screen scraping: http://en.wikipedia.org/wiki/Screen_scraping
.. _web scraping: http://en.wikipedia.org/wiki/Web_scraping
.. _Amazon Associates Web Services: http://aws.amazon.com/associates/
.. _Mininova: http://www.mininova.org
.. _XPath: http://www.w3.org/TR/xpath
.. _XPath reference: http://www.w3.org/TR/xpath
.. _Amazon S3: http://aws.amazon.com/s3/
.. _Sitemaps: http://www.sitemaps.org