.. _topics-spiders:

=======
Spiders
=======

Spiders are classes which define how a certain site (or domain) will be
scraped, including how to crawl the site and how to extract scraped items from
their pages. In other words, Spiders are the place where you define the custom
behaviour for crawling and parsing pages for a particular site.

For spiders, the scraping cycle goes through something like this:

1. You start by generating the initial Requests to crawl the first URLs, and
   specify a callback function to be called with the response downloaded from
   those requests.

   The first requests to perform are obtained by calling the
   :meth:`~scrapy.spider.BaseSpider.start_requests` method which (by default)
   generates :class:`~scrapy.http.Request` objects for the URLs specified in
   the :attr:`~scrapy.spider.BaseSpider.start_urls` attribute, with the
   :meth:`~scrapy.spider.BaseSpider.parse` method as the callback function
   for those Requests.

2. In the callback function, you parse the response (web page) and return an
   iterable containing either ScrapedItems or Requests, or both. Those Requests
   will also contain a callback (maybe the same one) and will then be
   downloaded by Scrapy, and their responses handed to the specified
   callback.

3. In callback functions, you parse the page contents, typically using
   :ref:`topics-selectors` (but you can also use BeautifulSoup, lxml or
   whatever mechanism you prefer) and generate items with the parsed data.

4. Finally, the items returned from the spider will typically be persisted in
   some Item pipeline.
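
The cycle above can be sketched with a small, framework-agnostic Python
example. The ``Request``, ``Response``, spider class and ``crawl`` loop below
are hypothetical stand-ins for illustration only, not Scrapy's actual classes:

```python
from collections import deque


class Request:
    """A URL to fetch, plus the callback that will handle its response."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class Response:
    """A downloaded page (here just the URL and a fake body)."""
    def __init__(self, url, body):
        self.url = url
        self.body = body


class ExampleSpider:
    start_urls = ["http://example.com/page1"]

    def start_requests(self):
        # Step 1: by default, one Request per start URL, with parse() as callback.
        return [Request(url, self.parse) for url in self.start_urls]

    def parse(self, response):
        # Steps 2-3: parse the response and yield items and/or more Requests.
        yield {"url": response.url, "title": response.body.upper()}
        if response.url.endswith("page1"):
            yield Request("http://example.com/page2", self.parse)


def crawl(spider, download):
    """Minimal engine: schedule requests, download, hand responses to callbacks."""
    items, pending = [], deque(spider.start_requests())
    while pending:
        request = pending.popleft()
        response = download(request)
        for result in request.callback(response):
            if isinstance(result, Request):
                pending.append(result)   # follow the new Request
            else:
                items.append(result)     # step 4: item goes on to a pipeline
    return items


# Fake downloader so the sketch runs without any network access.
fake_pages = {"http://example.com/page1": "first page",
              "http://example.com/page2": "second page"}
items = crawl(ExampleSpider(), lambda r: Response(r.url, fake_pages[r.url]))
print(items)
```

Running the sketch yields one item per crawled page: the first callback
produces an item and a follow-up Request, whose response is in turn handed to
the same callback.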

Even though this cycle applies (more or less) to any kind of spider, there are
different kinds of default spiders bundled into Scrapy for different purposes.
We will talk about those types here.

See :ref:`ref-spiders` for the list of default spiders available in Scrapy.