.. _topics-spiders:

=======
Spiders
=======

Spiders are classes which define how a certain site (or domain) will be
scraped, including how to crawl the site and how to extract scraped items from
their pages. In other words, Spiders are the place where you define the custom
behaviour for crawling and parsing pages for a particular site.

For spiders, the scraping cycle goes through something like this:

1. You start by generating the initial Requests to crawl the first URLs, and
   specify a callback function to be called with the response downloaded from
   those requests.

   The first requests to perform are obtained by calling the
   :meth:`~scrapy.spider.BaseSpider.start_requests` method which (by default)
   generates :class:`~scrapy.http.Request` objects for the URLs specified in
   the :attr:`~scrapy.spider.BaseSpider.start_urls` attribute, with the
   :meth:`~scrapy.spider.BaseSpider.parse` method as the callback function
   for those Requests.

2. In the callback function, you parse the response (web page) and return an
   iterable containing either ScrapedItems or Requests, or both. Those Requests
   will also contain a callback (maybe the same one) and will then be
   downloaded by Scrapy, and their responses handed to the specified
   callback.

3. In callback functions, you parse the page contents, typically using
   :ref:`topics-selectors` (but you can also use BeautifulSoup, lxml or
   whatever mechanism you prefer) and generate items with the parsed data.

4. Finally, the items returned from the spider will typically be persisted in
   some Item pipeline.
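
The cycle above can be sketched with a small, framework-agnostic Python
example. The ``Request``, ``Response``, spider class and ``crawl`` loop below
are hypothetical stand-ins for illustration only, not Scrapy's actual classes:

```python
from collections import deque


class Request:
    """A URL to fetch, plus the callback that will handle its response."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class Response:
    """A downloaded page (here just the URL and a fake body)."""
    def __init__(self, url, body):
        self.url = url
        self.body = body


class ExampleSpider:
    start_urls = ["http://example.com/page1"]

    def start_requests(self):
        # Step 1: by default, one Request per start URL, with parse() as callback.
        return [Request(url, self.parse) for url in self.start_urls]

    def parse(self, response):
        # Steps 2-3: parse the response and yield items and/or more Requests.
        yield {"url": response.url, "title": response.body.upper()}
        if response.url.endswith("page1"):
            yield Request("http://example.com/page2", self.parse)


def crawl(spider, download):
    """Minimal engine: schedule requests, download, hand responses to callbacks."""
    items, pending = [], deque(spider.start_requests())
    while pending:
        request = pending.popleft()
        response = download(request)
        for result in request.callback(response):
            if isinstance(result, Request):
                pending.append(result)   # follow the new Request
            else:
                items.append(result)     # step 4: item goes on to a pipeline
    return items


# Fake downloader so the sketch runs without any network access.
fake_pages = {"http://example.com/page1": "first page",
              "http://example.com/page2": "second page"}
items = crawl(ExampleSpider(), lambda r: Response(r.url, fake_pages[r.url]))
print(items)
```

Running the sketch yields one item per crawled page: the first callback
produces an item and a follow-up Request, whose response is in turn handed to
the same callback.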

Even though this cycle applies (more or less) to any kind of spider, there are
different kinds of default spiders bundled into Scrapy for different purposes.
We will talk about those types here.

See :ref:`ref-spiders` for the list of default spiders available in Scrapy.