.. _topics-spiders:

=======
Spiders
=======

Spiders are classes which define how a certain site (or domain) will be
scraped, including how to crawl the site and how to extract scraped items from
its pages. In other words, Spiders are the place where you define the custom
behaviour for crawling and parsing pages for a particular site.

For spiders, the scraping cycle goes through something like this:

1. You start by generating the initial Requests to crawl the first URLs, and
   specify a callback function to be called with the response downloaded from
   those requests.

   The first requests to perform are obtained by calling the
   :meth:`~scrapy.spider.BaseSpider.start_requests` method which (by default)
   generates a :class:`~scrapy.http.Request` for each URL specified in
   :attr:`~scrapy.spider.BaseSpider.start_urls`, with the
   :attr:`~scrapy.spider.BaseSpider.parse` method as the callback function for
   those Requests.

2. In the callback function, you parse the response (web page) and return an
   iterable containing ScrapedItems, Requests, or both. Those Requests will
   also contain a callback (maybe the same one); they will then be downloaded
   by Scrapy and their responses handed to the specified callback.

3. In callback functions, you parse the page contents, typically using
   :ref:`topics-selectors` (but you can also use BeautifulSoup, lxml or
   whatever mechanism you prefer) and generate items with the parsed data.

4. Finally, the items returned from the spider will typically be persisted in
   some Item pipeline.

Even though this cycle applies (more or less) to any kind of spider, there are
different kinds of default spiders bundled into Scrapy for different purposes.
We will talk about those types here. A minimal sketch of the cycle is shown
below.

See :ref:`ref-spiders` for the list of default spiders available in Scrapy.
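
As a rough illustration of the cycle above, here is a minimal spider sketch.
The spider name, the ``example.com`` URLs and the XPath expressions are
placeholders, and the ``myproject.items.MyItem`` class (with a ``title``
field) is assumed to be defined elsewhere in your project; exact import paths
and the item API can vary between Scrapy versions, so treat this as a sketch
rather than copy-paste code::

    from urlparse import urljoin

    from scrapy.spider import BaseSpider
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector

    from myproject.items import MyItem  # hypothetical item class with a 'title' field


    class ExampleSpider(BaseSpider):
        """Illustrative spider: one item per page, plus follow-up Requests."""

        name = 'example.com'
        # start_requests() (by default) wraps these URLs in Requests with
        # parse() as the callback (step 1 of the cycle).
        start_urls = ['http://www.example.com/']

        def parse(self, response):
            # Parse the page contents with a selector (step 3).
            hxs = HtmlXPathSelector(response)

            # Generate an item with the parsed data.
            item = MyItem()
            item['title'] = hxs.select('//title/text()').extract()
            yield item

            # Return further Requests to follow, each with a callback
            # (here, parse() again) -- step 2 of the cycle.
            for href in hxs.select('//a/@href').extract():
                yield Request(urljoin(response.url, href), callback=self.parse)

The items yielded by ``parse()`` are then handed to whatever Item pipelines
you have enabled (step 4), while the yielded Requests are scheduled, downloaded
and fed back into the same callback.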