======= =============================
SEP     16
Title   Leg Spider
Author  Insophia Team
Created 2010-06-03
Status  Superseded by :doc:`sep-018`
======= =============================

===================
SEP-016: Leg Spider
===================

This SEP introduces a new kind of spider called ``LegSpider``, which provides
modular functionality that can be plugged into different spiders.

Rationale
=========

The purpose of Leg Spiders is to define an architecture for building spiders
based on smaller, well-tested components (aka. Legs) that can be combined to
achieve the desired functionality. These reusable components will benefit all
Scrapy users by forming a repository of well-tested legs that can be shared
among different spiders and projects. Some of them will come bundled with
Scrapy.

Legs themselves can also be combined with sub-legs, in a hierarchical fashion.
Legs are also spiders themselves, hence the name "Leg Spider".

``LegSpider`` API
=================

A ``LegSpider`` is a ``BaseSpider`` subclass that adds the following attributes
and methods (see the sketch after this list for an illustration):

- ``legs``

  - legs composing this spider

- ``process_response(response)``

  - Process a (downloaded) response and return a list of requests and items

- ``process_request(request)``

  - Process a request after it has been extracted and before returning it
    from the spider

- ``process_item(item)``

  - Process an item after it has been extracted and before returning it from
    the spider

- ``set_spider(spider)``

  - Defines the main spider associated with this Leg Spider, which is often
    used to configure the Leg Spider behavior.
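
As an illustration only (not part of the proposal itself), a minimal leg that
implements every hook with its default behavior could look like the sketch
below; the class name is hypothetical, and the defaults match the
proof-of-concept implementation at the end of this SEP:

::

    #!python
    class NoopLeg(LegSpider):
        """Hypothetical leg that implements every hook with its default
        behavior, shown only to illustrate the API."""

        def process_response(self, response):
            # return extra requests and/or items extracted from this response
            return []

        def process_request(self, request):
            # adjust (or replace) an extracted request before the spider
            # returns it
            return request

        def process_item(self, item):
            # adjust (or replace) an extracted item before the spider
            # returns it
            return item

        def set_spider(self, spider):
            # the main spider; its attributes are typically used to
            # configure the leg's behavior
            self.spider = spider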

How Leg Spiders work
====================

1. Each Leg Spider has zero or more Leg Spiders associated with it. When a
   response arrives, the Leg Spider processes it with its ``process_response``
   method and also with the ``process_response`` method of all its "sub leg
   spiders". Finally, the output of all of them is combined to produce the
   final aggregated output.

2. Each element of the aggregated output of ``process_response`` is processed
   with either ``process_item`` or ``process_request`` before being returned
   from the spider. As with ``process_response``, each item/request is
   processed with the ``process_{request,item}`` methods of all the leg
   spiders composing the spider, and also with those of the spider itself
   (see the sketch below for a concrete trace).
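
To make these two steps concrete, here is a small, hypothetical trace. It
assumes the proof-of-concept ``LegSpider`` implementation given at the end of
this SEP; the ``Product``, ``TaggerLeg`` and ``ProductSpider`` classes are
illustrative only:

::

    #!python
    from scrapy.http import Request, HtmlResponse
    from scrapy.item import Item, Field

    class Product(Item):
        name = Field()
        tagged = Field()

    class TaggerLeg(LegSpider):

        # step 1: contributes one extra request for every response
        def process_response(self, response):
            return [Request('http://example.com/next')]

        # step 2: post-processes every item before the spider returns it
        def process_item(self, item):
            item['tagged'] = True
            return item

    class ProductSpider(LegSpider):

        name = 'products'
        legs = [TaggerLeg()]

        def process_response(self, response):
            return [Product(name='something')]

    response = HtmlResponse(url='http://example.com/', body='<html></html>',
                            encoding='utf-8')
    results = list(ProductSpider().parse(response))
    # results[0] is the Product, with 'tagged' set by TaggerLeg.process_item()
    # results[1] is the Request contributed by TaggerLeg.process_response()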

Leg Spider examples
===================

Regex (HTML) Link Extractor
---------------------------

A typical application of Leg Spiders is to build link extractors. For example:

::

    #!python
    class RegexHtmlLinkExtractor(LegSpider):

        def process_response(self, response):
            if isinstance(response, HtmlResponse):
                allowed_regexes = self.spider.url_regexes_to_follow
                # extract urls to follow using allowed_regexes
                hxs = HtmlXPathSelector(response)
                urls = [urljoin(response.url, u)
                        for u in hxs.select('//a/@href').extract()]
                urls_to_follow = [u for u in urls
                                  if any(re.search(r, u) for r in allowed_regexes)]
                return [Request(x) for x in urls_to_follow]
            return []

    class MySpider(LegSpider):

        legs = [RegexHtmlLinkExtractor()]
        url_regexes_to_follow = ['/product.php?.*']

        def process_response(self, response):
            # parse response and extract items
            return items

RSS2 link extractor
-------------------

This is a Leg Spider that can be used for following links from RSS2 feeds.

::

    #!python
    class Rss2LinkExtractor(LegSpider):

        def process_response(self, response):
            if response.headers.get('Content-type') == 'application/rss+xml':
                xs = XmlXPathSelector(response)
                urls = xs.select("//item/link/text()").extract()
                return [Request(x) for x in urls]
            return []

Callback dispatcher based on rules
----------------------------------

Another example could be to build a callback dispatcher based on rules:

::

    #!python
    class CallbackRules(LegSpider):

        def set_spider(self, spider):
            # the callback rules can only be compiled once the main spider
            # (which holds callback_rules) is known
            super(CallbackRules, self).set_spider(spider)
            self._rules = {}
            for regex, method_name in self.spider.callback_rules.items():
                r = re.compile(regex)
                m = getattr(self.spider, method_name, None)
                if m:
                    self._rules[r] = m

        def process_response(self, response):
            for regex, method in self._rules.items():
                m = regex.search(response.url)
                if m:
                    return method(response)
            return []

    class MySpider(LegSpider):

        legs = [CallbackRules()]
        callback_rules = {
            '/product.php.*': 'parse_product',
            '/category.php.*': 'parse_category',
        }

        def parse_product(self, response):
            # parse response and populate item
            return item

URL Canonicalizers
------------------

Another example is a Leg Spider for canonicalizing URLs:

::

    #!python
    class CanonicalizeUrl(LegSpider):

        def process_request(self, request):
            curl = canonicalize_url(
                request.url, rules=self.spider.canonicalization_rules)
            return request.replace(url=curl)

    class MySpider(LegSpider):

        legs = [CanonicalizeUrl()]
        canonicalization_rules = ['sort-query-args', 'normalize-percent-encoding', ...]

        # ...

Setting item identifier
-----------------------

Another example is a Leg Spider that sets a unique identifier on items, based
on certain fields:

::

    #!python
    class ItemIdSetter(LegSpider):

        def process_item(self, item):
            id_field = self.spider.id_field
            id_fields_to_hash = self.spider.id_fields_to_hash
            item[id_field] = make_hash_based_on_fields(item, id_fields_to_hash)
            return item

    class MySpider(LegSpider):

        legs = [ItemIdSetter()]
        id_field = 'guid'
        id_fields_to_hash = ['supplier_name', 'supplier_id']

        def process_response(self, response):
            # extract item from response
            return item

Combining multiple leg spiders
------------------------------

Here's an example that combines functionality from multiple leg spiders:

::

    #!python
    class MySpider(LegSpider):

        legs = [RegexHtmlLinkExtractor(), CallbackRules(), CanonicalizeUrl(),
                ItemIdSetter()]

        url_regexes_to_follow = ['/product.php?.*']

        callback_rules = {
            '/product.php.*': 'parse_product',
            '/category.php.*': 'parse_category',
        }

        canonicalization_rules = ['sort-query-args', 'normalize-percent-encoding', ...]

        id_field = 'guid'
        id_fields_to_hash = ['supplier_name', 'supplier_id']

        def parse_product(self, response):
            # parse response and populate item
            return item

        def parse_category(self, response):
            # parse response and populate item
            return item

Leg Spiders vs Spider middlewares
=================================

A common question is when to use Leg Spiders and when to use spider
middlewares. Leg Spiders are meant to implement spider-specific functionality,
such as link extraction, which has custom rules per spider. Spider
middlewares, on the other hand, are meant to implement global functionality,
as in the sketch below.
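
For instance, dropping requests that point outside a configured domain is the
same job in every spider of a project, so it belongs in a spider middleware
rather than in a leg. Below is a rough sketch using Scrapy's
``process_spider_output()`` spider middleware hook; the ``allowed_domain``
attribute it checks is hypothetical:

::

    #!python
    from scrapy.http import Request

    class RequireDomainMiddleware(object):
        """Global functionality: applied to the output of every spider."""

        def process_spider_output(self, response, result, spider):
            for r in result:
                # drop extracted requests that don't mention the spider's
                # (hypothetical) allowed_domain attribute; let items through
                if isinstance(r, Request) and spider.allowed_domain not in r.url:
                    continue
                yield r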

When not to use Leg Spiders
===========================

Leg Spiders are not a silver bullet for implementing all kinds of spiders, so
it's important to keep in mind their scope and limitations, such as:

- Leg Spiders can't filter duplicate requests, since they don't have access to
  all requests at the same time. This functionality should be implemented in a
  spider or scheduler middleware.

- Leg Spiders are meant for spiders whose behavior (requests and items to
  extract) depends only on the current page, and not on previously crawled
  pages (aka. "context-free spiders"). If your spider has custom logic with
  chained downloads (for example, multi-page items), then Leg Spiders may not
  be a good fit.

``LegSpider`` proof-of-concept implementation
=============================================

Here's a proof-of-concept implementation of ``LegSpider``:

::

    #!python
    from scrapy.http import Request
    from scrapy.item import BaseItem
    from scrapy.spider import BaseSpider
    from scrapy.utils.spider import iterate_spider_output


    class LegSpider(BaseSpider):
        """A spider made of legs"""

        legs = []

        def __init__(self, *args, **kwargs):
            super(LegSpider, self).__init__(*args, **kwargs)
            self._legs = [self] + self.legs[:]
            for l in self._legs:
                l.set_spider(self)

        def parse(self, response):
            res = self._process_response(response)
            for r in res:
                if isinstance(r, BaseItem):
                    yield self._process_item(r)
                else:
                    yield self._process_request(r)

        def process_response(self, response):
            return []

        def process_request(self, request):
            return request

        def process_item(self, item):
            return item

        def set_spider(self, spider):
            self.spider = spider

        def _process_response(self, response):
            res = []
            for l in self._legs:
                res.extend(iterate_spider_output(l.process_response(response)))
            return res

        def _process_request(self, request):
            for l in self._legs:
                request = l.process_request(request)
            return request

        def _process_item(self, item):
            for l in self._legs:
                item = l.process_item(item)
            return item