= SEP-016: Leg Spider =

[[PageOutline(2-5,Contents)]]

||'''SEP:'''||16||
||'''Title:'''||Leg Spider||
||'''Author:'''||Insophia Team||
||'''Created:'''||2010-06-03||
||'''Status:'''||Superseded by [wiki:SEP-018]||

== Introduction ==

This SEP introduces a new kind of spider, called {{{LegSpider}}}, which provides modular functionality that can be plugged into different spiders.

== Rationale ==

The purpose of Leg Spiders is to define an architecture for building spiders out of smaller, well-tested components (called Legs) that can be combined to achieve the desired functionality. These reusable components benefit all Scrapy users by building up a repository of well-tested Legs that can be shared among different spiders and projects. Some of them will come bundled with Scrapy.

Legs can themselves be combined with sub-legs, in a hierarchical fashion. Legs are also spiders themselves, hence the name "Leg Spider".

== {{{LegSpider}}} API ==

A {{{LegSpider}}} is a {{{BaseSpider}}} subclass that adds the following attributes and methods:

 * {{{legs}}}
   * Legs composing this spider
 * {{{process_response(response)}}}
   * Process a (downloaded) response and return a list of requests and items
 * {{{process_request(request)}}}
   * Process a request after it has been extracted and before returning it from the spider
 * {{{process_item(item)}}}
   * Process an item after it has been extracted and before returning it from the spider
 * {{{set_spider(spider)}}}
   * Set the main spider associated with this Leg Spider; the main spider is often used to configure the Leg Spider's behavior

== How Leg Spiders work ==

 1. Each Leg Spider has zero or more Leg Spiders associated with it. When a response arrives, the Leg Spider processes it with its own {{{process_response}}} method and also with the {{{process_response}}} method of all its "sub leg spiders". Finally, the output of all of them is combined to produce the final aggregated output.
 2. Each element of the aggregated output of {{{process_response}}} is processed with either {{{process_item}}} or {{{process_request}}} before being returned from the spider. As with {{{process_response}}}, each item/request is processed with the {{{process_item}}}/{{{process_request}}} methods of all the Leg Spiders composing the spider, including the spider itself.

== Leg Spider examples ==

=== Regex (HTML) Link Extractor ===

A typical application of Leg Spiders is to build link extractors. For example:

{{{
#!python
class RegexHtmlLinkExtractor(LegSpider):

    def process_response(self, response):
        if isinstance(response, HtmlResponse):
            allowed_regexes = self.spider.url_regexes_to_follow
            # extract urls to follow using allowed_regexes
            return [Request(x) for x in urls_to_follow]
        return []

class MySpider(LegSpider):
    legs = [RegexHtmlLinkExtractor()]
    url_regexes_to_follow = ['/product.php?.*']

    def process_response(self, response):
        # parse response and extract items
        return items
}}}
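For illustration, here is one way the elided extraction step above could be filled in: resolve each link against the response URL and keep only those matching the spider's regexes. This is just a sketch, not part of the proposal; the XPath expression and the use of {{{urljoin}}} are arbitrary choices made for the example.

{{{
#!python
import re
from urlparse import urljoin

from scrapy.http import Request, HtmlResponse
from scrapy.selector import HtmlXPathSelector

class RegexHtmlLinkExtractor(LegSpider):

    def process_response(self, response):
        if not isinstance(response, HtmlResponse):
            return []
        allowed_regexes = [re.compile(r) for r in self.spider.url_regexes_to_follow]
        hxs = HtmlXPathSelector(response)
        # resolve relative hrefs against the response URL and keep only those
        # matching one of the spider's allowed regexes
        urls_to_follow = []
        for href in hxs.select('//a/@href').extract():
            url = urljoin(response.url, href)
            if any(r.search(url) for r in allowed_regexes):
                urls_to_follow.append(url)
        return [Request(x) for x in urls_to_follow]
}}}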
=== RSS2 link extractor ===

This is a Leg Spider that can be used for following links from RSS2 feeds:

{{{
#!python
class Rss2LinkExtractor(LegSpider):

    def process_response(self, response):
        if response.headers.get('Content-type') == 'application/rss+xml':
            xs = XmlXPathSelector(response)
            urls = xs.select("//item/link/text()").extract()
            return [Request(x) for x in urls]
        return []
}}}

=== Callback dispatcher based on rules ===

Another example could be to build a callback dispatcher based on rules:

{{{
#!python
class CallbackRules(LegSpider):

    def set_spider(self, spider):
        # the rules can only be compiled once the main spider (which holds
        # callback_rules) is known, so this is done here instead of __init__
        super(CallbackRules, self).set_spider(spider)
        self._rules = {}
        for regex, method_name in spider.callback_rules.items():
            r = re.compile(regex)
            m = getattr(spider, method_name, None)
            if m:
                self._rules[r] = m

    def process_response(self, response):
        for regex, method in self._rules.items():
            if regex.search(response.url):
                return method(response)
        return []

class MySpider(LegSpider):
    legs = [CallbackRules()]
    callback_rules = {
        '/product.php.*': 'parse_product',
        '/category.php.*': 'parse_category',
    }

    def parse_product(self, response):
        # parse response and populate item
        return item
}}}

=== URL Canonicalizers ===

Another example could be for building URL canonicalizers:

{{{
#!python
class CanonicalizeUrl(LegSpider):

    def process_request(self, request):
        curl = canonicalize_url(request.url, rules=self.spider.canonicalization_rules)
        return request.replace(url=curl)

class MySpider(LegSpider):
    legs = [CanonicalizeUrl()]
    canonicalization_rules = ['sort-query-args', 'normalize-percent-encoding', ...]

    # ...
}}}

=== Setting item identifier ===

Another example could be for setting a unique identifier on items, based on certain fields:

{{{
#!python
class ItemIdSetter(LegSpider):

    def process_item(self, item):
        id_field = self.spider.id_field
        id_fields_to_hash = self.spider.id_fields_to_hash
        item[id_field] = make_hash_based_on_fields(item, id_fields_to_hash)
        return item

class MySpider(LegSpider):
    legs = [ItemIdSetter()]
    id_field = 'guid'
    id_fields_to_hash = ['supplier_name', 'supplier_id']

    def process_response(self, response):
        # extract item from response
        return item
}}}

=== Combining multiple leg spiders ===

Here's an example that combines functionality from multiple leg spiders:

{{{
#!python
class MySpider(LegSpider):
    legs = [RegexHtmlLinkExtractor(), CallbackRules(), CanonicalizeUrl(), ItemIdSetter()]
    url_regexes_to_follow = ['/product.php?.*']
    callback_rules = {
        '/product.php.*': 'parse_product',
        '/category.php.*': 'parse_category',
    }
    canonicalization_rules = ['sort-query-args', 'normalize-percent-encoding', ...]
    id_field = 'guid'
    id_fields_to_hash = ['supplier_name', 'supplier_id']

    def parse_product(self, response):
        # extract item from response
        return item

    def parse_category(self, response):
        # extract item from response
        return item
}}}

== Leg Spiders vs Spider middlewares ==

A common question is when to use Leg Spiders and when to use spider middlewares. Leg Spiders are meant to implement spider-specific functionality, such as link extraction with custom rules per spider. Spider middlewares, on the other hand, are meant to implement global functionality.
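To make the contrast concrete, the per-spider {{{CanonicalizeUrl}}} leg above could be turned into a project-wide spider middleware roughly as sketched below. This is only an illustration, not part of the proposal: it assumes the standard {{{process_spider_output}}} hook and the {{{canonicalize_url}}} helper from {{{scrapy.utils.url}}}, and it would need to be enabled through the {{{SPIDER_MIDDLEWARES}}} setting.

{{{
#!python
from scrapy.http import Request
from scrapy.utils.url import canonicalize_url

class CanonicalizeUrlMiddleware(object):
    """Sketch of a spider middleware that canonicalizes the URL of every
    request returned by every spider in the project (global behavior),
    unlike the per-spider CanonicalizeUrl leg shown above."""

    def process_spider_output(self, response, result, spider):
        for r in result:
            if isinstance(r, Request):
                r = r.replace(url=canonicalize_url(r.url))
            yield r
}}}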
== When not to use Leg Spiders ==

Leg Spiders are not a silver bullet for implementing all kinds of spiders, so it's important to keep their scope and limitations in mind:

 * Leg Spiders can't filter duplicate requests, since they don't have access to all requests at the same time. This functionality should be implemented in a spider or scheduler middleware.
 * Leg Spiders are meant for spiders whose behavior (requests and items to extract) depends only on the current page, and not on previously crawled pages (aka "context-free spiders"). If your spider has custom logic with chained downloads (for example, multi-page items), then Leg Spiders may not be a good fit.

== {{{LegSpider}}} proof-of-concept implementation ==

Here's a proof-of-concept implementation of {{{LegSpider}}}:

{{{
#!python
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.spider import BaseSpider
from scrapy.utils.spider import iterate_spider_output


class LegSpider(BaseSpider):
    """A spider made of legs"""

    legs = []

    def __init__(self, *args, **kwargs):
        super(LegSpider, self).__init__(*args, **kwargs)
        # the spider itself acts as the first leg, so its own process_*
        # methods take part in the processing chain
        self._legs = [self] + self.legs[:]
        for l in self._legs:
            l.set_spider(self)

    def parse(self, response):
        res = self._process_response(response)
        for r in res:
            if isinstance(r, BaseItem):
                yield self._process_item(r)
            else:
                yield self._process_request(r)

    def process_response(self, response):
        return []

    def process_request(self, request):
        return request

    def process_item(self, item):
        return item

    def set_spider(self, spider):
        self.spider = spider

    def _process_response(self, response):
        # aggregate the process_response output of the spider and all its legs
        res = []
        for l in self._legs:
            res.extend(iterate_spider_output(l.process_response(response)))
        return res

    def _process_request(self, request):
        for l in self._legs:
            request = l.process_request(request)
        return request

    def _process_item(self, item):
        for l in self._legs:
            item = l.process_item(item)
        return item
}}}
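For reference, here's a minimal sketch of how the proof-of-concept composes at runtime: {{{parse()}}} fans each response out to every leg's {{{process_response}}}, then runs each resulting item through every {{{process_item}}}. The item and leg classes below are hypothetical, made up only for this sketch; depending on the Scrapy version, {{{BaseSpider}}} may also require a {{{name}}} attribute, so one is set on each class. In a real project the spider would be driven by the crawler rather than by calling {{{parse()}}} directly.

{{{
#!python
from scrapy.http import HtmlResponse
from scrapy.item import Item, Field

class PageItem(Item):
    url = Field()   # hypothetical single-field item, used only for this sketch

class PageItemLeg(LegSpider):
    name = 'page_item_leg'

    def process_response(self, response):
        # emit one item per downloaded response
        return [PageItem(url=response.url)]

class UpperCaseUrlLeg(LegSpider):
    name = 'uppercase_url_leg'

    def process_item(self, item):
        # post-process items produced by the other legs (or the spider itself)
        item['url'] = item['url'].upper()
        return item

class MySpider(LegSpider):
    name = 'example'
    legs = [PageItemLeg(), UpperCaseUrlLeg()]

spider = MySpider()
response = HtmlResponse(url='http://example.com/', body='<html></html>')
# yields a single PageItem whose 'url' field was upper-cased by UpperCaseUrlLeg
print list(spider.parse(response))
}}}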