This means that this spider will crawl a website according to a set of crawling rules, and parse the responses you need following the patterns you define through the ``rules`` class attribute.
This attribute is simply a tuple of Rule objects, each of which defines a specific behaviour the spider will follow while crawling the site.
Rule objects accept the following parameters (the ones between [ ] are optional):

* *link_extractor* - A LinkExtractor instance, which defines the crawling patterns for this Rule.
* *[callback]* - A callback to be called for each extracted link that matches the previous link extractor.
* *[cb_kwargs]* - A dictionary of keyword arguments to be passed to the provided callback.
* *[follow]* - A boolean that determines whether links will be extracted from responses matching this Rule or not.
* *[process_links]* - An optional callback for processing (e.g. filtering) the extracted links.
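For instance, a Rule along the following lines could be used (this is only a sketch, using the RegexLinkExtractor imported further below; the exact arguments the link extractor accepts may vary between Scrapy versions)::

    Rule(RegexLinkExtractor(allow=(r'Items/', )), callback='parse_item', follow=True)
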
In this case, the Rule would mean something like "search for any URL containing the string 'Items/', parse it with the 'parse_item' method, and try to extract more links from it".
Now, that's just an example, so we must write our own Rule for our own spider.
But before that, we must set our start_urls to our real entry point (which is not actually Google's homepage).
For this task it's almost mandatory that you have the Firebug extension for Firefox, which allows you to browse through HTML markup in an easy and comfortable way. Otherwise you'd have
to search for tags manually through the body, which can be *very* tedious.
Right, this looks much more interesting: not only do subcategories have subcategories of their own, but they also have links to websites (which is in fact the purpose of the directory).
Now, there's basically one thing to take into account here: apparently, category URLs always look like
*http://www.google.com/Category/Subcategory/Another_Subcategory* (which is not very distinctive, but usable).
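A Rule built around that URL pattern might look something like the following sketch (the regular expression is illustrative; adapt it to the pattern you actually observe)::

    rules = (
        Rule(RegexLinkExtractor(allow=(r'google\.com/[A-Z][a-zA-Z_/]*', )),
             callback='parse_category', follow=True),
    )
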
Basically, we told our Rule object to extract links that contain the string 'google.com/' followed by a capital letter, and then any letters, the '_' character or '/'.
Also, we set 'parse_category' as the callback for each of those extracted links, and decided to keep extracting links from them by setting follow=True.
Until now, our spider would look something like::
    # -*- coding: utf8 -*-
    from scrapy.xpath import HtmlXPathSelector
    from scrapy.item import ScrapedItem
    from scrapy.link.extractors import RegexLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
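The class definition would follow those imports. A rough sketch of how it might continue, putting together the Rule we just discussed (the class name, the domain_name attribute and the start URL are placeholders following early Scrapy conventions, not necessarily this tutorial's real values)::

    class GoogleDirectorySpider(CrawlSpider):
        domain_name = 'google.com'
        # Placeholder: replace with the directory's real entry point.
        start_urls = ['http://www.google.com/']

        rules = (
            Rule(RegexLinkExtractor(allow=(r'google\.com/[A-Z][a-zA-Z_/]*', )),
                 callback='parse_category', follow=True),
        )

        def parse_category(self, response):
            # To be filled in once we define our item below.
            pass
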
Items are the objects we use to represent what you scrape (in this case, links).
Basically, there are two important things about items: attributes and adaptors.
Attributes are nothing more than the places where you store the data you extract, which in this case are the name of the linked website, its URL, and a description.

Now, in most cases you'll have to make certain modifications to this data in order to store it (or do whatever you want to do with it), and this is done through adaptors.
Adaptors are basically a list of functions that receive a value, modify it (or not), and return it.
In this case we used only two functions for adapting (a conceptual sketch of how such an adaptor chain works follows this list):
* An extractor (*extract*), which, as you may imagine, extracts the data from the XPath nodes you provide and returns it as a list.
* *Delist*, which joins the list returned by the previous adaptor into a single string.
  This adaptor is itself a class, because you must specify the delimiter that will join the list; that's why we put an instance of this adaptor in the list, rather than the class itself.
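To make the idea more concrete, here is a purely conceptual sketch of what such an adaptor chain does, written in plain Python rather than Scrapy's actual adaptor API (all the names below are made up for illustration)::

    # Conceptual illustration only -- this is not Scrapy's adaptor API.

    def extract(values):
        # Stand-in for the *extract* adaptor: in Scrapy this would pull the
        # text out of the XPath nodes you provide; here we just clean up a
        # list of strings and return it as a list.
        return [value.strip() for value in values]

    class Delist(object):
        # Stand-in for the *Delist* adaptor: joins a list into one string.
        # It's a class because the joining delimiter must be configured,
        # which is why an *instance* goes into the adaptor list.
        def __init__(self, delimiter=' '):
            self.delimiter = delimiter

        def __call__(self, value):
            return self.delimiter.join(value)

    # The adaptors of an attribute are applied in order: each one receives
    # the previous result, modifies it (or not), and returns it.
    adaptors = [extract, Delist(' ')]

    value = ['  Some website  ', ' and its description ']
    for adaptor in adaptors:
        value = adaptor(value)

    print value  # -> 'Some website and its description'
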