This means that this spider will crawl a website according to a set of crawling rules, and parse the responses you need following the patterns you define through the ``rules`` class attribute.
This attribute is simply a tuple of Rule objects, each of which defines a specific behaviour the spider will follow while crawling the site.
Rule objects accept the following parameters (the ones between [ ] are optional):

* *link_extractor* - A LinkExtractor instance, which defines the crawling patterns for this Rule.
* *[callback]* - A callback to be called for each extracted link that matches the previous link extractor.
* *[cb_kwargs]* - A dictionary of keyword arguments to be passed to the provided callback.
* *[follow]* - A boolean that determines whether links will be extracted from responses matching this Rule or not.
* *[process_links]* - An optional callback for processing (e.g. filtering) the extracted links.
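For instance, a Rule along the following lines could be used (this is only a sketch, using the RegexLinkExtractor imported further below; the exact arguments the link extractor accepts may vary between Scrapy versions)::

    Rule(RegexLinkExtractor(allow=(r'Items/', )), callback='parse_item', follow=True)
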
In this case, the Rule would mean something like "search for any URL containing the string 'Items/', parse it with the 'parse_item' method, and try to extract more links from it".
Now, that's just an example, so we must write our own Rule for our own spider.
But before that, we must set our start_urls to our real entry point (which is not actually Google's homepage).
For this task it's almost mandatory that you have the Firebug extension for Firefox, which allows you to browse through HTML markup in an easy and comfortable way. Otherwise you'd have
to search for tags manually through the body, which can be *very* tedious.
Right, this looks much more interesting: not only do subcategories have subcategories of their own, but they also have links to websites (which is in fact the purpose of the directory).
Now, there's basically one thing to take into account here: apparently, category URLs always look like
*http://www.google.com/Category/Subcategory/Another_Subcategory* (which is not very distinctive, but usable).
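A Rule built around that URL pattern might look something like the following sketch (the regular expression is illustrative; adapt it to the pattern you actually observe)::

    rules = (
        Rule(RegexLinkExtractor(allow=(r'google\.com/[A-Z][a-zA-Z_/]*', )),
             callback='parse_category', follow=True),
    )
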
Basically, we told our Rule object to extract links that contain the string 'google.com/' followed by a capital letter, and then any letters, the '_' character or '/'.
Also, we set 'parse_category' as the callback for each of those extracted links, and decided to keep extracting links from them by setting follow=True.
Until now, our spider would look something like::
    # -*- coding: utf8 -*-
    from scrapy.xpath import HtmlXPathSelector
    from scrapy.item import ScrapedItem
    from scrapy.link.extractors import RegexLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
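The class definition would follow those imports. A rough sketch of how it might continue, putting together the Rule we just discussed (the class name, the domain_name attribute and the start URL are placeholders following early Scrapy conventions, not necessarily this tutorial's real values)::

    class GoogleDirectorySpider(CrawlSpider):
        domain_name = 'google.com'
        # Placeholder: replace with the directory's real entry point.
        start_urls = ['http://www.google.com/']

        rules = (
            Rule(RegexLinkExtractor(allow=(r'google\.com/[A-Z][a-zA-Z_/]*', )),
                 callback='parse_category', follow=True),
        )

        def parse_category(self, response):
            # To be filled in once we define our item below.
            pass
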
Items are the objects we use to represent what you scrape (in this case, links).
Basically, there are two important things about items: attributes and adaptors.
Attributes are nothing more than the places where you store the data you extract, which in this case are the name of the linked website, its URL, and a description.

Now, in most cases you'll have to make certain modifications to this data in order to store it (or do whatever you want to do with it), and this is done through adaptors.
Adaptors are basically a list of functions that receive a value, modify it (or not), and return it.
In this case we used only two functions for adapting (a conceptual sketch of how such an adaptor chain works follows this list):
* An extractor (*extract*), which, as you may imagine, extracts the data from the XPath nodes you provide and returns it as a list.
* *Delist*, which joins the list returned by the previous adaptor into a single string.
  This adaptor is itself a class, because you must specify the delimiter that will join the list; that's why we put an instance of this adaptor in the list, rather than the class itself.
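To make the idea more concrete, here is a purely conceptual sketch of what such an adaptor chain does, written in plain Python rather than Scrapy's actual adaptor API (all the names below are made up for illustration)::

    # Conceptual illustration only -- this is not Scrapy's adaptor API.

    def extract(values):
        # Stand-in for the *extract* adaptor: in Scrapy this would pull the
        # text out of the XPath nodes you provide; here we just clean up a
        # list of strings and return it as a list.
        return [value.strip() for value in values]

    class Delist(object):
        # Stand-in for the *Delist* adaptor: joins a list into one string.
        # It's a class because the joining delimiter must be configured,
        # which is why an *instance* goes into the adaptor list.
        def __init__(self, delimiter=' '):
            self.delimiter = delimiter

        def __call__(self, value):
            return self.delimiter.join(value)

    # The adaptors of an attribute are applied in order: each one receives
    # the previous result, modifies it (or not), and returns it.
    adaptors = [extract, Delist(' ')]

    value = ['  Some website  ', ' and its description ']
    for adaptor in adaptors:
        value = adaptor(value)

    print value  # -> 'Some website and its description'
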