
Modified Scrapy tutorial

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40502
This commit is contained in:
elpolilla 2008-12-16 00:39:48 +00:00
parent f9931518aa
commit 14b2a4ff81
2 changed files with 121 additions and 19 deletions


@@ -2,17 +2,17 @@
Setting everything up
=====================
| In this tutorial, we'll teach you how to scrape http://www.dmoz.org, a websites directory.
| In this tutorial, we'll teach you how to scrape http://www.google.com/dirhp, Google's web directory.
| We'll assume that you've checked and fulfilled the requirements listed on the Download page, and that Scrapy is already installed on your system.
To start a new project, enter the directory where you'd like your project to be located and run::
scrapy-admin.py startproject dmoz
scrapy-admin.py startproject google
As long as Scrapy is well installed and the path is set, this should create a directory called "dmoz"
As long as Scrapy is properly installed and on your path, this should create a directory called "google"
containing the following files (a sketch of the resulting layout follows the list):
* *scrapy-ctl.py* - the project's control script. It's used for running the different tasks (like "crawl" and "parse"). We'll talk more about this later.
* *scrapy-ctl.py* - the project's control script. It's used for running the different tasks (like "genspider", "crawl" and "parse"). We'll talk more about this later.
* *scrapy_settings.py* - the project's settings file.
* *items.py* - where you define the different kinds of items you're going to scrape.
* *spiders* - directory where you'll later place your spiders.
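To give you a rough idea, the freshly created project should be laid out more or less like this (just a sketch based on the files listed above; details may vary between Scrapy versions)::
google/
    scrapy-ctl.py        # project control script
    scrapy_settings.py   # project settings
    items.py             # item definitions
    spiders/             # your spiders will be placed here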


@@ -3,36 +3,138 @@ Our first spider
================
| Ok, the time to write our first spider has come.
| Make sure you're standing on your project's directory and run:
| Make sure that you're inside your project's directory and run:
::
./scrapy-ctl genspider dmoz dmoz.org
./scrapy-ctl.py genspider google_directory google.com
This should create a file called dmoz.py under the *spiders* directory looking similar to this::
This should create a file called google_directory.py under the *spiders* directory looking like this::
# -*- coding: utf8 -*-<
# -*- coding: utf8 -*-
import re
from scrapy.xpath import HtmlXPathSelector
from scrapy.item import ScrapedItem
from scrapy.link.extractors import RegexLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
class DmozSpider(CrawlSpider):
    domain_name = "dmoz.org"
    start_urls = ['http://www.dmoz.org/']
class GoogleDirectorySpider(CrawlSpider):
    domain_name = 'google.com'
    start_urls = ['http://www.google.com/']
    rules = (
        Rule(RegexLinkExtractor(allow=(r'Items/', ), 'parse_item', follow=True)
        Rule(RegexLinkExtractor(allow=(r'Items/', )), 'parse_item', follow=True),
    )
    def parse_item(self, response):
        #xs = HtmlXPathSelector(response)
        #i = ScrapedItem()
        #i.attribute('site_id', xs.x("//input[@id="sid"]/@value"))
        #i.attribute('name', xs.x("//div[@id='name']"))
        #i.attribute('description', xs.x("//div[@id='description']"))
        #return [i]
        xs = HtmlXPathSelector(response)
        i = ScrapedItem()
        #i.attribute('site_id', xs.x('//input[@id="sid"]/@value'))
        #i.attribute('name', xs.x('//div[@id="name"]'))
        #i.attribute('description', xs.x('//div[@id="description"]'))
        return [i]
SPIDER = DmozSpider()
SPIDER = GoogleDirectorySpider()
| Now, let's explain a bit what this is all about.
| As you may have noticed, the class that represents the spider is GoogleDirectorySpider, and it inherits from CrawlSpider.
| This means that this spider will crawl a website according to a set of crawling rules, and parse the responses you're interested in according to the patterns you define through the "rules" class attribute.
| This attribute is nothing more than a tuple of Rule objects. Each Rule defines a specific behaviour the spider will follow while crawling the site.
| Rule objects accept the following parameters (the ones between [ ] are optional; a short example using several of them follows the list):
* *link_extractor* - A LinkExtractor instance, which defines the crawling patterns for this Rule.
* *[callback]* - A callback to be called for each link extracted by the previous link extractor.
* *[cb_kwargs]* - A dictionary of keyword arguments to be passed to the provided callback.
* *[follow]* - A boolean that determines whether more links should be extracted from responses matching this Rule or not.
* *[process_links]* - An optional callback for processing the extracted links.
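For example, a Rule combining several of these parameters could look like the following sketch (the callback name, the cb_kwargs contents and the link-processing hook are made up for illustration, and the exact signature may vary slightly between Scrapy versions)::
rules = (
    Rule(RegexLinkExtractor(allow=(r'some_section/', )),  # link_extractor: which links to pick up
         'parse_something',              # callback: spider method that parses matching pages
         cb_kwargs={'section': 'demo'},  # extra keyword arguments passed to the callback
         follow=True,                    # keep extracting links from matching responses
         process_links='filter_links',   # optional hook to process the extracted links
    ),
)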
| In this case, the Rule would mean something like "search for any url containing the string 'Items/', parse it with the 'parse_item' method, and try to extract more links from it".
| Now, that was just an example, so we need to write our own Rule for our own spider.
| But before that, we must set our start_urls to our real entry point (which is not actually Google's homepage).
So we replace that line with::
start_urls = ['http://www.google.com/dirhp']
Now it's time to browse that page and see how we can extract data from it.
For this task it's almost mandatory to have Firefox's Firebug extension, which lets you browse the HTML markup easily and comfortably. Otherwise you'd have
to search for tags manually through the body, which can be *very* tedious.
[IMG]
What we see at first sight is that the directory is divided into categories, which are in turn divided into subcategories.
However, it seems there are more subcategories than the ones shown on this page, so we'll keep looking...
[IMG]
Hmmkay... The only new thing here is that there are lots of subcategories. Let's see what's inside them...
[IMG]
Right, this looks more interesting. Not only do the subcategories have subcategories of their own, but they also contain links to websites (which is in fact the purpose of the directory).
Now, there's basically one thing to take into account here: apparently, category URLs always look like
*http://www.google.com/Category/Subcategory/Another_Subcategory* (which is not very distinctive actually, but usable).
So, having said that, a possible rule set for the categories could be::
rules = (
    Rule(RegexLinkExtractor(allow=('google.com/[A-Z][a-zA-Z_/]+$', ), ),
         'parse_category',
         follow=True,
    ),
)
| Basically, we told our Rule object to extract links whose URL contains the string 'google.com/' followed by a capital letter and then any combination of letters, the '_' character or '/', up to the end of the URL.
| We also set 'parse_category' as the callback for each of those crawled links, and decided to keep extracting links from them with follow=True.
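If you want to convince yourself of what that pattern matches, here's a quick check with Python's re module (the URLs are just the examples used in this tutorial)::
import re
pattern = re.compile(r'google.com/[A-Z][a-zA-Z_/]+$')
# A category-style URL matches the pattern...
print(bool(pattern.search('http://www.google.com/Top/Arts/Awards/')))  # True
# ...while the directory homepage does not (no capital letter after 'google.com/').
print(bool(pattern.search('http://www.google.com/dirhp')))             # False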
Until now, our spider would look something like::
# -*- coding: utf8 -*-
from scrapy.xpath import HtmlXPathSelector
from scrapy.item import ScrapedItem
from scrapy.link.extractors import RegexLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class GoogleDirectorySpider(CrawlSpider):
    domain_name = 'google.com'
    start_urls = ['http://www.google.com/dirhp']

    rules = (
        Rule(RegexLinkExtractor(allow=('google.com/[A-Z][a-zA-Z_/]+$', ), ),
             'parse_category',
             follow=True,
        ),
    )

    def parse_category(self, response):
        pass

SPIDER = GoogleDirectorySpider()
You can try crawling with this little code by running::
./scrapy-ctl.py crawl google.com
and it will actually work, although it won't do any parsing, since parse_category doesn't do anything yet; that's exactly what we're going to fix now.
As you can see on any page containing links to websites in the directory (e.g. http://www.google.com/Top/Arts/Awards/), those links are preceded by a
ranking bar. That makes a nice reference point when selecting an area with an XPath expression.
So, a possible *parse_category* could be::
def parse_category(self, response):
    items = []
    hxs = HtmlXPathSelector(response)
    links = hxs.x('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td//a')
    for link in links:
        item = ScrapedItem()
        item.set_attrib_adaptors('name', [adaptors.extract, adaptors.Delist('')])
        item.set_attrib_adaptors('url', [adaptors.extract, adaptors.Delist('')])
        item.attribute('name', link.x('text()'))
        item.attribute('url', link.x('@href'))
        items.append(item)
    return items
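A couple of notes on that snippet: the XPath expression selects the table cells that contain a "#pagerank" link and then takes every anchor in the following sibling cell, which is where the actual website links live. The adaptor chain attached to each attribute (extract followed by Delist('')) is meant to turn the selected nodes into text and collapse the resulting list into a single string; keep in mind that the adaptors module must be imported in your spider for this to run, and its exact import path depends on your Scrapy version. As a purely illustrative sketch of what that chain does (these helpers are stand-ins, not Scrapy's actual adaptors)::
# Stand-in helpers, only to illustrate the adaptor chain above; they are not Scrapy APIs.
def extract(selected):
    # turn each selected node into its string content
    return [str(node) for node in selected]

def delist(values, delimiter=''):
    # collapse the list of strings into a single string
    return delimiter.join(values)

print(delist(extract(['Python', ' Home Page'])))  # -> 'Python Home Page'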