Updated scrapy tutorial
--HG-- extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40550
commit c7332cd372 (parent: 867ee20c66)
@@ -7,3 +7,4 @@ Tutorial
    tutorial1
    tutorial2
    tutorial3
+   tutorial4
@@ -32,11 +32,11 @@ Anyway, having said that, a possible *parse_category* could be::
     def parse_category(self, response):
         items = [] # The item (links to websites) list we're going to return
         hxs = HtmlXPathSelector(response) # The selector we're going to use in order to extract data from the page
-        links = hxs.x('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td//a')
+        links = hxs.x('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')

         for link in links:
             item = ScrapedItem()
-            adaptor_pipe = [adaptors.extract, adaptors.Delist('')]
+            adaptor_pipe = [adaptors.extract, adaptors.Delist(''), adaptors.strip]
             item.set_adaptors({
                 'name': adaptor_pipe,
                 'url': adaptor_pipe,
@@ -62,6 +62,7 @@ Anyway, having said that, a possible *parse_category* could be::
 * An extractor (*extract*), which, as you may imagine, extracts the data from the XPath nodes you provide, and returns it in a list.
 * *Delist*, which joins the list that the previous adaptor returned into a string.
   This adaptor itself is a class, because you must specify which delimiter will join the list. That's why we put an instance of this adaptor in the list.
+* *strip*, which (as you may imagine) does the same as Python's string ``strip`` method: it cleans up extra whitespace before and after the provided string.

 In this case, we used the same adaptors for every attribute, because we're practically doing nothing to the data, just extracting it. But there might be situations where certain attributes
 are handled differently than others (in fact, it *will* happen once you scrape more complicated sites with more complicated data).
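
To make the pipeline above a bit more concrete, here is a minimal, self-contained sketch of how such an adaptor chain processes one attribute. It deliberately does not import ``scrapy.contrib.adaptors``; the stand-in helpers below only mimic the behaviour described in the bullets (*extract* produces a list of strings, *Delist* joins it with a delimiter, *strip* trims surrounding whitespace), so every name in it is illustrative::

    # Conceptual stand-ins for the adaptors described above -- not the real
    # scrapy.contrib.adaptors implementations.
    def fake_extract(nodes):
        # Pretend the XPath "nodes" are already plain strings.
        return list(nodes)

    class FakeDelist:
        # Joins a list of strings into one string with the given delimiter.
        def __init__(self, delimiter):
            self.delimiter = delimiter
        def __call__(self, values):
            return self.delimiter.join(values)

    def fake_strip(value):
        # Trims whitespace before and after the string.
        return value.strip()

    adaptor_pipe = [fake_extract, FakeDelist(''), fake_strip]

    data = ['  Python ', 'Resources  ']  # what an extraction might return
    for adaptor in adaptor_pipe:
        data = adaptor(data)
    # data is now 'Python Resources': extracted, joined, then stripped.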
@@ -70,3 +71,4 @@ are handled different than others (in fact, it *will* happen once you scrape mor
 The rest of the code is quite self-explanatory. The *attribute* method sets the item's attributes, and the items themselves are put into a list that we'll return to Scrapy's engine.
 One simple (although important) thing to remember here is that you must always return a list, and that list may contain items, requests, or both.

+So, we're almost done! Let's now check the last part of the tutorial: :ref:`tutorial4`
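
As a quick illustration of that rule, a callback that wants to hand back both a scraped item and a follow-up request could end like the sketch below. This is only a hypothetical fragment: the ``Request`` import path and its ``url``/``callback`` arguments follow later Scrapy releases and may not match the snapshot this tutorial targets, and ``parse_other_page`` is an assumed callback name::

    from scrapy.http import Request        # assumed import path
    from scrapy.item import ScrapedItem

    def parse_category(self, response):
        item = ScrapedItem()
        # ... fill the item's attributes as shown above ...
        follow_up = Request(url='http://www.google.com/Top/Computers/',
                            callback=self.parse_other_page)  # hypothetical callback
        # Items, requests, or both -- but always wrapped in a single list.
        return [item, follow_up]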
sites/scrapy.org/docs/sources/tutorial/tutorial4.rst (new file, 63 lines)
@@ -0,0 +1,63 @@
.. _tutorial4:

=================
Finishing the job
=================

| Well, we've got our project, our spider, and our scraped items. What to do next?
| That actually depends on what you want to do with the scraped data.
| In this case, we'll imagine that we want to save this data so we can store it in a database later, or simply keep it around.
| To make it simple, we'll export the scraped items to a CSV file by making use of a handy function that Scrapy provides: *items_to_csv*.
This simple function takes a file descriptor (or a filename) and a list of items, and writes their attributes to that file in CSV format.
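
To give an idea of what such a helper boils down to, here is a small sketch in present-day Python built on the standard ``csv`` module. It is not the actual *items_to_csv* from ``scrapy.utils.misc``; the field list and the handling of an already-open file versus a filename are assumptions made purely for illustration::

    import csv

    def items_to_csv_sketch(file_or_name, items, fields=('name', 'url', 'description')):
        # Accept either an open file object or a filename, as described above.
        opened_here = isinstance(file_or_name, str)
        output = open(file_or_name, 'w', newline='') if opened_here else file_or_name
        writer = csv.writer(output)
        for item in items:
            # One CSV row per item, one column per attribute.
            writer.writerow([getattr(item, field, '') for field in fields])
        if opened_here:
            output.close()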

Let's see how our spider would end up looking after applying this change::

    # -*- coding: utf8 -*-
    from scrapy.xpath import HtmlXPathSelector
    from scrapy.item import ScrapedItem
    from scrapy.contrib import adaptors
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.link.extractors import RegexLinkExtractor
    from scrapy.utils.misc import items_to_csv

    class GoogleDirectorySpider(CrawlSpider):
        domain_name = 'google.com'
        start_urls = ['http://www.google.com/dirhp']

        rules = (
            Rule(RegexLinkExtractor(allow=('google.com/[A-Z][a-zA-Z_/]+$', ), ),
                 'parse_category',
                 follow=True,
            ),
        )
        csv_file = open('scraped_items.csv', 'w')

        def parse_category(self, response):
            items = [] # The item (links to websites) list we're going to return
            hxs = HtmlXPathSelector(response) # The selector we're going to use in order to extract data from the page
            links = hxs.x('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')

            for link in links:
                item = ScrapedItem()
                adaptor_pipe = [adaptors.extract, adaptors.Delist(''), adaptors.strip]
                item.set_adaptors({
                    'name': adaptor_pipe,
                    'url': adaptor_pipe,
                    'description': adaptor_pipe,
                })

                item.attribute('name', link.x('a/text()'))
                item.attribute('url', link.x('a/@href'))
                item.attribute('description', link.x('font[2]/text()'))
                items.append(item)

            items_to_csv(self.csv_file, items)
            return items

    SPIDER = GoogleDirectorySpider()

| With this code, our spider will crawl over Google's directory and save each link's name, description, and URL to a file called 'scraped_items.csv'.
Cool, huh?
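
If you want to sanity-check that file afterwards, a tiny read-back snippet such as this one will do; it only assumes the file exists and contains plain comma-separated rows (whether *items_to_csv* writes a header row is not specified here, so none is assumed)::

    import csv

    with open('scraped_items.csv') as csv_file:
        for row in csv.reader(csv_file):
            print(row)  # e.g. one [name, url, description] row per scraped link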

This is the end of the tutorial. If you'd like to know more about Scrapy and its use, please read the rest of the documentation.