.. _tutorial4:

=================
Finishing the job
=================

Well, we've got our project, our spider, and our scraped items. What should we do next?
That actually depends on what you want to do with the scraped data.

In this case, we'll imagine that we want to save this data so that we can store it in a
database later, or simply keep it around. To keep things simple, we'll export the scraped
items to a CSV file by making use of a handy function that Scrapy provides: *items_to_csv*.
This simple function takes a file object (or filename) and a list of items, and writes their
attributes to that file in CSV format.

Let's see how our spider would end up looking after applying this change::

    # -*- coding: utf8 -*-
    from scrapy.xpath import HtmlXPathSelector
    from scrapy.item import ScrapedItem
    from scrapy.contrib import adaptors
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.link.extractors import RegexLinkExtractor
    from scrapy.utils.misc import items_to_csv

    class GoogleDirectorySpider(CrawlSpider):
        domain_name = 'google.com'
        start_urls = ['http://www.google.com/dirhp']

        rules = (
            Rule(RegexLinkExtractor(allow=('google.com/[A-Z][a-zA-Z_/]+$', ), ),
                 'parse_category',
                 follow=True,
            ),
        )

        csv_file = open('scraped_items.csv', 'w')

        def parse_category(self, response):
            items = []  # The item (links to websites) list we're going to return
            hxs = HtmlXPathSelector(response)  # The selector we're going to use to extract data from the page
            links = hxs.x('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')

            for link in links:
                item = ScrapedItem()
                adaptor_pipe = [adaptors.extract, adaptors.Delist(''), adaptors.strip]
                item.set_adaptors({
                    'name': adaptor_pipe,
                    'url': adaptor_pipe,
                    'description': adaptor_pipe,
                })

                item.attribute('name', link.x('a/text()'))
                item.attribute('url', link.x('a/@href'))
                item.attribute('description', link.x('font[2]/text()'))
                items.append(item)

            items_to_csv(self.csv_file, items)
            return items

    SPIDER = GoogleDirectorySpider()

With this code, our spider will crawl over Google's directory and save each link's name, URL,
and description to a file called 'scraped_items.csv'. Cool, huh?
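
Remember that the idea was to be able to store this data in a database later. That part is
outside the scope of Scrapy itself, but just as a rough sketch of where you could go from
here, the resulting file can be read back with nothing more than the Python standard library
and loaded into, say, a SQLite table. The table layout, and the assumption that each row
holds exactly the name, URL and description columns in that order, are ours for the sake of
the example; they are not something Scrapy dictates::

    import csv
    import sqlite3

    # This snippet is not part of Scrapy; it only uses the Python standard library.
    # It assumes the spider has already run and left 'scraped_items.csv' in the
    # current directory, with one scraped link per row (name, url, description).
    connection = sqlite3.connect('scraped_items.db')
    connection.execute(
        'CREATE TABLE IF NOT EXISTS links (name TEXT, url TEXT, description TEXT)')

    csv_file = open('scraped_items.csv')
    for row in csv.reader(csv_file):
        if len(row) == 3:  # ignore blank or malformed rows
            connection.execute('INSERT INTO links VALUES (?, ?, ?)', row)
    csv_file.close()

    connection.commit()
    connection.close()

Of course, once the items sit in a plain CSV file you can feed them to whatever tool you
prefer; the snippet above is just one possibility.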
This is the end of the tutorial. If you'd like to know more about Scrapy and its use, please read the rest of the documentation.