
Updated scrapy tutorial

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40550
elpolilla 2008-12-26 14:03:45 +00:00
parent 867ee20c66
commit c7332cd372
3 changed files with 68 additions and 2 deletions

View File

@@ -7,3 +7,4 @@ Tutorial
tutorial1
tutorial2
tutorial3
tutorial4

View File

@@ -32,11 +32,11 @@ Anyway, having said that, a possible *parse_category* could be::
     def parse_category(self, response):
         items = [] # The item (links to websites) list we're going to return
         hxs = HtmlXPathSelector(response) # The selector we're going to use in order to extract data from the page
-        links = hxs.x('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td//a')
+        links = hxs.x('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')
         for link in links:
             item = ScrapedItem()
-            adaptor_pipe = [adaptors.extract, adaptors.Delist('')]
+            adaptor_pipe = [adaptors.extract, adaptors.Delist(''), adaptors.strip]
             item.set_adaptors({
                 'name': adaptor_pipe,
                 'url': adaptor_pipe,
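The change above swaps the trailing *//a* for */font*, so the loop receives the *font* node that wraps each listing instead of the bare link, and the name, url and description can all be read relative to it. Here is a toy sketch of the difference, using lxml and invented HTML that loosely mimics one row of the Google directory (this is only an illustration of what the two expressions match, not Scrapy code)::

    from lxml import html

    # Invented HTML, loosely mimicking one row of the Google directory listing.
    snippet = html.fromstring("""
    <table><tr>
      <td><a href="/support#pagerank">PageRank</a></td>
      <td><font>
            <a href="http://example.com/">Example Site</a>
            <font size="-1">-</font>
            <font size="-1">A short description.</font>
          </font></td>
    </tr></table>
    """)

    base = '//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td'

    old_links = snippet.xpath(base + '//a')    # old expression: the bare <a> elements
    new_links = snippet.xpath(base + '/font')  # new expression: the wrapping <font> node

    for link in new_links:
        print(link.xpath('a/text()'))          # ['Example Site']
        print(link.xpath('a/@href'))           # ['http://example.com/']
        print(link.xpath('font[2]/text()'))    # ['A short description.']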
@@ -62,6 +62,7 @@ Anyway, having said that, a possible *parse_category* could be::
* An extractor (*extract*), which, as you may imagine, extracts the data from the XPath nodes you provide, and returns it in a list.
* *Delist*, which joins the list that the previous adaptor returned into a single string.
  This adaptor is itself a class, because you must specify the delimiter that will join the list; that's why we put an instance of this adaptor in the pipeline.
* *strip*, which (as you may imagine) does the same as Python's string *strip* method: it cleans up extra whitespace before and after the provided string.

In this case we used the same adaptors for every attribute, because we're practically doing nothing to the data, just extracting it. But there might be situations where certain attributes
are handled differently from others (in fact, it *will* happen once you scrape more complicated sites with more complicated data).
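To make that pipeline concrete, here is a toy stand-in for the three adaptors described above; it is plain Python whose behaviour is inferred from the descriptions, not Scrapy's actual *adaptors* module::

    # A toy stand-in for the adaptor pipeline: each adaptor receives the
    # previous adaptor's output and returns the new value.

    def extract(nodes):
        """Pretend XPath extraction: turn the selected nodes into a list of strings."""
        return [str(node) for node in nodes]

    class Delist:
        """Join a list into a single string, using the delimiter given on creation."""
        def __init__(self, delimiter):
            self.delimiter = delimiter
        def __call__(self, values):
            return self.delimiter.join(values)

    def strip(value):
        """Remove leading/trailing whitespace, like str.strip()."""
        return value.strip()

    adaptor_pipe = [extract, Delist(''), strip]

    value = ['  Example Site  ']   # imagine this came from link.x('a/text()')
    for adaptor in adaptor_pipe:
        value = adaptor(value)
    print(repr(value))             # 'Example Site'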
@@ -70,3 +71,4 @@ are handled different than others (in fact, it *will* happen once you scrape mor
The rest of the code is quite self-explanatory. The *attribute* method sets the item's attributes, and the items themselves are put into a list that we'll return to Scrapy's engine.
One simple (although important) thing to remember here is that you must always return a list, and that list may contain items, requests, or both.
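For example, a callback that both scrapes an item and schedules another page would hand them back together in a single list. A minimal sketch (the *Request* import path and its *callback* argument are assumptions about this version of Scrapy)::

    from scrapy.http import Request      # assumed location of the Request class
    from scrapy.item import ScrapedItem

    def parse_category(self, response):
        item = ScrapedItem()
        # ... set adaptors and attributes as shown above ...
        next_page = Request(url='http://www.google.com/Top/Arts/',
                            callback=self.parse_category)
        return [item, next_page]   # items and requests together, always inside a list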
So, we're almost done! Let's now check the last part of the tutorial: :ref:`tutorial4`

View File

@@ -0,0 +1,63 @@
.. _tutorial4:

=================
Finishing the job
=================
| Well, we've got our project, our spider, and our scraped items. What to do next?
| It actually depends on what you want to do with the scraped data.
| In this case, we'll imagine that we want to save this data so we can store it in a database later, or simply keep it around.
| To keep things simple, we'll export the scraped items to a CSV file by using a handy function that Scrapy provides: *items_to_csv*.
This simple function takes a file object (or filename) and a list of items, and writes their attributes to that file in CSV format.
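The tutorial doesn't show how *items_to_csv* is implemented, but a rough approximation of what such a helper might do looks like this (the default column names and the filename handling are assumptions made for this sketch)::

    import csv

    def items_to_csv_sketch(file_or_name, items, fields=('name', 'url', 'description')):
        """Write one CSV row per item, one column per attribute (an approximation)."""
        opened_here = isinstance(file_or_name, str)
        csv_file = open(file_or_name, 'w', newline='') if opened_here else file_or_name
        try:
            writer = csv.writer(csv_file)
            for item in items:
                writer.writerow([getattr(item, field, '') for field in fields])
        finally:
            if opened_here:
                csv_file.close()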
Let's see how our spider would end up looking after applying this change::
    # -*- coding: utf8 -*-
    from scrapy.xpath import HtmlXPathSelector
    from scrapy.item import ScrapedItem
    from scrapy.contrib import adaptors
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.link.extractors import RegexLinkExtractor
    from scrapy.utils.misc import items_to_csv

    class GoogleDirectorySpider(CrawlSpider):
        domain_name = 'google.com'
        start_urls = ['http://www.google.com/dirhp']

        rules = (
            Rule(RegexLinkExtractor(allow=('google.com/[A-Z][a-zA-Z_/]+$', ), ),
                 'parse_category',
                 follow=True,
            ),
        )

        csv_file = open('scraped_items.csv', 'w')

        def parse_category(self, response):
            items = [] # The item (links to websites) list we're going to return
            hxs = HtmlXPathSelector(response) # The selector we're going to use in order to extract data from the page
            links = hxs.x('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')

            for link in links:
                item = ScrapedItem()
                adaptor_pipe = [adaptors.extract, adaptors.Delist(''), adaptors.strip]

                item.set_adaptors({
                    'name': adaptor_pipe,
                    'url': adaptor_pipe,
                    'description': adaptor_pipe,
                })

                item.attribute('name', link.x('a/text()'))
                item.attribute('url', link.x('a/@href'))
                item.attribute('description', link.x('font[2]/text()'))
                items.append(item)

            items_to_csv(self.csv_file, items)
            return items

    SPIDER = GoogleDirectorySpider()
| With this code, our spider will crawl over Google's directory, and save each link's name, description, and url to a file called 'scraped_items.csv'.
Cool, huh?
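If you want to double-check the result, the file can be read back with nothing but the standard library (the exact column layout depends on how *items_to_csv* writes the attributes, so the example output is only indicative)::

    import csv

    with open('scraped_items.csv') as csv_file:
        for row in csv.reader(csv_file):
            print(row)   # e.g. ['Example Site', 'http://example.com/', 'A short description.']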
This is the end of the tutorial. If you'd like to know more about Scrapy and its use, please read the rest of the documentation.