Updated scrapy tutorial

--HG--
extra : convert_revision : svn:b85faa78-f9eb-468e-a121-7cced6da292c@550

parent 867ee20c66
commit c7332cd372

sites/scrapy.org/docs/sources/tutorial/tutorial3.rst
@@ -7,3 +7,4 @@ Tutorial
    tutorial1
    tutorial2
    tutorial3
+   tutorial4
@@ -32,11 +32,11 @@ Anyway, having said that, a possible *parse_category* could be::
     def parse_category(self, response):
         items = []  # The item (links to websites) list we're going to return
         hxs = HtmlXPathSelector(response)  # The selector we're going to use in order to extract data from the page
-        links = hxs.x('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td//a')
+        links = hxs.x('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')
 
         for link in links:
             item = ScrapedItem()
-            adaptor_pipe = [adaptors.extract, adaptors.Delist('')]
+            adaptor_pipe = [adaptors.extract, adaptors.Delist(''), adaptors.strip]
             item.set_adaptors({
                 'name': adaptor_pipe,
                 'url': adaptor_pipe,
@@ -62,6 +62,7 @@ Anyway, having said that, a possible *parse_category* could be::
 * An extractor (*extract*), which, as you may imagine, extracts the data from the XPath nodes you provide and returns it in a list.
 * *Delist*, which joins the list that the previous adaptor returned into a string.
   This adaptor itself is a class, because you must specify which delimiter will join the list; that's why we put an instance of this adaptor in the list.
+* *strip*, which (as you may imagine) does the same as Python's string strip method: it cleans up extra spaces before and after the provided string.
 
 In this case, we used the same adaptors for every attribute, because we're practically doing nothing to the data, just extracting it. But there might be situations where certain attributes
 are handled differently than others (in fact, it *will* happen once you scrape more complicated sites with more complicated data).
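
To build intuition for what such an adaptor pipeline does, here is a minimal plain-Python sketch. This is not Scrapy's actual adaptor machinery; *delist* and *run_adaptors* below are illustrative stand-ins that mimic the joining and stripping steps on an already-extracted list of strings::

    # Minimal illustration of the adaptor idea -- not Scrapy's real code.
    # Each adaptor takes the previous adaptor's output and returns a new value.

    def delist(delimiter):
        """Return an adaptor that joins a list of strings with *delimiter*."""
        def join(value):
            return delimiter.join(value)
        return join

    def run_adaptors(value, pipeline):
        """Run *value* through every adaptor in *pipeline*, in order."""
        for adaptor in pipeline:
            value = adaptor(value)
        return value

    # Pretend the extractor already produced these text fragments for one field:
    extracted = ['  Arts ', '> Movies  ']
    pipeline = [delist(''), str.strip]
    print(run_adaptors(extracted, pipeline))  # prints: Arts > Movies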
@@ -70,3 +71,4 @@ are handled differently than others (in fact, it *will* happen once you scrape more complicated sites with more complicated data).
 The rest of the code is quite self-explanatory. The *attribute* method sets the item's attributes, and the items themselves are put into a list that we'll return to Scrapy's engine.
 One simple (although important) thing to remember here is that whatever you return -- items, requests, or both -- must always be inside a list.
 
+So, we're almost done! Let's now check the last part of the tutorial: :ref:`tutorial4`

sites/scrapy.org/docs/sources/tutorial/tutorial4.rst (new file, 63 lines)
@@ -0,0 +1,63 @@

.. _tutorial4:

=================
Finishing the job
=================

| Well, we've got our project, our spider, and our scraped items. What should we do next?
| It actually depends on what you want to do with the scraped data.
| In this case, we'll imagine that we want to save this data, either to store it in a database later or simply to keep it around.
| To make it simple, we'll export the scraped items to a CSV file by making use of a handy function that Scrapy provides: *items_to_csv*.
This simple function takes a file descriptor/filename and a list of items, and writes their attributes to that file in CSV format.
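
As a rough mental model of what such a helper does, a sketch could look like the following. This is not Scrapy's actual *items_to_csv* implementation; the field names and the attribute access it uses are assumptions made for the example::

    # Illustrative sketch only -- NOT Scrapy's real items_to_csv.
    import csv

    def items_to_csv_sketch(file_or_name, items, fields=('name', 'url', 'description')):
        """Write one CSV row per item, with one column per field (illustrative only)."""
        out = open(file_or_name, 'w') if isinstance(file_or_name, str) else file_or_name
        writer = csv.writer(out)
        for item in items:
            # Assumes scraped values are reachable as plain attributes on the item.
            writer.writerow([getattr(item, field, '') for field in fields])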

Let's see how our spider would end up looking after applying this change::

    # -*- coding: utf8 -*-
    from scrapy.xpath import HtmlXPathSelector
    from scrapy.item import ScrapedItem
    from scrapy.contrib import adaptors
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.link.extractors import RegexLinkExtractor
    from scrapy.utils.misc import items_to_csv


    class GoogleDirectorySpider(CrawlSpider):
        domain_name = 'google.com'
        start_urls = ['http://www.google.com/dirhp']

        rules = (
            Rule(RegexLinkExtractor(allow=('google.com/[A-Z][a-zA-Z_/]+$', ), ),
                 'parse_category',
                 follow=True,
            ),
        )
        csv_file = open('scraped_items.csv', 'w')  # file the scraped items will be written to

        def parse_category(self, response):
            items = []  # The item (links to websites) list we're going to return
            hxs = HtmlXPathSelector(response)  # The selector we're going to use in order to extract data from the page
            links = hxs.x('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')

            for link in links:
                item = ScrapedItem()
                adaptor_pipe = [adaptors.extract, adaptors.Delist(''), adaptors.strip]
                item.set_adaptors({
                    'name': adaptor_pipe,
                    'url': adaptor_pipe,
                    'description': adaptor_pipe,
                })

                item.attribute('name', link.x('a/text()'))
                item.attribute('url', link.x('a/@href'))
                item.attribute('description', link.x('font[2]/text()'))
                items.append(item)

            items_to_csv(self.csv_file, items)  # append this page's items to the CSV file
            return items


    SPIDER = GoogleDirectorySpider()

| With this code, our spider will crawl Google's directory and save each link's name, description, and URL to a file called 'scraped_items.csv'. Cool, huh?

This is the end of the tutorial. If you'd like to know more about Scrapy and its use, please read the rest of the documentation.
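
If you'd like to sanity-check the exported file, one quick way is to read it back with Python's standard *csv* module. This is just an illustrative snippet; the exact column layout depends on how *items_to_csv* writes the attributes::

    import csv

    # The column order is an assumption; inspect the first rows of your own
    # scraped_items.csv to see how the attributes were actually written.
    with open('scraped_items.csv') as csv_file:
        for row in csv.reader(csv_file):
            print(row)  # one scraped link (name, url, description) per row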