Modified Scrapy tutorial
Setting everything up
=====================

In this tutorial, we'll teach you how to scrape http://www.google.com/dirhp, Google's web directory.
We'll assume that you've checked and fulfilled the requirements specified in the Download page, and that Scrapy is already installed on your system.

To start a new project, enter the directory where you'd like your project to be located, and run::

    scrapy-admin.py startproject google

As long as Scrapy is properly installed and on your path, this should create a directory called "google"
containing the following files:

* *scrapy-ctl.py* - the project's control script. It's used for running the different tasks (like "genspider", "crawl" and "parse"). We'll talk more about this later.
* *scrapy_settings.py* - the project's settings file.
* *items.py* - where you define the different kinds of items you're going to scrape (a rough sketch follows this list).
* *spiders* - directory where you'll later place your spiders.
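
To give a rough idea of what goes into *items.py*, here is a minimal, hypothetical sketch; it simply subclasses the ScrapedItem class that the spider code later in this tutorial imports from scrapy.item. The file actually generated by *startproject* may look different::

    # items.py -- a hypothetical sketch, not necessarily the generated file
    from scrapy.item import ScrapedItem


    class GoogleDirectoryItem(ScrapedItem):
        """A website entry scraped from the directory (e.g. name and url attributes)."""
        pass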

Our first spider
================

OK, the time to write our first spider has come.
Make sure that you're inside your project's directory and run::

    ./scrapy-ctl.py genspider google_directory google.com

This should create a file called google_directory.py under the *spiders* directory, looking like this::

    # -*- coding: utf8 -*-
    import re

    from scrapy.xpath import HtmlXPathSelector
    from scrapy.item import ScrapedItem
    from scrapy.link.extractors import RegexLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule


    class GoogleDirectorySpider(CrawlSpider):
        domain_name = 'google.com'
        start_urls = ['http://www.google.com/']

        rules = (
            Rule(RegexLinkExtractor(allow=(r'Items/', )), 'parse_item', follow=True),
        )

        def parse_item(self, response):
            xs = HtmlXPathSelector(response)
            i = ScrapedItem()
            #i.attribute('site_id', xs.x('//input[@id="sid"]/@value'))
            #i.attribute('name', xs.x('//div[@id="name"]'))
            #i.attribute('description', xs.x('//div[@id="description"]'))
            return [i]

    SPIDER = GoogleDirectorySpider()

Now, let's explain a bit what this is all about.
As you may have noticed, the class that represents the spider is GoogleDirectorySpider, and it inherits from CrawlSpider.
This means that the spider will crawl a website following a set of crawling rules, parsing the responses you need according to the patterns you define through the "rules" class attribute.
This attribute is nothing more than a tuple of Rule objects. Each Rule defines a specific behaviour the spider will follow while crawling the site.
Rule objects accept the following parameters (the ones between [ ] are optional); a small example follows the list:

* *link_extractor* - a LinkExtractor instance, which defines the crawling patterns for this Rule.
* *[callback]* - a callback to be called for each link extracted that matches the previous link extractor.
* *[cb_kwargs]* - a dictionary of keyword arguments to be passed to the provided callback.
* *[follow]* - a boolean that determines whether or not links will be extracted from responses matching this Rule.
* *[process_links]* - an optional callback for processing the extracted links.
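
Purely as an illustration of that signature, a Rule using every parameter might look like the following; the extractor pattern and the 'parse_item' callback come from the generated spider above, while the cb_kwargs value and the 'filter_item_links' method name are made-up placeholders::

    rules = (
        Rule(RegexLinkExtractor(allow=(r'Items/', )),   # link_extractor
             'parse_item',                              # [callback]
             cb_kwargs={'section': 'items'},            # [cb_kwargs] (placeholder)
             follow=True,                               # [follow]
             process_links='filter_item_links'),        # [process_links] (placeholder)
    )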

In this case, the Rule would mean something like "search for any url containing the string 'Items/', parse it with the 'parse_item' method, and try to extract more links from it".
Now, that's just an example, so we must write our own Rule for our own spider.
But before that, we must set our start_urls to our real entry point (which is not actually Google's homepage).

So we replace that line with::

    start_urls = ['http://www.google.com/dirhp']

Now it's time to surf that page and see how we can extract data from it.
For this task it's almost mandatory to have the Firefox Firebug extension, which lets you browse through HTML markup in an easy and comfortable way. Otherwise you'd have to search for tags manually through the body, which can be *very* tedious.

[IMG]

What we see at first sight is that the directory is divided into categories, which are in turn divided into subcategories.
However, it seems as if there are more subcategories than the ones shown on this page, so we'll keep looking...

[IMG]

Hmmkay... The only new thing here is that there are lots of subcategories; let's see what's inside them...

[IMG]

Right, this looks more interesting. Not only do the subcategories have more subcategories of their own, but they also have links to websites (which is in fact the purpose of the directory).
Now, there's basically one thing to take into account about the above: apparently, category urls are always of the form
*http://www.google.com/Category/Subcategory/Another_Subcategory* (which is not very distinctive, actually, but usable).

So, having said that, a possible rule set for the categories could be::

    rules = (
        Rule(RegexLinkExtractor(allow=('google.com/[A-Z][a-zA-Z_/]+$', ), ),
             'parse_category',
             follow=True,
        ),
    )

Basically, we told our Rule object to extract links whose url contains the string 'google.com/' followed by a capital letter and then any sequence of letters, underscores or slashes, up to the end of the url.
Also, we set our callback 'parse_category' for each of those crawled links, and decided to keep extracting more links from them with follow=True.
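
To make that pattern more concrete, here's a quick check with Python's re module, assuming the allow patterns are applied to each url as a plain regular expression search (both example urls appear elsewhere in this tutorial)::

    >>> import re
    >>> category_url = re.compile(r'google.com/[A-Z][a-zA-Z_/]+$')
    >>> bool(category_url.search('http://www.google.com/Top/Arts/Awards/'))
    True
    >>> bool(category_url.search('http://www.google.com/dirhp'))
    False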

Until now, our spider would look something like::

    # -*- coding: utf8 -*-
    from scrapy.xpath import HtmlXPathSelector
    from scrapy.item import ScrapedItem
    from scrapy.link.extractors import RegexLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule


    class GoogleDirectorySpider(CrawlSpider):
        domain_name = 'google.com'
        start_urls = ['http://www.google.com/dirhp']

        rules = (
            Rule(RegexLinkExtractor(allow=('google.com/[A-Z][a-zA-Z_/]+$', ), ),
                 'parse_category',
                 follow=True,
            ),
        )

        def parse_category(self, response):
            pass

    SPIDER = GoogleDirectorySpider()

You can try crawling with this little code by running::

    ./scrapy-ctl.py crawl google.com

and it will actually work, although it won't do any parsing, since parse_category is not implemented yet; that's exactly what we're going to do now.

As you can see in any page containing links to websites in the directory (e.g. http://www.google.com/Top/Arts/Awards/), those links are preceded by a ranking bar. That can be a nice reference when selecting an area with an XPath expression.
So, a possible *parse_category* could be::

    def parse_category(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        # Select every link inside the <td> that follows the <td> holding the
        # ranking bar (whose anchor href contains "#pagerank").
        links = hxs.x('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td//a')

        for link in links:
            item = ScrapedItem()
            # Attach adaptors so the selected values are extracted and delivered as
            # single values rather than lists ('adaptors' comes from Scrapy and is
            # assumed to be imported at module level).
            item.set_attrib_adaptors('name', [adaptors.extract, adaptors.Delist('')])
            item.set_attrib_adaptors('url', [adaptors.extract, adaptors.Delist('')])
            item.attribute('name', link.x('text()'))
            item.attribute('url', link.x('@href'))
            items.append(item)

        return items
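
With parse_category in place, you can re-run the crawl from before, and the spider should now return ScrapedItem objects carrying a name and a url for each website link found in the crawled category pages (how those items are reported depends on your Scrapy setup)::

    ./scrapy-ctl.py crawl google.com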