mirror of https://github.com/scrapy/scrapy.git synced 2025-02-24 18:24:00 +00:00

some improvements to selectors doc structure: added literalincludes for sample1.html (to avoid duplicating the content), renamed that file and moved it to _static (so it appears in the built docs), moved comments out of source code snippets and into the documentation text and split them, converted to the required '>>>' console format, and added proper highlighting hints

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40646
Pablo Hoffman 2009-01-05 02:49:23 +00:00
parent 7ca989a781
commit 5c9c82d055
2 changed files with 44 additions and 47 deletions


@@ -4,16 +4,20 @@
 Selectors
 =========
-Selectors are *the* way you have to extract information from documents. They retrieve information from the response's body, given an XPath or a Regular Expression that you provide.
+Selectors are the recommended way to extract information from documents. They retrieve information from the response's body, given an XPath or a Regular Expression that you provide.
+.. highlight:: python
 Currently there are two kinds of selectors, HtmlXPathSelectors and XmlXPathSelectors. Both work in the same way; they are first instantiated with a response, for example::
     hxs = HtmlXPathSelector(response) # an HTML selector
     xxs = XmlXPathSelector(response) # an XML selector
+.. highlight:: sh
 Now, before going on with selectors, I'd suggest you open a Scrapy shell, which you can use by calling your project manager with the 'shell' argument; something like::
-[user@host ~/myproject]$ ./scrapy-ctl.py shell <url>
+$ ./scrapy-ctl.py shell <url>
 Notice that you'll have to install IPython in order to use this feature, but believe me, it's worth it; the shell is **very** useful.
@@ -21,30 +25,22 @@ With the shell you can simulate parsing a webpage, either by calling "scrapy-ctl
 to retrieve the given url, and fills in the 'response' variable with the result.
 Ok, so now let's use the shell to show you a bit of how selectors work.
-We'll use an example page located in Scrapy's site (http://www.scrapy.org/docs/topics/sample1.htm), whose markup is::
-    <html>
-     <head>
-      <base href='http://example.com/' />
-      <title>Example website</title>
-     </head>
-     <body>
-      <div id='images'>
-       <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
-       <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
-       <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
-       <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
-       <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
-      </div>
-     </body>
-    </html>
+We'll use an example page located in Scrapy's site (here's a `direct link <../_static/selectors-sample1.html>`_ if you want to download it), whose markup is:
+.. literalinclude:: ../_static/selectors-sample1.html
+   :language: html
+.. highlight:: sh
 First, we open the shell::
-[user@host ~/myproject]$ ./scrapy-ctl.py shell 'http://www.scrapy.org/docs/topics/sample1.htm'
+$ ./scrapy-ctl.py shell 'http://www.scrapy.org/docs/topics/sample1.htm'
 Then, after the shell loads, you'll have some already-made objects for you to play with. Two of them, hxs and xxs, are selectors.
+.. highlight:: python
 You could instantiate your own by doing::
     from scrapy.xpath.selector import HtmlXPathSelector, XmlXPathSelector
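As an aside, the sample page happens to be well-formed XML, so if you want to poke at the markup without Scrapy or the shell, Python's standard library can parse it. A minimal sketch (the inline ``SAMPLE`` string is a trimmed, hand-copied version of the page, and ``xml.etree`` is a stand-in here, not Scrapy's own parsing machinery):

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-copied version of the sample page (assumed markup).
SAMPLE = """<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
  </div>
 </body>
</html>"""

root = ET.fromstring(SAMPLE)

# Rough equivalents of the XPath expressions //title/text() and //base/@href
title = root.find('head/title').text
base_url = root.find('head/base').get('href')
print(title)     # Example website
print(base_url)  # http://example.com/
```

Note that ElementTree only supports a small subset of XPath, so this only works for simple path queries like the ones shown here.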
@@ -55,41 +51,40 @@ Where 'response' is the object that Scrapy already created for you containing th
 But anyway, we'll stick to the selectors that Scrapy already made for us, and more specifically, the HtmlXPathSelector (since we're working with an HTML document right now).
-Let's try some expressions::
+Let's try extracting the title::
-    # The title
-    In [1]: hxs.x('//title/text()')
-    Out[1]: [<HtmlXPathSelector (text) xpath=//title/text()>]
-    # As you can see, the x method returns an XPathSelectorList, which is actually a list of selectors.
-    # To extract their data you must use the extract() method, as follows:
-    In [2]: hxs.x('//title/text()').extract()
-    Out[2]: [u'Example website']
+    >>> hxs.x('//title/text()')
+    [<HtmlXPathSelector (text) xpath=//title/text()>]
-    # The base url
-    In [3]: hxs.x('//base/@href').extract()
-    Out[3]: [u'http://example.com/']
+As you can see, the x method returns an XPathSelectorList, which is actually a list of selectors.
+To extract their data you must use the extract() method, as follows::
-    # Image links
-    In [4]: hxs.x('//a[contains(@href, "image")]/@href').extract()
-    Out[4]:
+    >>> hxs.x('//title/text()').extract()
+    [u'Example website']
+Now let's extract the base URL and some image links::
+    >>> hxs.x('//base/@href').extract()
+    [u'http://example.com/']
+    >>> hxs.x('//a[contains(@href, "image")]/@href').extract()
     [u'image1.html',
      u'image2.html',
      u'image3.html',
      u'image4.html',
      u'image5.html']
-    # Image thumbnails
-    In [5]: hxs.x('//a[contains(@href, "image")]/img/@src').extract()
-    Out[5]:
+    >>> hxs.x('//a[contains(@href, "image")]/img/@src').extract()
     [u'image1_thumb.jpg',
      u'image2_thumb.jpg',
      u'image3_thumb.jpg',
      u'image4_thumb.jpg',
      u'image5_thumb.jpg']
-    # Image names
-    In [6]: hxs.x('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
-    Out[6]:
+And here's an example which shows the `re()` method of XPath selectors, which
+allows you to use regular expressions to select parts::
+    >>> hxs.x('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
     [u'My image 1',
      u'My image 2',
      u'My image 3',
@@ -97,22 +92,24 @@ Let's try some expressions::
      u'My image 5']
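What the `re()` call above does can be emulated with plain Python's standard re module, which may help clarify its behavior. A sketch (the `texts` list is a hypothetical stand-in for the text nodes the selector would return):

```python
import re

# Hand-copied text nodes, standing in for the //a[...]/text() selection.
texts = ['Name: My image 1', 'Name: My image 2', 'something else']

# The pattern is applied to each selected text and the captured group is
# kept; texts that do not match are silently skipped.
pattern = re.compile(r'Name:\s*(.*)')
names = [m.group(1) for m in (pattern.search(t) for t in texts) if m]
print(names)  # ['My image 1', 'My image 2']
```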
-Ok, let's explain a bit.
-Selector's x() method, is intended to select a node or an attribute from the document, given an XPath expression, as you could see upwards.
-You can apply an x() call to any node you have, which means that you can join different calls, for example:::
+Now let's explain a bit what we just did.
-    In [10]: links = hxs.x('//a[contains(@href, "image")]')
+The selector's x() method is intended to select a node or an attribute from the
+document, given an XPath expression, as you saw above.
-    In [11]: links.extract()
-    Out[11]:
+You can apply an x() call to any node you have, which means that you can join
+different calls, for example::
+    >>> links = hxs.x('//a[contains(@href, "image")]')
+    >>> links.extract()
     [u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
      u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
      u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
      u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
      u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
-    In [12]: for index, link in enumerate(links):
-        print 'Link number %d points to url %s and image %s' % (index, link.x('@href').extract(), link.x('img/@src').extract())
+    >>> for index, link in enumerate(links):
+    ...     print 'Link number %d points to url %s and image %s' % (index, link.x('@href').extract(), link.x('img/@src').extract())
     Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
     Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
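To make that last loop concrete outside of the shell, here's a standalone emulation with the standard library, pairing each link's href with its thumbnail src (the inline markup is an assumed copy of two links from the sample page, and ElementTree stands in for the selector API):

```python
import xml.etree.ElementTree as ET

# Assumed inline copy of two links from the sample page.
SAMPLE = ("<div id='images'>"
          "<a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>"
          "<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>"
          "</div>")
root = ET.fromstring(SAMPLE)

# Like the shell loop above: each link node is itself queried for its
# href attribute and its child img's src attribute.
lines = []
for index, link in enumerate(root.iter('a')):
    lines.append('Link number %d points to url %s and image %s'
                 % (index, link.get('href'), link.find('img').get('src')))
for line in lines:
    print(line)
# Link number 0 points to url image1.html and image image1_thumb.jpg
# Link number 1 points to url image2.html and image image2_thumb.jpg
```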