.. _topics-selectors:

=========
Selectors
=========

When you're scraping web pages, the most common task you need to perform is
to extract data from the HTML source. There are several libraries available to
achieve this:

* `BeautifulSoup`_ is a very popular screen scraping library among Python
  programmers which constructs a Python object based on the structure of the
  HTML code and also deals with bad markup reasonably well, but it has one
  drawback: it's slow.

* `lxml`_ is an XML parsing library (which also parses HTML) with a pythonic
  API based on `ElementTree`_ (which is also part of the Python standard
  library).

Scrapy comes with its own mechanism for extracting data. They're called
selectors because they "select" certain parts of the HTML document specified
either by `XPath`_ or `CSS`_ expressions.

`XPath`_ is a language for selecting nodes in XML documents, which can also be
used with HTML. `CSS`_ is a language for applying styles to HTML documents. It
defines selectors to associate those styles with specific HTML elements.

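
XPath itself is independent of Scrapy; if you want to experiment with it
before touching selectors, Python's standard ``xml.etree.ElementTree`` module
understands a limited XPath subset. A minimal sketch (the markup below is made
up for illustration, and this is not Scrapy's selector API):

```python
import xml.etree.ElementTree as ET

# A small, hypothetical document (not the sample page used later in this page)
doc = ET.fromstring(
    "<html><head><title>Example website</title></head>"
    "<body><p class='intro'>hello</p></body></html>"
)

# './/title' walks the whole subtree looking for <title> elements,
# much like '//title' in full XPath
title = doc.find(".//title").text

# ElementTree also supports simple attribute predicates
paragraph = doc.find(".//p[@class='intro']").text
```

Note that ElementTree supports only a subset of XPath; Scrapy selectors use a
full XPath implementation.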

Both `lxml`_ and Scrapy Selectors are built over the `libxml2`_ library, which
means they're very similar in speed and parsing accuracy.

This page explains how selectors work and describes their API, which is very
small and simple, unlike the `lxml`_ API, which is much bigger because the
`lxml`_ library can be used for many other tasks besides selecting markup
documents.

For a complete reference of the selectors API see the :ref:`XPath selector
reference <topics-xpath-selectors-ref>` and the :ref:`CSS selector reference
<topics-css-selectors-ref>`.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _lxml: http://codespeak.net/lxml/
.. _ElementTree: http://docs.python.org/library/xml.etree.elementtree.html
.. _libxml2: http://xmlsoft.org/
.. _XPath: http://www.w3.org/TR/xpath
.. _CSS: http://www.w3.org/TR/selectors

Using selectors
===============

Constructing selectors
----------------------

There are four types of selectors bundled with Scrapy. Those are:


* :class:`~scrapy.selector.HtmlXPathSelector` - for working with HTML
  documents using XPath.

* :class:`~scrapy.selector.XmlXPathSelector` - for working with XML documents
  using XPath.

* :class:`~scrapy.selector.HtmlCSSSelector` - for working with HTML documents
  using CSS selectors.

* :class:`~scrapy.selector.XmlCSSSelector` - for working with XML documents
  using CSS selectors.

.. highlight:: python

All of them share the same selector API, and are constructed with a Response
object as their first parameter. This is the Response they're going to be
"selecting".

Example::

    hcs = HtmlCSSSelector(response)  # an HTML CSS selector
    xxs = XmlXPathSelector(response) # an XML XPath selector


Using selectors
---------------

To explain how to use the selectors we'll use the `Scrapy shell` (which
provides interactive testing) and an example page located in the Scrapy
documentation server:

    http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

.. _topics-selectors-htmlcode:

Here's its HTML code:

.. literalinclude:: ../_static/selectors-sample1.html
   :language: html

.. highlight:: sh

First, let's open the shell::

    scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Then, after the shell loads, you'll have some selectors already instantiated
and ready to use.


Since we're dealing with HTML, we can use either the
:class:`~scrapy.selector.HtmlXPathSelector` object which is found, by default,
in the ``hxs`` shell variable, or the equivalent
:class:`~scrapy.selector.HtmlCSSSelector` found in the ``hcs`` shell variable.

Note that CSS selectors can only select element nodes, while XPath selectors
can select any nodes, including text and comment nodes. There are some methods
to augment CSS selectors with XPath, as we'll see below.

.. highlight:: python

So, by looking at the :ref:`HTML code <topics-selectors-htmlcode>` of that
page, let's construct an XPath (using an HTML selector) for selecting the text
inside the title tag::

    >>> hxs.select('//title/text()')
    [<HtmlXPathSelector (text) xpath=//title/text()>]

As you can see, the ``select()`` method returns an
:class:`XPathSelectorList`, which is a list of new selectors. This API can be
used for quickly extracting nested data.

To actually extract the textual data, you must call the selector ``extract()``
method, as follows::

    >>> hxs.select('//title/text()').extract()
    [u'Example website']

Now notice that CSS selectors can't select text nodes. There are some methods
that allow enhancing CSS selectors, such as ``text`` and ``get``::

    >>> hcs.select('title').text()
    [<HtmlCSSSelector xpath='text()' data=u'Example website'>]
    >>> hcs.select('title').text().extract()
    [u'Example website']


Now we're going to get the base URL and some image links::

    >>> hxs.select('//base/@href').extract()
    [u'http://example.com/']

    >>> hcs.select('base').get('href').extract()
    [u'http://example.com/']

    >>> hxs.select('//a[contains(@href, "image")]/@href').extract()
    [u'image1.html',
     u'image2.html',
     u'image3.html',
     u'image4.html',
     u'image5.html']

    >>> hcs.select('a[href*=image]').get('href').extract()
    [u'image1.html',
     u'image2.html',
     u'image3.html',
     u'image4.html',
     u'image5.html']

    >>> hxs.select('//a[contains(@href, "image")]/img/@src').extract()
    [u'image1_thumb.jpg',
     u'image2_thumb.jpg',
     u'image3_thumb.jpg',
     u'image4_thumb.jpg',
     u'image5_thumb.jpg']

    >>> hcs.select('a[href*=image] img').get('src').extract()
    [u'image1_thumb.jpg',
     u'image2_thumb.jpg',
     u'image3_thumb.jpg',
     u'image4_thumb.jpg',
     u'image5_thumb.jpg']


.. _topics-selectors-nesting-selectors:

Nesting selectors
-----------------

The ``select()`` selector method returns a list of selectors of the same type
(XPath or CSS), so you can call ``select()`` on those selectors too.
Here's an example::

    >>> links = hxs.select('//a[contains(@href, "image")]')
    >>> links.extract()
    [u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
     u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
     u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
     u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
     u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

    >>> for index, link in enumerate(links):
    ...     args = (index, link.select('@href').extract(), link.select('img/@src').extract())
    ...     print 'Link number %d points to url %s and image %s' % args
    Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
    Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
    Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
    Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
    Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']


The :class:`CSSSelectorList` ``select`` method accepts CSS selectors, as
expected, but it also provides an ``xpath`` method that accepts XPath
expressions to augment the CSS selectors. Here's an example::

    >>> links = hcs.select('a[href*=image]')
    >>> for index, link in enumerate(links):
    ...     args = (index, link.get('href').extract(), link.xpath('img/@src').extract())
    ...     print 'Link number %d points to url %s and image %s' % args
    Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
    Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
    Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
    Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
    Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']


Using selectors with regular expressions
----------------------------------------

Selectors (both CSS and XPath) also have a ``re()`` method for extracting data
using regular expressions. However, unlike the ``select()`` method, the
``re()`` method does not return a list of
:class:`~scrapy.selector.XPathSelector` objects, so you can't construct nested
``.re()`` calls.

Here's an example used to extract image names from the :ref:`HTML code
<topics-selectors-htmlcode>` above::

    >>> hxs.select('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
    [u'My image 1',
     u'My image 2',
     u'My image 3',
     u'My image 4',
     u'My image 5']

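
Under the hood this is ordinary regular-expression matching on the selected
text. Conceptually (a rough sketch, not Scrapy's actual implementation), it
behaves like applying ``re.findall`` to each extracted string and flattening
the results:

```python
import re

# strings as extract() would return them (sample data from this page)
texts = ['Name: My image 1', 'Name: My image 2', 'Name: My image 3']

pattern = re.compile(r'Name:\s*(.*)')

# findall returns the captured group for each match; flatten across all texts
matches = [m for text in texts for m in pattern.findall(text)]
```

Since the result is a plain list of strings, there is nothing left to call
``select()`` or ``re()`` on, which is why nested ``.re()`` calls don't exist.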

.. _topics-selectors-relative-xpaths:

Working with relative XPaths
----------------------------

Keep in mind that if you are nesting XPathSelectors and use an XPath that
starts with ``/``, that XPath will be absolute to the document and not relative
to the ``XPathSelector`` you're calling it from.

For example, suppose you want to extract all ``<p>`` elements inside ``<div>``
elements. First, you would get all ``<div>`` elements::

    >>> divs = hxs.select('//div')

At first, you may be tempted to use the following approach, which is wrong, as
it actually extracts all ``<p>`` elements from the document, not only those
inside ``<div>`` elements::

    >>> for p in divs.select('//p'):  # this is wrong - gets all <p> from the whole document
    ...     print p.extract()

This is the proper way to do it (note the dot prefixing the ``.//p`` XPath)::

    >>> for p in divs.select('.//p'):  # extracts all <p> inside
    ...     print p.extract()

Another common case would be to extract all direct ``<p>`` children::

    >>> for p in divs.select('p'):
    ...     print p.extract()

For more details about relative XPaths see the `Location Paths`_ section of the
XPath specification.

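
The same absolute-versus-relative distinction can be tried outside Scrapy with
the standard ``xml.etree.ElementTree`` module (a sketch with made-up markup,
using ElementTree's limited XPath subset):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<body>"
    "<div><p>first, inside a div</p></div>"
    "<p>second, outside any div</p>"
    "</body>"
)

divs = doc.findall(".//div")

# './/p' is evaluated relative to each <div>, so only the nested
# paragraph is returned, not the document-level one
inside = [p.text for div in divs for p in div.findall(".//p")]
```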

.. _Location Paths: http://www.w3.org/TR/xpath#location-paths


.. _topics-selectors-ref:

Built-in Selectors reference
============================

.. module:: scrapy.selector
   :synopsis: Selectors classes

There are four types of selectors bundled with Scrapy:
:class:`HtmlXPathSelector` and :class:`XmlXPathSelector`,
:class:`HtmlCSSSelector` and :class:`XmlCSSSelector`. All of them implement the
same :class:`XPathSelector` interface. The only differences are the selector
syntax and whether they are used to process HTML or XML data.


.. _topics-xpath-selectors-ref:

XPathSelector objects
---------------------

.. class:: XPathSelector(response)

    A :class:`XPathSelector` object is a wrapper over a response, used to
    select certain parts of its content.

    ``response`` is a :class:`~scrapy.http.Response` object that will be used
    for selecting and extracting data.

    .. method:: select(xpath)

        Apply the given XPath relative to this XPathSelector and return a list
        of :class:`XPathSelector` objects (ie. a :class:`XPathSelectorList`)
        with the result.

        ``xpath`` is a string containing the XPath to apply.

    .. method:: re(regex)

        Apply the given regex and return a list of unicode strings with the
        matches.

        ``regex`` can be either a compiled regular expression or a string which
        will be compiled to a regular expression using ``re.compile(regex)``.

    .. method:: extract()

        Return a unicode string with the content of this :class:`XPathSelector`
        object.

    .. method:: register_namespace(prefix, uri)

        Register the given namespace to be used in this :class:`XPathSelector`.
        Without registering namespaces you can't select or extract data from
        non-standard namespaces. See the examples below.

    .. method:: remove_namespaces()

        Remove all namespaces, allowing you to traverse the document using
        namespace-less xpaths. See the example below.

    .. method:: __nonzero__()

        Returns ``True`` if there is any real content selected by this
        :class:`XPathSelector`, or ``False`` otherwise. In other words, the
        boolean value of an XPathSelector is given by the contents it selects.


XPathSelectorList objects
-------------------------

.. class:: XPathSelectorList

    The :class:`XPathSelectorList` class is a subclass of the builtin ``list``
    class, which provides a few additional methods.

    .. method:: select(xpath)

        Call the :meth:`XPathSelector.select` method for all
        :class:`XPathSelector` objects in this list and return their results
        flattened, as a new :class:`XPathSelectorList`.

        ``xpath`` is the same argument as the one in
        :meth:`XPathSelector.select`.

    .. method:: re(regex)

        Call the :meth:`XPathSelector.re` method for all :class:`XPathSelector`
        objects in this list and return their results flattened, as a list of
        unicode strings.

        ``regex`` is the same argument as the one in :meth:`XPathSelector.re`.

    .. method:: extract()

        Call the :meth:`XPathSelector.extract` method for all
        :class:`XPathSelector` objects in this list and return their results
        flattened, as a list of unicode strings.

    .. method:: extract_unquoted()

        Call the :meth:`XPathSelector.extract_unquoted` method for all
        :class:`XPathSelector` objects in this list and return their results
        flattened, as a list of unicode strings. This method should not be
        applied to all kinds of XPathSelectors. For more info see
        :meth:`XPathSelector.extract_unquoted`.


HtmlXPathSelector objects
-------------------------

.. class:: HtmlXPathSelector(response)

    A subclass of :class:`XPathSelector` for working with HTML content. It uses
    the `libxml2`_ HTML parser. See the :class:`XPathSelector` API for more
    info.

HtmlXPathSelector examples
~~~~~~~~~~~~~~~~~~~~~~~~~~

Here's a couple of :class:`HtmlXPathSelector` examples to illustrate several
concepts. In all cases, we assume there is already an
:class:`HtmlXPathSelector` instantiated with a :class:`~scrapy.http.Response`
object like this::

    x = HtmlXPathSelector(html_response)


1. Select all ``<h1>`` elements from an HTML response body, returning a list of
   :class:`XPathSelector` objects (ie. a :class:`XPathSelectorList` object)::

       x.select("//h1")

2. Extract the text of all ``<h1>`` elements from an HTML response body,
   returning a list of unicode strings::

       x.select("//h1").extract()         # this includes the h1 tag
       x.select("//h1/text()").extract()  # this excludes the h1 tag

3. Iterate over all ``<p>`` tags and print their class attribute::

       for node in x.select("//p"):
           print node.select("@class").extract()

4. Extract textual data from all ``<p>`` tags without entities, as a list of
   unicode strings::

       x.select("//p/text()").extract_unquoted()

       # the following line is wrong. extract_unquoted() should only be used
       # with textual XPathSelectors
       x.select("//p").extract_unquoted()  # it may work but output is unpredictable


XmlXPathSelector objects
------------------------

.. class:: XmlXPathSelector(response)

    A subclass of :class:`XPathSelector` for working with XML content. It uses
    the `libxml2`_ XML parser. See the :class:`XPathSelector` API for more
    info.

XmlXPathSelector examples
~~~~~~~~~~~~~~~~~~~~~~~~~

Here's a couple of :class:`XmlXPathSelector` examples to illustrate several
concepts. In both cases we assume there is already an :class:`XmlXPathSelector`
instantiated with a :class:`~scrapy.http.Response` object like this::

    x = XmlXPathSelector(xml_response)

1. Select all ``<product>`` elements from an XML response body, returning a
   list of :class:`XPathSelector` objects (ie. a :class:`XPathSelectorList`
   object)::

       x.select("//product")

2. Extract all prices from a `Google Base XML feed`_, which requires
   registering a namespace::

       x.register_namespace("g", "http://base.google.com/ns/1.0")
       x.select("//g:price").extract()


.. _removing-namespaces:

Removing namespaces
~~~~~~~~~~~~~~~~~~~

When dealing with scraping projects, it is often quite convenient to get rid
of namespaces altogether and just work with element names, in order to write
simpler, more convenient XPaths. You can use the
:meth:`XPathSelector.remove_namespaces` method for that.

Let's show an example that illustrates this with the GitHub blog atom feed.

First, we open the shell with the url we want to scrape::

    $ scrapy shell https://github.com/blog.atom

Once in the shell we can try selecting all ``<link>`` objects and see that it
doesn't work (because the Atom XML namespace is obfuscating those nodes)::

    >>> xxs.select("//link")
    []

But once we call the :meth:`XPathSelector.remove_namespaces` method, all
nodes can be accessed directly by their names::

    >>> xxs.remove_namespaces()
    >>> xxs.select("//link")
    [<XmlXPathSelector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
     <XmlXPathSelector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
     ...


If you wonder why the namespace removal procedure isn't called by default,
instead of requiring you to call it manually, this is because of two reasons
which, in order of relevance, are:

1. removing namespaces requires iterating over and modifying all nodes in the
   document, which is a reasonably expensive operation to perform for all
   documents crawled by Scrapy

2. there could be some cases where using namespaces is actually required, in
   case some element names clash between namespaces. These cases are very rare
   though.

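
The idea behind namespace removal can be sketched with the standard library:
each parsed tag carries its namespace as a ``{uri}`` prefix, and removing
namespaces amounts to rewriting every tag to its local name. This is a rough
analogue of ``remove_namespaces()``, not Scrapy's implementation:

```python
import xml.etree.ElementTree as ET

# a tiny Atom-like document with a default namespace
feed = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<link href="https://github.com/blog"/>'
    '</feed>'
)

before = feed.findall("link")  # [] -- the namespace hides the nodes

# strip the '{http://www.w3.org/2005/Atom}' prefix from every tag
for element in feed.iter():
    element.tag = element.tag.split('}', 1)[-1]

after = feed.findall("link")   # now the plain name matches
```

This also makes the cost argument above concrete: every node in the document
has to be visited and rewritten.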

.. _Google Base XML feed: http://base.google.com/support/bin/answer.py?hl=en&answer=59461

.. _topics-css-selectors-ref:

CSSSelector objects
-------------------

.. class:: CSSSelectorMixin(object)

    A :class:`CSSSelectorMixin` object is a mixin for either
    :class:`XmlXPathSelector` or :class:`HtmlXPathSelector` to select element
    nodes using CSS selector syntax. As a mixin, it is not meant to be used on
    its own, but as a secondary parent class. See :class:`XmlCSSSelector` and
    :class:`HtmlCSSSelector` for implementations.

    .. method:: select(css)

        Apply the given CSS selector relative to this CSSSelectorMixin and
        return a list of :class:`CSSSelectorMixin` objects (ie. a
        :class:`CSSSelectorList`) with the result.

        ``css`` is a string containing the CSS selector to apply.

    .. method:: xpath(xpath)

        Apply the given XPath relative to this CSSSelectorMixin and return a
        list of :class:`CSSSelectorMixin` objects (ie. a
        :class:`CSSSelectorList`) with the result.

        ``xpath`` is a string containing the XPath to apply.

    .. method:: get(attr)

        Get the given attribute relative to this CSSSelectorMixin and return
        a list of :class:`CSSSelectorMixin` objects (ie. a
        :class:`CSSSelectorList`) with the result (usually with one element
        only).

        ``attr`` is a string containing the attribute name to get.

    .. method:: text(all=False)

        Get the child text nodes relative to this CSSSelectorMixin or, if
        ``all`` is True, a string node concatenating all of the descendant
        text nodes, and return a list of :class:`CSSSelectorMixin` objects
        (ie. a :class:`CSSSelectorList`) with the result.

        ``all`` is a boolean to either select the child text nodes (False) or
        a string node concatenating all of the descendant text nodes (True).

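
The mixin arrangement itself is plain Python multiple inheritance: the CSS
front-end translates expressions and delegates to the XPath machinery it is
mixed into. A toy sketch of the pattern (all names here are invented for
illustration; this is not Scrapy's code, and real CSS-to-XPath translation is
far more involved):

```python
class XPathBackend(object):
    """Toy stand-in for an XPath-based selector class."""
    def select_xpath(self, xpath):
        # a real selector would evaluate the expression against a response
        return "xpath:%s" % xpath


class CSSFrontendMixin(object):
    """Translates a (trivial) CSS expression and delegates to XPath."""
    def select(self, css):
        # only a bare tag name is handled in this sketch
        return self.select_xpath("descendant-or-self::%s" % css)


class ToySelector(CSSFrontendMixin, XPathBackend):
    """Mixin first, backend second -- mirroring the subclassing described."""
    pass


result = ToySelector().select("h1")
```

The mixin alone is useless (it has no ``select_xpath``), which is why the
reference stresses it is only meant to be combined with an XPath selector
class.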

CSSSelectorList objects
-----------------------

.. class:: CSSSelectorList

    The :class:`CSSSelectorList` class is a subclass of
    :class:`XPathSelectorList` which overrides and adds methods to match those
    of :class:`CSSSelectorMixin`.

    .. method:: xpath(xpath)

        Call the :meth:`CSSSelectorMixin.xpath` method for all
        :class:`CSSSelectorMixin` objects in this list and return their
        results flattened, as a new :class:`CSSSelectorList`.

        ``xpath`` is the same argument as the one in
        :meth:`CSSSelectorMixin.xpath`.

    .. method:: get(attr)

        Call the :meth:`CSSSelectorMixin.get` method for all
        :class:`CSSSelectorMixin` objects in this list and return their
        results flattened, as a new :class:`CSSSelectorList`.

        ``attr`` is the same argument as the one in
        :meth:`CSSSelectorMixin.get`.

    .. method:: text(all=False)

        Call the :meth:`CSSSelectorMixin.text` method for all
        :class:`CSSSelectorMixin` objects in this list and return their
        results flattened, as a new :class:`CSSSelectorList`.

        ``all`` is the same argument as the one in
        :meth:`CSSSelectorMixin.text`.


HtmlCSSSelector objects
-----------------------

.. class:: HtmlCSSSelector(response)

    A subclass of :class:`CSSSelectorMixin` and :class:`HtmlXPathSelector` for
    working with HTML content using CSS selectors.

HtmlCSSSelector examples
~~~~~~~~~~~~~~~~~~~~~~~~

Here's a couple of :class:`HtmlCSSSelector` examples to illustrate several
concepts. In all cases, we assume there is already an :class:`HtmlCSSSelector`
instantiated with a :class:`~scrapy.http.Response` object like this::

    x = HtmlCSSSelector(html_response)

1. Select all ``<h1>`` elements from an HTML response body, returning a list
   of :class:`HtmlCSSSelector` objects (ie. a :class:`CSSSelectorList`
   object)::

       x.select("h1")

2. Extract the text of all ``<h1>`` elements from an HTML response body,
   returning a list of unicode strings::

       x.select("h1").extract()         # this includes the h1 tag
       x.select("h1").text().extract()  # this excludes the h1 tag

3. Iterate over all ``<p>`` tags and print their class attribute::

       for node in x.select("p"):
           print node.get("class").extract()


XmlCSSSelector objects
----------------------

.. class:: XmlCSSSelector(response)

    A subclass of :class:`CSSSelectorMixin` and :class:`XmlXPathSelector` for
    working with XML content using CSS selectors.

XmlCSSSelector examples
~~~~~~~~~~~~~~~~~~~~~~~

Here's a couple of :class:`XmlCSSSelector` examples to illustrate several
concepts. In both cases we assume there is already an :class:`XmlCSSSelector`
instantiated with a :class:`~scrapy.http.Response` object like this::

    x = XmlCSSSelector(xml_response)

1. Select all ``<product>`` elements from an XML response body, returning a
   list of :class:`XmlCSSSelector` objects (ie. a :class:`CSSSelectorList`
   object)::

       x.select("product")

2. Extract all prices from a `Google Base XML feed`_, which requires
   registering a namespace::

       x.register_namespace("g", "http://base.google.com/ns/1.0")
       x.xpath("//g:price").extract()
