2009-04-03 01:33:52 +00:00
|
|
|
.. _ref-selectors:
|
|
|
|
|
2009-04-10 05:35:53 +00:00
|
|
|
===================
|
|
|
|
XPath Selectors API
|
|
|
|
===================
|
2009-04-03 01:33:52 +00:00
|
|
|
|
2009-04-03 03:20:45 +00:00
|
|
|
.. module:: scrapy.xpath
|
|
|
|
:synopsis: XPath selectors classes
|
|
|
|
|
2009-04-03 01:33:52 +00:00
|
|
|
There are two types of selectors bundled with Scrapy:
|
|
|
|
:class:`HtmlXPathSelector` and :class:`XmlXPathSelector`. Both of them
|
|
|
|
implement the same :class:`XPathSelector` interface. The only different is that
|
|
|
|
one is used to process HTML data and the other XML data.
|
|
|
|
|
|
|
|
XPathSelector objects
|
|
|
|
=====================
|
|
|
|
|
|
|
|
.. class:: XPathSelector(response)
|
|
|
|
|
|
|
|
A :class:`XPathSelector` object is a wrapper over response to select
|
|
|
|
certain parts of its content.
|
|
|
|
|
|
|
|
A :class:`Request` object represents an HTTP request, which is usually
|
|
|
|
generated in the Spider and executed by the Downloader, and thus generating
|
|
|
|
a :class:`Response`.
|
|
|
|
|
|
|
|
``url`` is a :class:`~scrapy.http.Response` object that will be used for
|
|
|
|
selecting and extracting data
|
|
|
|
|
|
|
|
|
|
|
|
XPathSelector Methods
|
|
|
|
---------------------
|
|
|
|
|
|
|
|
.. method:: XPathSelector.x(xpath)
|
|
|
|
|
|
|
|
Apply the given XPath relative to this XPathSelector and return a list
|
|
|
|
of :class:`XPathSelector` objects (ie. a :class:`XPathSelectorList`) with
|
|
|
|
the result.
|
|
|
|
|
|
|
|
``xpath`` is a string containing the XPath to apply
|
|
|
|
|
|
|
|
.. method:: XPathSelector.re(regex)
|
|
|
|
|
|
|
|
Apply the given regex and return a list of unicode strings with the
|
|
|
|
matches.
|
|
|
|
|
|
|
|
``regex`` can be either a compiled regular expression or a string which
|
|
|
|
will be compiled to a regular expression using ``re.compile(regex)``
|
|
|
|
|
|
|
|
.. method:: XPathSelector.extract()
|
|
|
|
|
|
|
|
Return a unicode string with the content of this :class:`XPathSelector`
|
|
|
|
object.
|
|
|
|
|
|
|
|
.. method:: XPathSelector.extract_unquoted()
|
|
|
|
|
|
|
|
Return a unicode string with the content of this :class:`XPathSelector`
|
|
|
|
without entities or CDATA. This method is intended to be use for text-only
|
|
|
|
selectors, like ``//h1/text()`` (but not ``//h1``). If it's used for
|
|
|
|
:class:`XPathSelector` objects which don't select a textual content (ie. if
|
|
|
|
they contain tags), the output of this method is undefined.
|
|
|
|
|
|
|
|
.. method:: XPathSelector.register_namespace(prefix, uri)
|
|
|
|
|
|
|
|
Register the given namespace to be used in this :class:`XPathSelector`.
|
|
|
|
Without registering namespaces you can't select or extract data from
|
|
|
|
non-standard namespaces. See examples below.
|
|
|
|
|
|
|
|
.. method:: XPathSelector.__nonzero__()
|
|
|
|
|
|
|
|
Returns ``True`` if there is any real content selected by this
|
|
|
|
:class:`XPathSelector` or ``False`` otherwise. In other words, the boolean
|
|
|
|
value of an XPathSelector is given by the contents it selects.
|
|
|
|
|
|
|
|
XPathSelectorList objects
|
|
|
|
=========================
|
|
|
|
|
|
|
|
.. class:: XPathSelectorList
|
|
|
|
|
|
|
|
The :class:`XPathSelectorList` class is subclass of the builtin ``list``
|
|
|
|
class, which provides a few additional methods.
|
|
|
|
|
|
|
|
|
|
|
|
XPathSelectorList Methods
|
|
|
|
-------------------------
|
|
|
|
|
|
|
|
.. method:: XPathSelectorList.x(xpath)
|
|
|
|
|
|
|
|
Call the :meth:`XPathSelector.re` method for all :class:`XPathSelector`
|
|
|
|
objects in this list and return their results flattened, as new
|
|
|
|
:class:`XPathSelectorList`.
|
|
|
|
|
|
|
|
``xpath`` is the same argument as the one in :meth:`XPathSelector.x`
|
|
|
|
|
|
|
|
.. method:: XPathSelector.re(regex)
|
|
|
|
|
|
|
|
Call the :meth:`XPathSelector.re` method for all :class:`XPathSelector`
|
|
|
|
objects in this list and return their results flattened, as a list of
|
|
|
|
unicode strings.
|
|
|
|
|
|
|
|
``regex`` is the same argument as the one in :meth:`XPathSelector.re`
|
|
|
|
|
|
|
|
.. method:: XPathSelector.extract()
|
|
|
|
|
|
|
|
Call the :meth:`XPathSelector.re` method for all :class:`XPathSelector`
|
|
|
|
objects in this list and return their results flattened, as a list of
|
|
|
|
unicode strings.
|
|
|
|
|
|
|
|
.. method:: XPathSelector.extract_unquoted()
|
|
|
|
|
|
|
|
Call the :meth:`XPathSelector.extract_unoquoted` method for all
|
|
|
|
:class:`XPathSelector` objects in this list and return their results
|
|
|
|
flattened, as a list of unicode strings. This method should not be applied
|
|
|
|
to all kinds of XPathSelectors. For more info see
|
|
|
|
:meth:`XPathSelector.extract_unoquoted`.
|
|
|
|
|
|
|
|
HtmlXPathSelector objects
|
|
|
|
=========================
|
|
|
|
|
|
|
|
.. class:: HtmlXPathSelector(response)
|
|
|
|
|
|
|
|
A subclass of :class:`XPathSelector` for working with HTML content. It uses
|
|
|
|
the `libxml2`_ HTML parser. See the :class:`XPathSelector` API for more info.
|
|
|
|
|
|
|
|
.. _libxml2: http://xmlsoft.org/
|
|
|
|
|
|
|
|
HtmlXPathSelector examples
|
|
|
|
--------------------------
|
|
|
|
|
|
|
|
Here's a couple of :class:`HtmlXPathSelector` examples to illustrate several
|
|
|
|
concepts. In all cases we assume there is already a :class:`HtmlPathSelector`
|
|
|
|
instanced with a :class:`~scrapy.http.Response` object like this::
|
|
|
|
|
|
|
|
x = HtmlXPathSelector(html_response)
|
|
|
|
|
|
|
|
1. Select all ``<h1>`` elements from a HTML response body, returning a list of
|
|
|
|
:class:`XPathSelector` objects (ie. a :class:`XPathSelectorList` object)::
|
|
|
|
|
|
|
|
x.x("//h1")
|
|
|
|
|
|
|
|
2. Extract the text of all ``<h1>`` elements from a HTML response body,
|
|
|
|
returning a list of unicode strings::
|
|
|
|
|
|
|
|
x.x("//h1").extract() # this includes the h1 tag
|
|
|
|
x.x("//h1/text()").extract() # this excludes the h1 tag
|
|
|
|
|
|
|
|
3. Iterate over all ``<p>`` tags and print their class attribute::
|
|
|
|
|
|
|
|
for node in x.x("//p"):
|
|
|
|
... print node.x("@href")
|
|
|
|
|
|
|
|
4. Extract textual data from all ``<p>`` tags without entities, as a list of
|
|
|
|
unicode strings::
|
|
|
|
|
|
|
|
x.x("//p/text()").extract_unquoted()
|
|
|
|
|
|
|
|
# the following line is wrong. extract_unquoted() should only be used
|
|
|
|
# with textual XPathSelectors
|
|
|
|
x.x("//p").extract_unquoted() # it may work but output is unpredictable
|
|
|
|
|
|
|
|
XmlXPathSelector objects
|
|
|
|
========================
|
|
|
|
|
|
|
|
.. class:: XmlXPathSelector(response)
|
|
|
|
|
|
|
|
A subclass of :class:`XPathSelector` for working with XML content. It uses
|
|
|
|
the `libxml2`_ XML parser. See the :class:`XPathSelector` API for more info.
|
|
|
|
|
|
|
|
XmlXPathSelector examples
|
|
|
|
-------------------------
|
|
|
|
|
|
|
|
Here's a couple of :class:`XmlXPathSelector` examples to illustrate several
|
|
|
|
concepts. In all cases we assume there is already a :class:`XmlPathSelector`
|
|
|
|
instanced with a :class:`~scrapy.http.Response` object like this::
|
|
|
|
|
|
|
|
x = HtmlXPathSelector(xml_response)
|
|
|
|
|
|
|
|
1. Select all ``<product>`` elements from a XML response body, returning a list of
|
|
|
|
:class:`XPathSelector` objects (ie. a :class:`XPathSelectorList` object)::
|
|
|
|
|
|
|
|
x.x("//h1")
|
|
|
|
|
|
|
|
2. Extract all prices from a `Google Base XML feed`_ which requires registering
|
|
|
|
a namespace::
|
|
|
|
|
|
|
|
x.register_namespace("g", "http://base.google.com/ns/1.0")
|
|
|
|
x.x("//g:price").extract()
|
|
|
|
|
|
|
|
.. _Google Base XML feed: http://base.google.com/support/bin/answer.py?hl=en&answer=59461
|