
.. _ref-selectors:

===================
XPath Selectors API
===================

.. module:: scrapy.xpath
   :synopsis: XPath selectors classes

There are two types of selectors bundled with Scrapy:
:class:`HtmlXPathSelector` and :class:`XmlXPathSelector`. Both of them
implement the same :class:`XPathSelector` interface. The only difference is that
one is used to process HTML data and the other XML data.

XPathSelector objects
=====================

.. class:: XPathSelector(response)

    A :class:`XPathSelector` object is a wrapper over a response, used to
    select certain parts of its content.

    ``response`` is a :class:`~scrapy.http.Response` object that will be used
    for selecting and extracting data. It is typically the :class:`Response`
    generated when the Downloader executes a :class:`Request` issued by the
    Spider.
   

XPathSelector Methods
---------------------

.. method:: XPathSelector.x(xpath)

    Apply the given XPath relative to this XPathSelector and return a list
    of :class:`XPathSelector` objects (i.e. a :class:`XPathSelectorList`) with
    the result.

    ``xpath`` is a string containing the XPath to apply.

.. method:: XPathSelector.re(regex)

    Apply the given regex and return a list of unicode strings with the
    matches.

    ``regex`` can be either a compiled regular expression or a string which
    will be compiled to a regular expression using ``re.compile(regex)``
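
    As a rough illustration of how both argument forms can be accepted, the
    following sketch uses a hypothetical ``apply_regex`` helper over a plain
    string. It is not Scrapy's implementation, and the use of ``findall`` is an
    assumption made for the example::

        import re

        def apply_regex(regex, text):
            """Hypothetical helper: return all matches of ``regex`` in ``text``."""
            # A plain string is compiled first; a compiled pattern is used as-is.
            if isinstance(regex, str):
                regex = re.compile(regex)
            return regex.findall(text)

        text = "Price: 10 USD, shipping: 2 USD"
        by_string = apply_regex(r"\d+", text)               # string argument
        by_pattern = apply_regex(re.compile(r"\d+"), text)  # compiled argument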

.. method:: XPathSelector.extract()

    Return a unicode string with the content of this :class:`XPathSelector`
    object.

.. method:: XPathSelector.extract_unquoted()

    Return a unicode string with the content of this :class:`XPathSelector`
    without entities or CDATA. This method is intended to be used for text-only
    selectors, like ``//h1/text()`` (but not ``//h1``). If it's used for
    :class:`XPathSelector` objects which don't select textual content (i.e. if
    they contain tags), the output of this method is undefined.
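
    The effect on a text-only selection can be pictured with the standard
    library's ``html.unescape`` (Python 3). This is only an analogy for what
    "without entities" means, not the code Scrapy runs, and the sample string
    is made up::

        from html import unescape

        quoted = "Fish &amp; Chips &lt;fresh&gt;"  # text as written in the markup
        unquoted = unescape(quoted)                # entities replaced by characters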

.. method:: XPathSelector.register_namespace(prefix, uri)

    Register the given namespace to be used in this :class:`XPathSelector`.
    Without registering namespaces you can't select or extract data from
    non-standard namespaces. See examples below.

.. method:: XPathSelector.__nonzero__()

    Returns ``True`` if there is any real content selected by this
    :class:`XPathSelector` or ``False`` otherwise.  In other words, the boolean
    value of an XPathSelector is given by the contents it selects. 
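
    A minimal sketch of this behaviour, using a hypothetical ``SketchSelector``
    class that wraps a plain string instead of a parsed document. It only shows
    how a truth-value hook lets a selector be used directly in a condition::

        class SketchSelector(object):
            """Hypothetical stand-in; not part of Scrapy."""

            def __init__(self, content):
                self.content = content

            def __nonzero__(self):
                # Python 2 truth-value hook: true only when something was selected
                return bool(self.content)

            __bool__ = __nonzero__  # Python 3 name for the same hook

        found = SketchSelector("<h1>Title</h1>")
        empty = SketchSelector("")
        labels = ["selected" if s else "nothing" for s in (found, empty)]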

XPathSelectorList objects
=========================

.. class:: XPathSelectorList

    The :class:`XPathSelectorList` class is a subclass of the builtin ``list``
    class that provides a few additional methods.


XPathSelectorList Methods
-------------------------

.. method:: XPathSelectorList.x(xpath)

    Call the :meth:`XPathSelector.x` method for all :class:`XPathSelector`
    objects in this list and return their results flattened, as a new
    :class:`XPathSelectorList`.

    ``xpath`` is the same argument as the one in :meth:`XPathSelector.x`

.. method:: XPathSelectorList.re(regex)

    Call the :meth:`XPathSelector.re` method for all :class:`XPathSelector`
    objects in this list and return their results flattened, as a list of
    unicode strings.

    ``regex`` is the same argument as the one in :meth:`XPathSelector.re`

.. method:: XPathSelectorList.extract()

    Call the :meth:`XPathSelector.extract` method for all :class:`XPathSelector`
    objects in this list and return their results flattened, as a list of
    unicode strings.

.. method:: XPathSelectorList.extract_unquoted()

    Call the :meth:`XPathSelector.extract_unquoted` method for all
    :class:`XPathSelector` objects in this list and return their results
    flattened, as a list of unicode strings. This method should only be applied
    to text-only XPathSelectors. For more info see
    :meth:`XPathSelector.extract_unquoted`.
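
The way a list-level call fans out to each selector and collects the results
can be sketched with a small ``list`` subclass. ``SketchSelector`` and
``SketchSelectorList`` are hypothetical stand-ins that wrap plain strings; they
illustrate the pattern only and are not Scrapy's classes::

    import re

    class SketchSelector(object):
        """Hypothetical selector wrapping a plain string."""

        def __init__(self, text):
            self.text = text

        def extract(self):
            return self.text

        def re(self, regex):
            return re.findall(regex, self.text)

    class SketchSelectorList(list):
        """Hypothetical list of selectors with the extra methods."""

        def extract(self):
            # One string per selector
            return [sel.extract() for sel in self]

        def re(self, regex):
            # Per-selector match lists merged into one flat list
            return [match for sel in self for match in sel.re(regex)]

    selectors = SketchSelectorList([SketchSelector("a1 b2"), SketchSelector("c3")])
    digits = selectors.re(r"\d")
    texts = selectors.extract()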

HtmlXPathSelector objects
=========================

.. class:: HtmlXPathSelector(response)

   A subclass of :class:`XPathSelector` for working with HTML content. It uses
   the `libxml2`_ HTML parser. See the :class:`XPathSelector` API for more info.

.. _libxml2: http://xmlsoft.org/

HtmlXPathSelector examples
--------------------------

Here are a couple of :class:`HtmlXPathSelector` examples to illustrate several
concepts.  In all cases we assume there is already an :class:`HtmlXPathSelector`
instantiated with a :class:`~scrapy.http.Response` object like this::

      x = HtmlXPathSelector(html_response)

1. Select all ``<h1>`` elements from an HTML response body, returning a list of
   :class:`XPathSelector` objects (i.e. a :class:`XPathSelectorList` object)::

      x.x("//h1")

2. Extract the text of all ``<h1>`` elements from an HTML response body,
   returning a list of unicode strings::

      x.x("//h1").extract()         # this includes the h1 tag
      x.x("//h1/text()").extract()  # this excludes the h1 tag

3. Iterate over all ``<p>`` tags and print their class attribute::

      for node in x.x("//p"):
          print(node.x("@class").extract())

4. Extract textual data from all ``<p>`` tags without entities, as a list of
   unicode strings::

      x.x("//p/text()").extract_unquoted()

      # the following line is wrong. extract_unquoted() should only be used
      # with textual XPathSelectors
      x.x("//p").extract_unquoted()  # it may work but output is unpredictable
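
The difference between selecting an element and selecting its text can be
reproduced with the standard library's ``xml.etree.ElementTree``. This is an
analogy that assumes a small well-formed document, not
:class:`HtmlXPathSelector` itself, and ``PAGE`` is made-up sample markup::

    import xml.etree.ElementTree as ET

    PAGE = '<html><body><h1>Title</h1><p class="intro">Hello</p></body></html>'
    root = ET.fromstring(PAGE)

    h1 = root.find(".//h1")
    with_tag = ET.tostring(h1, encoding="unicode")  # serialized element, tag included
    text_only = h1.text                             # only the text inside the tag

    # Attribute lookup on every <p>, similar in spirit to example 3
    classes = [p.get("class") for p in root.findall(".//p")]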

XmlXPathSelector objects
========================

.. class:: XmlXPathSelector(response)

   A subclass of :class:`XPathSelector` for working with XML content. It uses
   the `libxml2`_ XML parser. See the :class:`XPathSelector` API for more info.

XmlXPathSelector examples
-------------------------

Here are a couple of :class:`XmlXPathSelector` examples to illustrate several
concepts.  In all cases we assume there is already an :class:`XmlXPathSelector`
instantiated with a :class:`~scrapy.http.Response` object like this::

      x = XmlXPathSelector(xml_response)

1. Select all ``<product>`` elements from an XML response body, returning a list of
   :class:`XPathSelector` objects (i.e. a :class:`XPathSelectorList` object)::

      x.x("//product")

2. Extract all prices from a `Google Base XML feed`_ which requires registering
   a namespace::

      x.register_namespace("g", "http://base.google.com/ns/1.0")
      x.x("//g:price").extract()
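
Why the prefix has to be bound to a URI before it can appear in an XPath can be
seen with the standard library's ``xml.etree.ElementTree``, which takes the
mapping as an argument instead of a registration call. This is a sketch by
analogy, and ``FEED`` is a made-up document that only borrows the namespace URI
from the example above::

    import xml.etree.ElementTree as ET

    FEED = """<?xml version="1.0"?>
    <rss xmlns:g="http://base.google.com/ns/1.0">
      <channel>
        <item><title>Book</title><g:price>10.00 USD</g:price></item>
        <item><title>Pen</title><g:price>1.50 USD</g:price></item>
      </channel>
    </rss>"""

    root = ET.fromstring(FEED)

    # Map the "g" prefix used in the query to the namespace URI
    namespaces = {"g": "http://base.google.com/ns/1.0"}
    prices = [node.text for node in root.findall(".//g:price", namespaces)]

    # Without the namespace the same element name matches nothing
    without_ns = root.findall(".//price")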

.. _Google Base XML feed: http://base.google.com/support/bin/answer.py?hl=en&answer=59461