
Merge pull request #426 from scrapy/selectors-unified

[MRG] Selectors unified API
Daniel Graña 2013-10-16 12:58:12 -07:00
commit 289688e39e
39 changed files with 1035 additions and 1074 deletions

View File

@ -143,13 +143,12 @@ Finally, here's the spider code::
rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]
def parse_torrent(self, response):
x = HtmlXPathSelector(response)
sel = Selector(response)
torrent = TorrentItem()
torrent['url'] = response.url
torrent['name'] = x.select("//h1/text()").extract()
torrent['description'] = x.select("//div[@id='description']").extract()
torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
torrent['name'] = sel.xpath("//h1/text()").extract()
torrent['description'] = sel.xpath("//div[@id='description']").extract()
torrent['size'] = sel.xpath("//div[@id='info-left']/p[2]/text()[2]").extract()
return torrent
For brevity's sake, we intentionally left out the import statements. The

View File

@ -183,11 +183,12 @@ Introduction to Selectors
^^^^^^^^^^^^^^^^^^^^^^^^^
There are several ways to extract data from web pages. Scrapy uses a mechanism
based on `XPath`_ expressions called :ref:`XPath selectors <topics-selectors>`.
For more information about selectors and other extraction mechanisms see the
:ref:`XPath selectors documentation <topics-selectors>`.
based on `XPath`_ or `CSS`_ expressions called :ref:`Scrapy Selectors
<topics-selectors>`. For more information about selectors and other extraction
mechanisms see the :ref:`Selectors documentation <topics-selectors>`.
.. _XPath: http://www.w3.org/TR/xpath
.. _CSS: http://www.w3.org/TR/selectors
Here are some examples of XPath expressions and their meanings:
@ -206,27 +207,28 @@ These are just a couple of simple examples of what you can do with XPath, but
XPath expressions are indeed much more powerful. To learn more about XPath we
recommend `this XPath tutorial <http://www.w3schools.com/XPath/default.asp>`_.
For working with XPaths, Scrapy provides a :class:`~scrapy.selector.XPathSelector`
class, which comes in two flavours, :class:`~scrapy.selector.HtmlXPathSelector`
(for HTML data) and :class:`~scrapy.selector.XmlXPathSelector` (for XML data). In
order to use them you must instantiate the desired class with a
:class:`~scrapy.http.Response` object.
For working with XPaths, Scrapy provides a :class:`~scrapy.selector.Selector`
class, which is instantiated with an :class:`~scrapy.http.HtmlResponse` or
:class:`~scrapy.http.XmlResponse` object as its first argument.
You can see selectors as objects that represent nodes in the document
structure. So, the first instantiated selectors are associated with the root
node, or the entire document.
Selectors have three methods (click on the method to see the complete API
Selectors have four basic methods (click on the method to see the complete API
documentation).
* :meth:`~scrapy.selector.XPathSelector.select`: returns a list of selectors, each of
* :meth:`~scrapy.selector.Selector.xpath`: returns a list of selectors, each of
them representing the nodes selected by the xpath expression given as
argument.
argument.
* :meth:`~scrapy.selector.XPathSelector.extract`: returns a unicode string with
the data selected by the XPath selector.
* :meth:`~scrapy.selector.Selector.css`: returns a list of selectors, each of
them representing the nodes selected by the CSS expression given as argument.
* :meth:`~scrapy.selector.XPathSelector.re`: returns a list of unicode strings
* :meth:`~scrapy.selector.Selector.extract`: returns a unicode string with the
selected data.
* :meth:`~scrapy.selector.Selector.re`: returns a list of unicode strings
extracted by applying the regular expression given as argument.
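Putting the four methods together, here is a minimal sketch (assuming a
``response`` object is already available, as in the shell session below)::

    from scrapy.selector import Selector

    sel = Selector(response)
    sel.xpath('//title/text()').extract()      # XPath query, then extract text
    sel.css('title::text').extract()           # equivalent CSS query
    sel.xpath('//title/text()').re(r'(\w+):')  # regex applied to the selected text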
@ -253,12 +255,11 @@ This is what the shell looks like::
[s] Available Scrapy objects:
[s] 2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)
[s] hxs <HtmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s] sel <Selector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s] item Item()
[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] spider <BaseSpider 'default' at 0x1b6c2d0>
[s] xxs <XmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s] Useful shortcuts:
[s] shelp() Print this help
[s] fetch(req_or_url) Fetch a new request or URL and update shell objects
@ -270,23 +271,25 @@ After the shell loads, you will have the response fetched in a local
``response`` variable, so if you type ``response.body`` you will see the body
of the response, or you can type ``response.headers`` to see its headers.
The shell also instantiates two selectors, one for HTML (in the ``hxs``
variable) and one for XML (in the ``xxs`` variable) with this response. So let's
try them::
The shell also pre-instantiates a selector for this response in the ``sel``
variable; the selector automatically chooses the best parsing rules (XML vs
HTML) based on the response's type.
In [1]: hxs.select('//title')
Out[1]: [<HtmlXPathSelector (title) xpath=//title>]
So let's try it::
In [2]: hxs.select('//title').extract()
In [1]: sel.xpath('//title')
Out[1]: [<Selector (title) xpath=//title>]
In [2]: sel.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']
In [3]: hxs.select('//title/text()')
Out[3]: [<HtmlXPathSelector (text) xpath=//title/text()>]
In [3]: sel.xpath('//title/text()')
Out[3]: [<Selector (text) xpath=//title/text()>]
In [4]: hxs.select('//title/text()').extract()
In [4]: sel.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']
In [5]: hxs.select('//title/text()').re('(\w+):')
In [5]: sel.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
Extracting the data
@ -306,29 +309,29 @@ is inside a ``<ul>`` element, in fact the *second* ``<ul>`` element.
So we can select each ``<li>`` element belonging to the sites list with this
code::
hxs.select('//ul/li')
sel.xpath('//ul/li')
And from them, the sites descriptions::
hxs.select('//ul/li/text()').extract()
sel.xpath('//ul/li/text()').extract()
The sites titles::
hxs.select('//ul/li/a/text()').extract()
sel.xpath('//ul/li/a/text()').extract()
And the sites links::
hxs.select('//ul/li/a/@href').extract()
sel.xpath('//ul/li/a/@href').extract()
As we said before, each ``select()`` call returns a list of selectors, so we can
concatenate further ``select()`` calls to dig deeper into a node. We are going to use
As we said before, each ``.xpath()`` call returns a list of selectors, so we can
concatenate further ``.xpath()`` calls to dig deeper into a node. We are going to use
that property here, so::
sites = hxs.select('//ul/li')
sites = sel.xpath('//ul/li')
for site in sites:
title = site.select('a/text()').extract()
link = site.select('a/@href').extract()
desc = site.select('text()').extract()
title = site.xpath('a/text()').extract()
link = site.xpath('a/@href').extract()
desc = site.xpath('text()').extract()
print title, link, desc
.. note::
@ -341,7 +344,7 @@ that property here, so::
Let's add this code to our spider::
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
class DmozSpider(BaseSpider):
name = "dmoz"
@ -352,12 +355,12 @@ Let's add this code to our spider::
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//ul/li')
sel = Selector(response)
sites = sel.xpath('//ul/li')
for site in sites:
title = site.select('a/text()').extract()
link = site.select('a/@href').extract()
desc = site.select('text()').extract()
title = site.xpath('a/text()').extract()
link = site.xpath('a/@href').extract()
desc = site.xpath('text()').extract()
print title, link, desc
Now try crawling the dmoz.org domain again and you'll see sites being printed
@ -382,7 +385,7 @@ Spiders are expected to return their scraped data inside
scraped so far, the final code for our Spider would be like this::
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from tutorial.items import DmozItem
@ -395,14 +398,14 @@ scraped so far, the final code for our Spider would be like this::
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//ul/li')
sel = Selector(response)
sites = sel.xpath('//ul/li')
items = []
for site in sites:
item = DmozItem()
item['title'] = site.select('a/text()').extract()
item['link'] = site.select('a/@href').extract()
item['desc'] = site.select('text()').extract()
item['title'] = site.xpath('a/text()').extract()
item['link'] = site.xpath('a/@href').extract()
item['desc'] = site.xpath('text()').extract()
items.append(item)
return items

View File

@ -9,6 +9,9 @@ Release notes
- Request/Response url/body attributes are now immutable (modifying them had
been deprecated for a long time)
- :setting:`ITEM_PIPELINES` is now defined as a dict (instead of a list)
- Dropped libxml2 selectors backend
- Dropped support for multiple selectors backends, sticking to lxml only
- Selector Unified API with support for CSS expressions (:issue:`395` and :issue:`426`)
0.18.4 (released 2013-10-10)
----------------------------

View File

@ -248,7 +248,6 @@ Memory debugger extension
An extension for debugging memory usage. It collects information about:
* objects uncollected by the Python garbage collector
* libxml2 memory leaks
* objects left alive that shouldn't. For more info, see :ref:`topics-leaks-trackrefs`
To enable this extension, turn on the :setting:`MEMDEBUG_ENABLED` setting. The

View File

@ -107,7 +107,7 @@ Now we're going to write the code to extract data from those pages.
With the help of Firebug, we'll take a look at some page containing links to
websites (say http://directory.google.com/Top/Arts/Awards/) and find out how we can
extract those links using :ref:`XPath selectors <topics-selectors>`. We'll also
extract those links using :ref:`Selectors <topics-selectors>`. We'll also
use the :ref:`Scrapy shell <topics-shell>` to test those XPaths and make sure
they work as we expect.
@ -146,16 +146,16 @@ that have that grey colour of the links,
Finally, we can write our ``parse_category()`` method::
def parse_category(self, response):
hxs = HtmlXPathSelector(response)
sel = Selector(response)
# The path to website links in directory page
links = hxs.select('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')
links = sel.xpath('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')
for link in links:
item = DirectoryItem()
item['name'] = link.select('a/text()').extract()
item['url'] = link.select('a/@href').extract()
item['description'] = link.select('font[2]/text()').extract()
item['name'] = link.xpath('a/text()').extract()
item['url'] = link.xpath('a/@href').extract()
item['description'] = link.xpath('font[2]/text()').extract()
yield item

View File

@ -67,7 +67,7 @@ alias to the :func:`~scrapy.utils.trackref.print_live_refs` function::
ExampleSpider 1 oldest: 15s ago
HtmlResponse 10 oldest: 1s ago
XPathSelector 2 oldest: 0s ago
Selector 2 oldest: 0s ago
FormRequest 878 oldest: 7s ago
As you can see, that report also shows the "age" of the oldest object in each
@ -87,9 +87,8 @@ subclasses):
* ``scrapy.http.Request``
* ``scrapy.http.Response``
* ``scrapy.item.Item``
* ``scrapy.selector.XPathSelector``
* ``scrapy.selector.Selector``
* ``scrapy.spider.BaseSpider``
* ``scrapy.selector.document.Libxml2Document``
A real example
--------------
@ -117,7 +116,7 @@ references::
SomenastySpider 1 oldest: 15s ago
HtmlResponse 3890 oldest: 265s ago
XPathSelector 2 oldest: 0s ago
Selector 2 oldest: 0s ago
Request 3878 oldest: 250s ago
The fact that there are so many live responses (and that they're so old) is

View File

@ -31,7 +31,7 @@ using the Item class specified in the :attr:`ItemLoader.default_item_class`
attribute.
Then, you start collecting values into the Item Loader, typically using
:ref:`XPath Selectors <topics-selectors>`. You can add more than one value to
:ref:`Selectors <topics-selectors>`. You can add more than one value to
the same item field; the Item Loader will know how to "join" those values later
using a proper processing function.
@ -352,14 +352,14 @@ ItemLoader objects
The :class:`XPathItemLoader` class extends the :class:`ItemLoader` class
providing more convenient mechanisms for extracting data from web pages
using :ref:`XPath selectors <topics-selectors>`.
using :ref:`selectors <topics-selectors>`.
:class:`XPathItemLoader` objects accept two additional parameters in
their constructors:
:param selector: The selector to extract data from, when using the
:meth:`add_xpath` or :meth:`replace_xpath` method.
:type selector: :class:`~scrapy.selector.XPathSelector` object
:type selector: :class:`~scrapy.selector.Selector` object
:param response: The response used to construct the selector using the
:attr:`default_selector_class`, unless the selector argument is given,
@ -418,7 +418,7 @@ ItemLoader objects
.. attribute:: selector
The :class:`~scrapy.selector.XPathSelector` object to extract data from.
The :class:`~scrapy.selector.Selector` object to extract data from.
It's either the selector given in the constructor or one created from
the response given in the constructor using the
:attr:`default_selector_class`. This attribute is meant to be
@ -592,7 +592,7 @@ Here is a list of all built-in processors:
work with single values (instead of iterables). For this reason the
:class:`MapCompose` processor is typically used as input processor, since
data is often extracted using the
:meth:`~scrapy.selector.XPathSelector.extract` method of :ref:`selectors
:meth:`~scrapy.selector.Selector.extract` method of :ref:`selectors
<topics-selectors>`, which returns a list of unicode strings.
The example below should clarify how it works::

View File

@ -6,39 +6,43 @@ Selectors
When you're scraping web pages, the most common task you need to perform is
to extract data from the HTML source. There are several libraries available to
achieve this:
achieve this:
* `BeautifulSoup`_ is a very popular screen scraping library among Python
programmers which constructs a Python object based on the
structure of the HTML code and also deals with bad markup reasonably well,
but it has one drawback: it's slow.
programmers which constructs a Python object based on the structure of the
HTML code and also deals with bad markup reasonably well, but it has one
drawback: it's slow.
* `lxml`_ is an XML parsing library (which also parses HTML) with a pythonic
API based on `ElementTree`_ (which is not part of the Python standard
library).
Scrapy comes with its own mechanism for extracting data. They're called XPath
selectors (or just "selectors", for short) because they "select" certain parts
of the HTML document specified by `XPath`_ expressions.
Scrapy comes with its own mechanism for extracting data. They're called
selectors because they "select" certain parts of the HTML document specified
either by `XPath`_ or `CSS`_ expressions.
`XPath`_ is a language for selecting nodes in XML documents, which can also be used with HTML.
`XPath`_ is a language for selecting nodes in XML documents, which can also be
used with HTML. `CSS`_ is a language for applying styles to HTML documents. It
defines selectors to associate those styles with specific HTML elements.
Both `lxml`_ and Scrapy Selectors are built over the `libxml2`_ library, which
means they're very similar in speed and parsing accuracy.
Scrapy selectors are built over the `lxml`_ library, which means they're very
similar in speed and parsing accuracy.
This page explains how selectors work and describes their API which is very
small and simple, unlike the `lxml`_ API which is much bigger because the
`lxml`_ library can be used for many other tasks, besides selecting markup
documents.
For a complete reference of the selectors API see the :ref:`XPath selector
reference <topics-selectors-ref>`.
For a complete reference of the selectors API see the
:ref:`Selector reference <topics-selectors-ref>`.
.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _lxml: http://codespeak.net/lxml/
.. _ElementTree: http://docs.python.org/library/xml.etree.elementtree.html
.. _libxml2: http://xmlsoft.org/
.. _cssselect: https://pypi.python.org/pypi/cssselect/
.. _XPath: http://www.w3.org/TR/xpath
.. _CSS: http://www.w3.org/TR/selectors
Using selectors
===============
@ -46,24 +50,29 @@ Using selectors
Constructing selectors
----------------------
There are two types of selectors bundled with Scrapy. Those are:
* :class:`~scrapy.selector.HtmlXPathSelector` - for working with HTML documents
* :class:`~scrapy.selector.XmlXPathSelector` - for working with XML documents
.. highlight:: python
Both share the same selector API, and are constructed with a Response object as
their first parameter. This is the Response they're going to be "selecting".
Scrapy selectors are instances of the :class:`~scrapy.selector.Selector` class,
constructed by passing a `Response` object as the first argument; the response's
body is what they're going to be "selecting"::
Example::
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
hxs = HtmlXPathSelector(response) # a HTML selector
xxs = XmlXPathSelector(response) # a XML selector
class MySpider(BaseSpider):
# ...
def parse(self, response):
sel = Selector(response)
# Using XPath query
print sel.xpath('//p')
# Using CSS query
print sel.css('p')
# Nesting queries
print sel.xpath('//div[@foo="bar"]').css('span#bold')
Using selectors with XPaths
---------------------------
Using selectors
---------------
To explain how to use the selectors we'll use the `Scrapy shell` (which
provides interactive testing) and an example page located in the Scrapy
@ -84,78 +93,82 @@ First, let's open the shell::
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
Then, after the shell loads, you'll have some selectors already instantiated and
ready to use.
Then, after the shell loads, you'll have a selector already instantiated and
ready to use in the ``sel`` shell variable.
Since we're dealing with HTML, we'll be using the
:class:`~scrapy.selector.HtmlXPathSelector` object which is found, by default, in
the ``hxs`` shell variable.
Since we're dealing with HTML, the selector will automatically use an HTML parser.
.. highlight:: python
So, by looking at the :ref:`HTML code <topics-selectors-htmlcode>` of that page,
let's construct an XPath (using an HTML selector) for selecting the text inside
the title tag::
So, by looking at the :ref:`HTML code <topics-selectors-htmlcode>` of that
page, let's construct an XPath (using an HTML selector) for selecting the text
inside the title tag::
>>> hxs.select('//title/text()')
[<HtmlXPathSelector (text) xpath=//title/text()>]
>>> sel.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
As you can see, the select() method returns an XPathSelectorList, which is a list of
new selectors. This API can be used quickly for extracting nested data.
As you can see, the ``.xpath()`` method returns an
:class:`~scrapy.selector.SelectorList` instance, which is a list of new
selectors. This API can be used quickly for extracting nested data.
To actually extract the textual data, you must call the selector ``extract()``
To actually extract the textual data, you must call the selector ``.extract()``
method, as follows::
>>> hxs.select('//title/text()').extract()
>>> sel.xpath('//title/text()').extract()
[u'Example website']
Notice that CSS selectors can select text or attribute nodes using CSS3
pseudo-elements::
>>> sel.css('title::text').extract()
[u'Example website']
Now we're going to get the base URL and some image links::
>>> hxs.select('//base/@href').extract()
>>> sel.xpath('//base/@href').extract()
[u'http://example.com/']
>>> hxs.select('//a[contains(@href, "image")]/@href').extract()
>>> sel.css('base::attr(href)').extract()
[u'http://example.com/']
>>> sel.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
u'image2.html',
u'image3.html',
u'image4.html',
u'image5.html']
>>> hxs.select('//a[contains(@href, "image")]/img/@src').extract()
>>> sel.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
u'image2.html',
u'image3.html',
u'image4.html',
u'image5.html']
>>> sel.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
u'image2_thumb.jpg',
u'image3_thumb.jpg',
u'image4_thumb.jpg',
u'image5_thumb.jpg']
Using selectors with regular expressions
----------------------------------------
Selectors also have a ``re()`` method for extracting data using regular
expressions. However, unlike using the ``select()`` method, the ``re()`` method
does not return a list of :class:`~scrapy.selector.XPathSelector` objects, so you
can't construct nested ``.re()`` calls.
Here's an example used to extract images names from the :ref:`HTML code
<topics-selectors-htmlcode>` above::
>>> hxs.select('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
u'My image 2',
u'My image 3',
u'My image 4',
u'My image 5']
>>> sel.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
u'image2_thumb.jpg',
u'image3_thumb.jpg',
u'image4_thumb.jpg',
u'image5_thumb.jpg']
.. _topics-selectors-nesting-selectors:
Nesting selectors
-----------------
The ``select()`` selector method returns a list of selectors, so you can call the
``select()`` for those selectors too. Here's an example::
The selection methods (``.xpath()`` or ``.css()``) return a list of selectors
of the same type, so you can call the selection methods for those selectors
too. Here's an example::
>>> links = hxs.select('//a[contains(@href, "image")]')
>>> links = sel.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
@ -164,7 +177,7 @@ The ``select()`` selector method returns a list of selectors, so you can call th
u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
>>> for index, link in enumerate(links):
args = (index, link.select('@href').extract(), link.select('img/@src').extract())
args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
print 'Link number %d points to url %s and image %s' % args
Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
@ -173,35 +186,53 @@ The ``select()`` selector method returns a list of selectors, so you can call th
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']
Using selectors with regular expressions
----------------------------------------
:class:`~scrapy.selector.Selector` also has a ``.re()`` method for extracting
data using regular expressions. However, unlike the ``.xpath()`` or ``.css()``
methods, ``.re()`` returns a list of unicode strings, so you can't construct
nested ``.re()`` calls.
Here's an example used to extract image names from the :ref:`HTML code
<topics-selectors-htmlcode>` above::
>>> sel.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
u'My image 2',
u'My image 3',
u'My image 4',
u'My image 5']
.. _topics-selectors-relative-xpaths:
Working with relative XPaths
----------------------------
Keep in mind that if you are nesting XPathSelectors and use an XPath that
starts with ``/``, that XPath will be absolute to the document and not relative
to the ``XPathSelector`` you're calling it from.
Keep in mind that if you are nesting selectors and use an XPath that starts
with ``/``, that XPath will be absolute to the document and not relative to the
``Selector`` you're calling it from.
For example, suppose you want to extract all ``<p>`` elements inside ``<div>``
elements. First, you would get all ``<div>`` elements::
>>> divs = hxs.select('//div')
>>> divs = sel.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as
it actually extracts all ``<p>`` elements from the document, not only those
inside ``<div>`` elements::
>>> for p in divs.select('//p') # this is wrong - gets all <p> from the whole document
>>> for p in divs.xpath('//p') # this is wrong - gets all <p> from the whole document
>>> print p.extract()
This is the proper way to do it (note the dot prefixing the ``.//p`` XPath)::
>>> for p in divs.select('.//p') # extracts all <p> inside
>>> for p in divs.xpath('.//p') # extracts all <p> inside
>>> print p.extract()
Another common case would be to extract all direct ``<p>`` children::
>>> for p in divs.select('p')
>>> for p in divs.xpath('p')
>>> print p.extract()
For more details about relative XPaths see the `Location Paths`_ section in the
@ -212,175 +243,170 @@ XPath specification.
.. _topics-selectors-ref:
Built-in XPath Selectors reference
==================================
Built-in Selectors reference
============================
.. module:: scrapy.selector
:synopsis: XPath selectors classes
:synopsis: Selector class
There are two types of selectors bundled with Scrapy:
:class:`HtmlXPathSelector` and :class:`XmlXPathSelector`. Both of them
implement the same :class:`XPathSelector` interface. The only different is that
one is used to process HTML data and the other XML data.
.. class:: Selector(response=None, text=None, type=None)
XPathSelector objects
---------------------
An instance of :class:`Selector` is a wrapper over the response to select
certain parts of its content.
.. class:: XPathSelector(response)
``response`` is a :class:`~scrapy.http.HtmlResponse` or
:class:`~scrapy.http.XmlResponse` object that will be used for selecting and
extracting data.
A :class:`XPathSelector` object is a wrapper over response to select
certain parts of its content.
``text`` is a unicode string or utf-8 encoded text for cases when a
``response`` isn't available. Using ``text`` and ``response`` together is
undefined behavior.
``response`` is a :class:`~scrapy.http.Response` object that will be used
for selecting and extracting data
``type`` defines the selector type; it can be ``"html"``, ``"xml"`` or ``None`` (default).
.. method:: select(xpath)
If ``type`` is ``None``, the selector automatically chooses the best type
based on ``response`` type (see below), or defaults to ``"html"`` in case it
is used together with ``text``.
Apply the given XPath relative to this XPathSelector and return a list
of :class:`XPathSelector` objects (ie. a :class:`XPathSelectorList`) with
the result.
If ``type`` is ``None`` and a ``response`` is passed, the selector type is
inferred from the response type as follows:
``xpath`` is a string containing the XPath to apply
* ``"html"`` for :class:`~scrapy.http.HtmlResponse` type
* ``"xml"`` for :class:`~scrapy.http.XmlResponse` type
* ``"html"`` for anything else
.. method:: re(regex)
Otherwise, if ``type`` is set, the selector type will be forced and no
detection will occur.
Apply the given regex and return a list of unicode strings with the
matches.
.. method:: xpath(query)
``regex`` can be either a compiled regular expression or a string which
will be compiled to a regular expression using ``re.compile(regex)``
Find nodes matching the xpath ``query`` and return the result as a
:class:`SelectorList` instance with all elements flattened. List
elements implement the :class:`Selector` interface too.
``query`` is a string containing the XPath query to apply.
.. method:: css(query)
Apply the given CSS selector and return a :class:`SelectorList` instance.
``query`` is a string containing the CSS selector to apply.
In the background, CSS queries are translated into XPath queries using the
`cssselect`_ library and run through the ``.xpath()`` method.
.. method:: extract()
Serialize and return the matched nodes as a list of unicode strings.
Percent encoded content is unquoted.
.. method:: re(regex)
Apply the given regex and return a list of unicode strings with the
matches.
``regex`` can be either a compiled regular expression or a string which
will be compiled to a regular expression using ``re.compile(regex)``
.. method:: register_namespace(prefix, uri)
Register the given namespace to be used in this :class:`Selector`.
Without registering namespaces you can't select or extract data from
non-standard namespaces. See examples below.
.. method:: remove_namespaces()
Remove all namespaces, allowing to traverse the document using
namespace-less xpaths. See example below.
.. method:: __nonzero__()
Returns ``True`` if there is any real content selected or ``False``
otherwise. In other words, the boolean value of a :class:`Selector` is
given by the contents it selects.
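As a brief illustration of the constructor arguments described above, here is a
minimal sketch (the HTML snippet is hypothetical) building a :class:`Selector`
directly from text with an explicit ``type``::

    from scrapy.selector import Selector

    body = u'<html><body><span>good</span></body></html>'
    sel = Selector(text=body, type='html')   # type could be omitted: "html" is the default for text
    sel.xpath('//span/text()').extract()     # [u'good']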
SelectorList objects
--------------------
.. class:: SelectorList
The :class:`SelectorList` class is a subclass of the builtin ``list``
class, which provides a few additional methods.
.. method:: xpath(query)
Call the ``.xpath()`` method for each element in this list and return
their results flattened as another :class:`SelectorList`.
``query`` is the same argument as the one in :meth:`Selector.xpath`
.. method:: css(query)
Call the ``.css()`` method for each element in this list and return
their results flattened as another :class:`SelectorList`.
``query`` is the same argument as the one in :meth:`Selector.css`
.. method:: extract()
Return a unicode string with the content of this :class:`XPathSelector`
object.
Call the ``.extract()`` method for each element in this list and return
their results flattened, as a list of unicode strings.
.. method:: register_namespace(prefix, uri)
.. method:: re()
Register the given namespace to be used in this :class:`XPathSelector`.
Without registering namespaces you can't select or extract data from
non-standard namespaces. See examples below.
.. method:: remove_namespaces()
Remove all namespaces, allowing to traverse the document using
namespace-less xpaths. See example below.
Call the ``.re()`` method for each element in this list and return
their results flattened, as a list of unicode strings.
.. method:: __nonzero__()
Returns ``True`` if there is any real content selected by this
:class:`XPathSelector` or ``False`` otherwise. In other words, the boolean
value of an XPathSelector is given by the contents it selects.
Returns ``True`` if the list is not empty, ``False`` otherwise.
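Since ``.xpath()`` and ``.css()`` on a :class:`SelectorList` return another
flattened :class:`SelectorList`, calls can be chained before the final
``.extract()`` or ``.re()``; a short sketch (assuming ``sel`` wraps an HTML
response with the usual link markup)::

    # .xpath()/.css() run on every element and flatten; .extract()/.re() end the chain
    sel.xpath('//ul/li').css('a::attr(href)').extract()
    sel.xpath('//ul/li/a/text()').re(r'Name:\s*(.*)')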
XPathSelectorList objects
-------------------------
.. class:: XPathSelectorList
Selector examples on HTML response
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :class:`XPathSelectorList` class is subclass of the builtin ``list``
class, which provides a few additional methods.
Here's a couple of :class:`Selector` examples to illustrate several concepts.
In all cases, we assume there is already a :class:`Selector` instantiated with
a :class:`~scrapy.http.HtmlResponse` object like this::
.. method:: select(xpath)
Call the :meth:`XPathSelector.select` method for all :class:`XPathSelector`
objects in this list and return their results flattened, as a new
:class:`XPathSelectorList`.
``xpath`` is the same argument as the one in :meth:`XPathSelector.select`
.. method:: re(regex)
Call the :meth:`XPathSelector.re` method for all :class:`XPathSelector`
objects in this list and return their results flattened, as a list of
unicode strings.
``regex`` is the same argument as the one in :meth:`XPathSelector.re`
.. method:: extract()
Call the :meth:`XPathSelector.extract` method for all :class:`XPathSelector`
objects in this list and return their results flattened, as a list of
unicode strings.
.. method:: extract_unquoted()
Call the :meth:`XPathSelector.extract_unoquoted` method for all
:class:`XPathSelector` objects in this list and return their results
flattened, as a list of unicode strings. This method should not be applied
to all kinds of XPathSelectors. For more info see
:meth:`XPathSelector.extract_unoquoted`.
HtmlXPathSelector objects
-------------------------
.. class:: HtmlXPathSelector(response)
A subclass of :class:`XPathSelector` for working with HTML content. It uses
the `libxml2`_ HTML parser. See the :class:`XPathSelector` API for more info.
.. _libxml2: http://xmlsoft.org/
HtmlXPathSelector examples
~~~~~~~~~~~~~~~~~~~~~~~~~~
Here's a couple of :class:`HtmlXPathSelector` examples to illustrate several
concepts. In all cases, we assume there is already an :class:`HtmlPathSelector`
instantiated with a :class:`~scrapy.http.Response` object like this::
x = HtmlXPathSelector(html_response)
x = Selector(html_response)
1. Select all ``<h1>`` elements from a HTML response body, returning a list of
:class:`XPathSelector` objects (ie. a :class:`XPathSelectorList` object)::
:class:`Selector` objects (ie. a :class:`SelectorList` object)::
x.select("//h1")
x.xpath("//h1")
2. Extract the text of all ``<h1>`` elements from a HTML response body,
returning a list of unicode strings::
x.select("//h1").extract() # this includes the h1 tag
x.select("//h1/text()").extract() # this excludes the h1 tag
x.xpath("//h1").extract() # this includes the h1 tag
x.xpath("//h1/text()").extract() # this excludes the h1 tag
3. Iterate over all ``<p>`` tags and print their class attribute::
for node in x.select("//p"):
... print node.select("@href")
for node in x.xpath("//p"):
... print node.xpath("@class").extract()
4. Extract textual data from all ``<p>`` tags without entities, as a list of
unicode strings::
Selector examples on XML response
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
x.select("//p/text()").extract_unquoted()
Here's a couple of examples to illustrate several concepts. In both cases we
assume there is already a :class:`Selector` instantiated with a
:class:`~scrapy.http.XmlResponse` object like this::
# the following line is wrong. extract_unquoted() should only be used
# with textual XPathSelectors
x.select("//p").extract_unquoted() # it may work but output is unpredictable
x = Selector(xml_response)
XmlXPathSelector objects
------------------------
1. Select all ``<product>`` elements from a XML response body, returning a list
of :class:`Selector` objects (ie. a :class:`SelectorList` object)::
.. class:: XmlXPathSelector(response)
A subclass of :class:`XPathSelector` for working with XML content. It uses
the `libxml2`_ XML parser. See the :class:`XPathSelector` API for more info.
XmlXPathSelector examples
~~~~~~~~~~~~~~~~~~~~~~~~~
Here's a couple of :class:`XmlXPathSelector` examples to illustrate several
concepts. In both cases we assume there is already an :class:`XmlXPathSelector`
instantiated with a :class:`~scrapy.http.Response` object like this::
x = XmlXPathSelector(xml_response)
1. Select all ``<product>`` elements from a XML response body, returning a list of
:class:`XPathSelector` objects (ie. a :class:`XPathSelectorList` object)::
x.select("//product")
x.xpath("//product")
2. Extract all prices from a `Google Base XML feed`_ which requires registering
a namespace::
x.register_namespace("g", "http://base.google.com/ns/1.0")
x.select("//g:price").extract()
x.xpath("//g:price").extract()
.. _removing-namespaces:
@ -390,7 +416,7 @@ Removing namespaces
When dealing with scraping projects, it is often quite convenient to get rid of
namespaces altogether and just work with element names, to write more
simple/convenient XPaths. You can use the
:meth:`XPathSelector.remove_namespaces` method for that.
:meth:`Selector.remove_namespaces` method for that.
Let's show an example that illustrates this with the GitHub blog atom feed.
@ -401,27 +427,27 @@ First, we open the shell with the url we want to scrape::
Once in the shell we can try selecting all ``<link>`` objects and see that it
doesn't work (because the Atom XML namespace is obfuscating those nodes)::
>>> xxs.select("//link")
>>> xxs.xpath("//link")
[]
But once we call the :meth:`XPathSelector.remove_namespaces` method, all
But once we call the :meth:`Selector.remove_namespaces` method, all
nodes can be accessed directly by their names::
>>> xxs.remove_namespaces()
>>> xxs.select("//link")
[<XmlXPathSelector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
<XmlXPathSelector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
>>> xxs.xpath("//link")
[<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
...
If you wonder why the namespace removal procedure is not always called
automatically, instead of having to call it manually, it is because of two
reasons which, in order of relevance, are:
1. removing namespaces requires to iterate and modify all nodes in the
1. Removing namespaces requires iterating over and modifying all nodes in the
document, which is a reasonably expensive operation to perform for all
documents crawled by Scrapy
2. there could be some cases where using namespaces is actually required, in
2. There could be some cases where using namespaces is actually required, in
case some element names clash between namespaces. These cases are very rare
though.

View File

@ -9,10 +9,10 @@ scraping code very quickly, without having to run the spider. It's meant to be
used for testing data extraction code, but you can actually use it for testing
any kind of code as it is also a regular Python shell.
The shell is used for testing XPath expressions and see how they work and what
data they extract from the web pages you're trying to scrape. It allows you to
interactively test your XPaths while you're writing your spider, without having
to run the spider to test every change.
The shell is used for testing XPath or CSS expressions and seeing how they work
and what data they extract from the web pages you're trying to scrape. It
allows you to interactively test your expressions while you're writing your
spider, without having to run the spider to test every change.
Once you get familiarized with the Scrapy shell, you'll see that it's an
invaluable tool for developing and debugging your spiders.
@ -66,7 +66,7 @@ Available Scrapy objects
The Scrapy shell automatically creates some convenient objects from the
downloaded page, like the :class:`~scrapy.http.Response` object and the
:class:`~scrapy.selector.XPathSelector` objects (for both HTML and XML
:class:`~scrapy.selector.Selector` objects (for both HTML and XML
content).
Those objects are:
@ -83,10 +83,7 @@ Those objects are:
* ``response`` - a :class:`~scrapy.http.Response` object containing the last
fetched page
* ``hxs`` - a :class:`~scrapy.selector.HtmlXPathSelector` object constructed
with the last response fetched
* ``xxs`` - a :class:`~scrapy.selector.XmlXPathSelector` object constructed
* ``sel`` - a :class:`~scrapy.selector.Selector` object constructed
with the last response fetched
* ``settings`` - the current :ref:`Scrapy settings <topics-settings>`
@ -114,13 +111,12 @@ list of available objects and useful shortcuts (you'll notice that these lines
all start with the ``[s]`` prefix)::
[s] Available objects
[s] hxs <HtmlXPathSelector (http://scrapy.org) xpath=None>
[s] sel <Selector (http://scrapy.org) xpath=None>
[s] item Item()
[s] request <http://scrapy.org>
[s] response <http://scrapy.org>
[s] settings <Settings 'mybot.settings'>
[s] spider <scrapy.spider.models.BaseSpider object at 0x2bed9d0>
[s] xxs <XmlXPathSelector (http://scrapy.org) xpath=None>
[s] Useful shortcuts:
[s] shelp() Prints this help.
[s] fetch(req_or_url) Fetch a new request or URL and update objects
@ -130,24 +126,23 @@ all start with the ``[s]`` prefix)::
After that, we can start playing with the objects::
>>> hxs.select("//h2/text()").extract()[0]
>>> sel.xpath("//h2/text()").extract()[0]
u'Welcome to Scrapy'
>>> fetch("http://slashdot.org")
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector (http://slashdot.org) xpath=None>
[s] sel <Selector (http://slashdot.org) xpath=None>
[s] item JobItem()
[s] request <GET http://slashdot.org>
[s] response <200 http://slashdot.org>
[s] settings <Settings 'jobsbot.settings'>
[s] spider <BaseSpider 'default' at 0x3c44a10>
[s] xxs <XmlXPathSelector (http://slashdot.org) xpath=None>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
>>> hxs.select("//h2/text()").extract()
>>> sel.xpath("//h2/text()").extract()
[u'News for nerds, stuff that matters']
>>> request = request.replace(method="POST")
@ -185,7 +180,7 @@ When you run the spider, you will get something similar to this::
2009-08-27 19:15:25-0300 [example.com] DEBUG: Crawled <http://www.example.com/> (referer: <None>)
2009-08-27 19:15:26-0300 [example.com] DEBUG: Crawled <http://www.example.com/products.php> (referer: <http://www.example.com/>)
[s] Available objects
[s] hxs <HtmlXPathSelector (http://www.example.com/products.php) xpath=None>
[s] sel <Selector (http://www.example.com/products.php) xpath=None>
...
>>> response.url
@ -193,7 +188,7 @@ When you run the spider, you will get something similar to this::
Then, you can check if the extraction code is working::
>>> hxs.select('//h1')
>>> sel.xpath('//h1')
[]
Nope, it doesn't. So you can open the response in your web browser and see if

View File

@ -216,7 +216,7 @@ Let's see an example::
Another example returning multiples Requests and Items from a single callback::
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from myproject.items import MyItem
@ -231,11 +231,11 @@ Another example returning multiples Requests and Items from a single callback::
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
for h3 in hxs.select('//h3').extract():
sel = Selector(response)
for h3 in sel.xpath('//h3').extract():
yield MyItem(title=h3)
for url in hxs.select('//a/@href').extract():
for url in sel.xpath('//a/@href').extract():
yield Request(url, callback=self.parse)
.. module:: scrapy.contrib.spiders
@ -314,7 +314,7 @@ Let's now take a look at an example CrawlSpider with rules::
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.item import Item
class MySpider(CrawlSpider):
@ -334,11 +334,11 @@ Let's now take a look at an example CrawlSpider with rules::
def parse_item(self, response):
self.log('Hi, this is an item page! %s' % response.url)
hxs = HtmlXPathSelector(response)
sel = Selector(response)
item = Item()
item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
item['id'] = sel.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
item['name'] = sel.xpath('//td[@id="item_name"]/text()').extract()
item['description'] = sel.xpath('//td[@id="item_description"]/text()').extract()
return item
@ -366,15 +366,15 @@ XMLFeedSpider
A string which defines the iterator to use. It can be either:
- ``'iternodes'`` - a fast iterator based on regular expressions
- ``'iternodes'`` - a fast iterator based on regular expressions
- ``'html'`` - an iterator which uses HtmlXPathSelector. Keep in mind
this uses DOM parsing and must load all DOM in memory which could be a
problem for big feeds
- ``'html'`` - an iterator which uses :class:`~scrapy.selector.Selector`.
Keep in mind this uses DOM parsing and must load all DOM in memory
which could be a problem for big feeds
- ``'xml'`` - an iterator which uses XmlXPathSelector. Keep in mind
this uses DOM parsing and must load all DOM in memory which could be a
problem for big feeds
- ``'xml'`` - an iterator which uses :class:`~scrapy.selector.Selector`.
Keep in mind this uses DOM parsing and must load all DOM in memory
which could be a problem for big feeds
It defaults to: ``'iternodes'``.
@ -390,7 +390,7 @@ XMLFeedSpider
available in that document that will be processed with this spider. The
``prefix`` and ``uri`` will be used to automatically register
namespaces using the
:meth:`~scrapy.selector.XPathSelector.register_namespace` method.
:meth:`~scrapy.selector.Selector.register_namespace` method.
You can then specify nodes with namespaces in the :attr:`itertag`
attribute.
@ -416,9 +416,10 @@ XMLFeedSpider
.. method:: parse_node(response, selector)
This method is called for the nodes matching the provided tag name
(``itertag``). Receives the response and an XPathSelector for each node.
Overriding this method is mandatory. Otherwise, you spider won't work.
This method must return either a :class:`~scrapy.item.Item` object, a
(``itertag``). Receives the response and a
:class:`~scrapy.selector.Selector` for each node. Overriding this
method is mandatory. Otherwise, your spider won't work. This method
must return either a :class:`~scrapy.item.Item` object, a
:class:`~scrapy.http.Request` object, or an iterable containing any of
them.
@ -451,9 +452,9 @@ These spiders are pretty easy to use, let's have a look at one example::
log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))
item = Item()
item['id'] = node.select('@id').extract()
item['name'] = node.select('name').extract()
item['description'] = node.select('description').extract()
item['id'] = node.xpath('@id').extract()
item['name'] = node.xpath('name').extract()
item['description'] = node.xpath('description').extract()
return item
Basically what we did up there was to create a spider that downloads a feed from

View File

@ -30,13 +30,6 @@ except ImportError:
else:
optional_features.add('boto')
try:
import libxml2
except ImportError:
pass
else:
optional_features.add('libxml2')
try:
import django
except ImportError:

View File

@ -6,6 +6,7 @@ import twisted
import scrapy
from scrapy.command import ScrapyCommand
class Command(ScrapyCommand):
def syntax(self):
@ -21,13 +22,9 @@ class Command(ScrapyCommand):
def run(self, args, opts):
if opts.verbose:
try:
import lxml.etree
except ImportError:
lxml_version = libxml2_version = "(lxml not available)"
else:
lxml_version = ".".join(map(str, lxml.etree.LXML_VERSION))
libxml2_version = ".".join(map(str, lxml.etree.LIBXML_VERSION))
import lxml.etree
lxml_version = ".".join(map(str, lxml.etree.LXML_VERSION))
libxml2_version = ".".join(map(str, lxml.etree.LIBXML_VERSION))
print "Scrapy : %s" % scrapy.__version__
print "lxml : %s" % lxml_version
print "libxml2 : %s" % libxml2_version

View File

@ -4,7 +4,7 @@ SGMLParser-based Link extractors
import re
from urlparse import urlparse, urljoin
from w3lib.url import safe_url_string
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.link import Link
from scrapy.linkextractor import IGNORED_EXTENSIONS
from scrapy.utils.misc import arg_to_iter
@ -116,11 +116,11 @@ class SgmlLinkExtractor(BaseSgmlLinkExtractor):
def extract_links(self, response):
base_url = None
if self.restrict_xpaths:
hxs = HtmlXPathSelector(response)
sel = Selector(response)
base_url = get_base_url(response)
body = u''.join(f
for x in self.restrict_xpaths
for f in hxs.select(x).extract()
for f in sel.xpath(x).extract()
).encode(response.encoding)
else:
body = response.body
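For context, here is a sketch of how ``restrict_xpaths`` might be used from a
crawl rule (the spider and XPath are hypothetical); only links found inside the
matching regions are extracted::

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class ExampleSpider(CrawlSpider):
        name = 'example'
        start_urls = ['http://www.example.com/']
        rules = [
            # restrict link extraction to the main content area of each page
            Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="content"]',)),
                 callback='parse_item'),
        ]

        def parse_item(self, response):
            self.log('Visited %s' % response.url)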

View File

@ -8,7 +8,7 @@ from collections import defaultdict
import re
from scrapy.item import Item
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.utils.misc import arg_to_iter, extract_regex
from scrapy.utils.python import flatten
from .common import wrap_loader_context
@ -116,7 +116,7 @@ class ItemLoader(object):
class XPathItemLoader(ItemLoader):
default_selector_class = HtmlXPathSelector
default_selector_class = Selector
def __init__(self, item=None, selector=None, response=None, **context):
if selector is None and response is None:
@ -142,5 +142,4 @@ class XPathItemLoader(ItemLoader):
def _get_values(self, xpaths, **kw):
xpaths = arg_to_iter(xpaths)
return flatten([self.selector.select(xpath).extract() for xpath in xpaths])
return flatten([self.selector.xpath(xpath).extract() for xpath in xpaths])
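A short usage sketch for the loader after this change (``ProductItem`` and the
XPaths are hypothetical); values added with ``add_xpath()`` end up going through
``Selector.xpath()`` via ``_get_values()``::

    from scrapy.item import Item, Field
    from scrapy.contrib.loader import XPathItemLoader

    class ProductItem(Item):
        name = Field()
        price = Field()

    # inside a spider callback:
    def parse(self, response):
        loader = XPathItemLoader(item=ProductItem(), response=response)
        loader.add_xpath('name', '//h1/text()')
        loader.add_xpath('price', '//p[@class="price"]/text()')
        return loader.load_item()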

View File

@ -10,14 +10,10 @@ from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.utils.trackref import live_refs
class MemoryDebugger(object):
def __init__(self, stats):
try:
import libxml2
self.libxml2 = libxml2
except ImportError:
self.libxml2 = None
self.stats = stats
@classmethod
@ -25,18 +21,10 @@ class MemoryDebugger(object):
if not crawler.settings.getbool('MEMDEBUG_ENABLED'):
raise NotConfigured
o = cls(crawler.stats)
crawler.signals.connect(o.engine_started, signals.engine_started)
crawler.signals.connect(o.engine_stopped, signals.engine_stopped)
return o
def engine_started(self):
if self.libxml2:
self.libxml2.debugMemory(1)
def engine_stopped(self):
if self.libxml2:
self.libxml2.cleanupParser()
self.stats.set_value('memdebug/libxml2_leaked_bytes', self.libxml2.debugMemory(1))
gc.collect()
self.stats.set_value('memdebug/gc_garbage_count', len(gc.garbage))
for cls, wdict in live_refs.iteritems():

View File

@ -9,7 +9,7 @@ from scrapy.item import BaseItem
from scrapy.http import Request
from scrapy.utils.iterators import xmliter, csviter
from scrapy.utils.spider import iterate_spider_output
from scrapy.selector import XmlXPathSelector, HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.exceptions import NotConfigured, NotSupported
@ -52,7 +52,7 @@ class XMLFeedSpider(BaseSpider):
def parse_nodes(self, response, nodes):
"""This method is called for the nodes matching the provided tag name
(itertag). Receives the response and an XPathSelector for each node.
(itertag). Receives the response and a Selector for each node.
Overriding this method is mandatory. Otherwise, your spider won't work.
This method must return either a BaseItem, a Request, or a list
containing any of them.
@ -71,13 +71,13 @@ class XMLFeedSpider(BaseSpider):
if self.iterator == 'iternodes':
nodes = self._iternodes(response)
elif self.iterator == 'xml':
selector = XmlXPathSelector(response)
selector = Selector(response, type='xml')
self._register_namespaces(selector)
nodes = selector.select('//%s' % self.itertag)
nodes = selector.xpath('//%s' % self.itertag)
elif self.iterator == 'html':
selector = HtmlXPathSelector(response)
selector = Selector(response, type='html')
self._register_namespaces(selector)
nodes = selector.select('//%s' % self.itertag)
nodes = selector.xpath('//%s' % self.itertag)
else:
raise NotSupported('Unsupported node iterator')
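A sketch of a spider opting into the Selector-backed XML iterator (the feed URL
and fields are hypothetical)::

    from scrapy.contrib.spiders import XMLFeedSpider
    from scrapy.item import Item, Field

    class ProductItem(Item):
        id = Field()
        name = Field()

    class FeedSpider(XMLFeedSpider):
        name = 'feedspider'
        start_urls = ['http://www.example.com/products.xml']
        iterator = 'xml'      # parse the whole feed with Selector(type='xml')
        itertag = 'product'

        def parse_node(self, response, node):
            item = ProductItem()
            item['id'] = node.xpath('@id').extract()
            item['name'] = node.xpath('name/text()').extract()
            return item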

View File

@ -1,5 +1,5 @@
from scrapy.http import Response
from scrapy.selector import XmlXPathSelector
from scrapy.selector import Selector
def xmliter_lxml(obj, nodename, namespace=None):
@ -11,10 +11,10 @@ def xmliter_lxml(obj, nodename, namespace=None):
for _, node in iterable:
nodetext = etree.tostring(node)
node.clear()
xs = XmlXPathSelector(text=nodetext)
xs = Selector(text=nodetext, type='xml')
if namespace:
xs.register_namespace('x', namespace)
yield xs.select(selxpath)[0]
yield xs.xpath(selxpath)[0]
class _StreamReader(object):

View File

@ -1,26 +1,5 @@
"""
XPath selectors
To select the backend explicitly use the SCRAPY_SELECTORS_BACKEND environment
variable.
Two backends are currently available: lxml (default) and libxml2.
Selectors
"""
import os
backend = os.environ.get('SCRAPY_SELECTORS_BACKEND')
if backend == 'libxml2':
from scrapy.selector.libxml2sel import *
elif backend == 'lxml':
from scrapy.selector.lxmlsel import *
else:
try:
import lxml
except ImportError:
import libxml2
from scrapy.selector.libxml2sel import *
else:
from scrapy.selector.lxmlsel import *
from scrapy.selector.unified import *
from scrapy.selector.lxmlsel import *

View File

@ -0,0 +1,88 @@
from cssselect import GenericTranslator, HTMLTranslator
from cssselect.xpath import _unicode_safe_getattr, XPathExpr, ExpressionError
from cssselect.parser import FunctionalPseudoElement
class ScrapyXPathExpr(XPathExpr):
textnode = False
attribute = None
@classmethod
def from_xpath(cls, xpath, textnode=False, attribute=None):
x = cls(path=xpath.path, element=xpath.element, condition=xpath.condition)
x.textnode = textnode
x.attribute = attribute
return x
def __str__(self):
path = super(ScrapyXPathExpr, self).__str__()
if self.textnode:
if path == '*':
path = 'text()'
elif path.endswith('::*/*'):
path = path[:-3] + 'text()'
else:
path += '/text()'
if self.attribute is not None:
if path.endswith('::*/*'):
path = path[:-2]
path += '/@%s' % self.attribute
return path
def join(self, combiner, other):
super(ScrapyXPathExpr, self).join(combiner, other)
self.textnode = other.textnode
self.attribute = other.attribute
return self
class TranslatorMixin(object):
def xpath_element(self, selector):
xpath = super(TranslatorMixin, self).xpath_element(selector)
return ScrapyXPathExpr.from_xpath(xpath)
def xpath_pseudo_element(self, xpath, pseudo_element):
if isinstance(pseudo_element, FunctionalPseudoElement):
method = 'xpath_%s_functional_pseudo_element' % (
pseudo_element.name.replace('-', '_'))
method = _unicode_safe_getattr(self, method, None)
if not method:
raise ExpressionError(
"The functional pseudo-element ::%s() is unknown"
% pseudo_element.name)
xpath = method(xpath, pseudo_element)
else:
method = 'xpath_%s_simple_pseudo_element' % (
pseudo_element.replace('-', '_'))
method = _unicode_safe_getattr(self, method, None)
if not method:
raise ExpressionError(
"The pseudo-element ::%s is unknown"
% pseudo_element)
xpath = method(xpath)
return xpath
def xpath_attr_functional_pseudo_element(self, xpath, function):
if function.argument_types() not in (['STRING'], ['IDENT']):
raise ExpressionError(
"Expected a single string or ident for ::attr(), got %r"
% function.arguments)
return ScrapyXPathExpr.from_xpath(xpath,
attribute=function.arguments[0].value)
def xpath_text_simple_pseudo_element(self, xpath):
"""Support selecting text nodes using ::text pseudo-element"""
return ScrapyXPathExpr.from_xpath(xpath, textnode=True)
class ScrapyGenericTranslator(TranslatorMixin, GenericTranslator):
pass
class ScrapyHTMLTranslator(TranslatorMixin, HTMLTranslator):
pass
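A quick usage sketch for the translator classes above (assuming the new module
is importable as ``scrapy.selector.csstranslator``; the XPath shown in the
comments is approximate). ``Selector.css()`` relies on the ``css_to_xpath()``
method inherited from cssselect::

    from scrapy.selector.csstranslator import ScrapyHTMLTranslator

    translator = ScrapyHTMLTranslator()
    # ::attr(...) adds an attribute step, ::text adds a text() step
    print translator.css_to_xpath('a::attr(href)')   # descendant-or-self::a/@href
    print translator.css_to_xpath('p::text')         # descendant-or-self::p/text()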

View File

@ -1,82 +0,0 @@
"""
This module contains a simple class (Libxml2Document) which provides cache and
garbage collection to libxml2 documents (xmlDoc).
"""
import weakref
from scrapy.utils.trackref import object_ref
from scrapy import optional_features
if 'libxml2' in optional_features:
import libxml2
xml_parser_options = libxml2.XML_PARSE_RECOVER + \
libxml2.XML_PARSE_NOERROR + \
libxml2.XML_PARSE_NOWARNING
html_parser_options = libxml2.HTML_PARSE_RECOVER + \
libxml2.HTML_PARSE_NOERROR + \
libxml2.HTML_PARSE_NOWARNING
_UTF8_ENCODINGS = set(('utf-8', 'UTF-8', 'utf8', 'UTF8'))
def _body_as_utf8(response):
if response.encoding in _UTF8_ENCODINGS:
return response.body
else:
return response.body_as_unicode().encode('utf-8')
def xmlDoc_from_html(response):
"""Return libxml2 doc for HTMLs"""
utf8body = _body_as_utf8(response) or ' '
try:
lxdoc = libxml2.htmlReadDoc(utf8body, response.url, 'utf-8', \
html_parser_options)
except TypeError: # libxml2 doesn't parse text with null bytes
lxdoc = libxml2.htmlReadDoc(utf8body.replace("\x00", ""), response.url, \
'utf-8', html_parser_options)
return lxdoc
def xmlDoc_from_xml(response):
"""Return libxml2 doc for XMLs"""
utf8body = _body_as_utf8(response) or ' '
try:
lxdoc = libxml2.readDoc(utf8body, response.url, 'utf-8', \
xml_parser_options)
except TypeError: # libxml2 doesn't parse text with null bytes
lxdoc = libxml2.readDoc(utf8body.replace("\x00", ""), response.url, \
'utf-8', xml_parser_options)
return lxdoc
class Libxml2Document(object_ref):
cache = weakref.WeakKeyDictionary()
__slots__ = ['xmlDoc', 'xpathContext', '__weakref__']
def __new__(cls, response, factory=xmlDoc_from_html):
cache = cls.cache.setdefault(response, {})
if factory not in cache:
obj = object_ref.__new__(cls)
obj.xmlDoc = factory(response)
obj.xpathContext = obj.xmlDoc.xpathNewContext()
cache[factory] = obj
return cache[factory]
def __del__(self):
# we must call both cleanup functions, so we try/except all exceptions
# to make sure one doesn't prevent the other from being called
# this call sometimes raises a "NoneType is not callable" TypeError
# so the try/except block silences them
try:
self.xmlDoc.freeDoc()
except:
pass
try:
self.xpathContext.xpathFreeContext()
except:
pass
def __str__(self):
return "<Libxml2Document %s>" % self.xmlDoc.name

View File

@ -1,117 +0,0 @@
"""
XPath selectors based on libxml2
"""
from scrapy import optional_features
if 'libxml2' in optional_features:
import libxml2
from scrapy.http import TextResponse
from scrapy.utils.python import unicode_to_str
from scrapy.utils.misc import extract_regex
from scrapy.utils.trackref import object_ref
from scrapy.utils.decorator import deprecated
from .libxml2document import Libxml2Document, xmlDoc_from_html, xmlDoc_from_xml
from .list import XPathSelectorList
__all__ = ['HtmlXPathSelector', 'XmlXPathSelector', 'XPathSelector', \
'XPathSelectorList']
class XPathSelector(object_ref):
__slots__ = ['doc', 'xmlNode', 'expr', '__weakref__']
def __init__(self, response=None, text=None, node=None, parent=None, expr=None):
if parent is not None:
self.doc = parent.doc
self.xmlNode = node
elif response:
self.doc = Libxml2Document(response, factory=self._get_libxml2_doc)
self.xmlNode = self.doc.xmlDoc
elif text:
response = TextResponse(url='about:blank', \
body=unicode_to_str(text, 'utf-8'), encoding='utf-8')
self.doc = Libxml2Document(response, factory=self._get_libxml2_doc)
self.xmlNode = self.doc.xmlDoc
self.expr = expr
def select(self, xpath):
if hasattr(self.xmlNode, 'xpathEval'):
self.doc.xpathContext.setContextNode(self.xmlNode)
xpath = unicode_to_str(xpath, 'utf-8')
try:
xpath_result = self.doc.xpathContext.xpathEval(xpath)
except libxml2.xpathError:
raise ValueError("Invalid XPath: %s" % xpath)
if hasattr(xpath_result, '__iter__'):
return XPathSelectorList([self.__class__(node=node, parent=self, \
expr=xpath) for node in xpath_result])
else:
return XPathSelectorList([self.__class__(node=xpath_result, \
parent=self, expr=xpath)])
else:
return XPathSelectorList([])
def re(self, regex):
return extract_regex(regex, self.extract())
def extract(self):
if isinstance(self.xmlNode, basestring):
text = unicode(self.xmlNode, 'utf-8', errors='ignore')
elif hasattr(self.xmlNode, 'serialize'):
if isinstance(self.xmlNode, libxml2.xmlDoc):
data = self.xmlNode.getRootElement().serialize('utf-8')
text = unicode(data, 'utf-8', errors='ignore') if data else u''
elif isinstance(self.xmlNode, libxml2.xmlAttr):
# serialization doesn't work sometimes for xmlAttr types
text = unicode(self.xmlNode.content, 'utf-8', errors='ignore')
else:
data = self.xmlNode.serialize('utf-8')
text = unicode(data, 'utf-8', errors='ignore') if data else u''
else:
try:
text = unicode(self.xmlNode, 'utf-8', errors='ignore')
except TypeError: # caught when self.xmlNode is a float - see tests
text = unicode(self.xmlNode)
return text
def extract_unquoted(self):
"""Get unescaped contents from the text node (no entities, no CDATA)"""
# TODO: this function should be deprecated, but what would be used instead?
if self.select('self::text()'):
return unicode(self.xmlNode.getContent(), 'utf-8', errors='ignore')
else:
return u''
def register_namespace(self, prefix, uri):
self.doc.xpathContext.xpathRegisterNs(prefix, uri)
def _get_libxml2_doc(self, response):
return xmlDoc_from_html(response)
def __nonzero__(self):
return bool(self.extract())
def __str__(self):
data = repr(self.extract()[:40])
return "<%s xpath=%r data=%s>" % (type(self).__name__, self.expr, data)
__repr__ = __str__
@deprecated(use_instead='XPathSelector.select')
def __call__(self, xpath):
return self.select(xpath)
@deprecated(use_instead='XPathSelector.select')
def x(self, xpath):
return self.select(xpath)
class XmlXPathSelector(XPathSelector):
__slots__ = ()
_get_libxml2_doc = staticmethod(xmlDoc_from_xml)
class HtmlXPathSelector(XPathSelector):
__slots__ = ()
_get_libxml2_doc = staticmethod(xmlDoc_from_html)

View File

@ -1,23 +0,0 @@
from scrapy.utils.python import flatten
from scrapy.utils.decorator import deprecated
class XPathSelectorList(list):
def __getslice__(self, i, j):
return self.__class__(list.__getslice__(self, i, j))
def select(self, xpath):
return self.__class__(flatten([x.select(xpath) for x in self]))
def re(self, regex):
return flatten([x.re(regex) for x in self])
def extract(self):
return [x.extract() for x in self]
def extract_unquoted(self):
return [x.extract_unquoted() for x in self]
@deprecated(use_instead='XPathSelectorList.select')
def x(self, xpath):
return self.select(xpath)

View File

@ -1,109 +1,47 @@
"""
XPath selectors based on lxml
"""
from lxml import etree
from scrapy.utils.misc import extract_regex
from scrapy.utils.trackref import object_ref
from scrapy.utils.python import unicode_to_str
from scrapy.utils.decorator import deprecated
from scrapy.http import TextResponse
from .lxmldocument import LxmlDocument
from .list import XPathSelectorList
from .unified import Selector, SelectorList
__all__ = ['HtmlXPathSelector', 'XmlXPathSelector', 'XPathSelector', \
__all__ = ['HtmlXPathSelector', 'XmlXPathSelector', 'XPathSelector',
'XPathSelectorList']
class XPathSelector(object_ref):
class XPathSelector(Selector):
__slots__ = ()
_default_type = 'html'
__slots__ = ['response', 'text', 'namespaces', '_expr', '_root', '__weakref__']
_parser = etree.HTMLParser
_tostring_method = 'html'
def __init__(self, *a, **kw):
import warnings
from scrapy.exceptions import ScrapyDeprecationWarning
warnings.warn('%s is deprecated, instantiate scrapy.selector.Selector '
'instead' % type(self).__name__,
category=ScrapyDeprecationWarning, stacklevel=1)
super(XPathSelector, self).__init__(*a, **kw)
def __init__(self, response=None, text=None, namespaces=None, _root=None, _expr=None):
if text is not None:
response = TextResponse(url='about:blank', \
body=unicode_to_str(text, 'utf-8'), encoding='utf-8')
if response is not None:
_root = LxmlDocument(response, self._parser)
self.namespaces = namespaces
self.response = response
self._root = _root
self._expr = _expr
def select(self, xpath):
try:
xpathev = self._root.xpath
except AttributeError:
return XPathSelectorList([])
try:
result = xpathev(xpath, namespaces=self.namespaces)
except etree.XPathError:
raise ValueError("Invalid XPath: %s" % xpath)
if type(result) is not list:
result = [result]
result = [self.__class__(_root=x, _expr=xpath, namespaces=self.namespaces)
for x in result]
return XPathSelectorList(result)
def re(self, regex):
return extract_regex(regex, self.extract())
def extract(self):
try:
return etree.tostring(self._root, method=self._tostring_method, \
encoding=unicode, with_tail=False)
except (AttributeError, TypeError):
if self._root is True:
return u'1'
elif self._root is False:
return u'0'
else:
return unicode(self._root)
def register_namespace(self, prefix, uri):
if self.namespaces is None:
self.namespaces = {}
self.namespaces[prefix] = uri
def remove_namespaces(self):
for el in self._root.iter('*'):
if el.tag.startswith('{'):
el.tag = el.tag.split('}', 1)[1]
# loop on element attributes also
for an in el.attrib.keys():
if an.startswith('{'):
el.attrib[an.split('}', 1)[1]] = el.attrib.pop(an)
def __nonzero__(self):
return bool(self.extract())
def __str__(self):
data = repr(self.extract()[:40])
return "<%s xpath=%r data=%s>" % (type(self).__name__, self._expr, data)
__repr__ = __str__
@deprecated(use_instead='XPathSelector.extract')
def extract_unquoted(self):
return self.extract()
def css(self, *a, **kw):
raise RuntimeError('.css() method not available for %s, '
'instantiate scrapy.selector.Selector '
'instead' % type(self).__name__)
class XmlXPathSelector(XPathSelector):
__slots__ = ()
_parser = etree.XMLParser
_tostring_method = 'xml'
_default_type = 'xml'
class HtmlXPathSelector(XPathSelector):
__slots__ = ()
_parser = etree.HTMLParser
_tostring_method = 'html'
_default_type = 'html'
class XPathSelectorList(SelectorList):
def __init__(self, *a, **kw):
import warnings
from scrapy.exceptions import ScrapyDeprecationWarning
warnings.warn('XPathSelectorList is deprecated, instantiate '
'scrapy.selector.SelectorList instead',
category=ScrapyDeprecationWarning, stacklevel=1)
super(XPathSelectorList, self).__init__(*a, **kw)
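The deprecated classes above are now thin shims over the new Selector. A brief sketch of the intended behavior, mirroring the deprecation tests later in this diff (the markup is only an example):

import warnings
from scrapy.selector.lxmlsel import HtmlXPathSelector

with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter('always')
    hxs = HtmlXPathSelector(text=u'<div><p>Hello</p></div>')
    print(hxs.select('//p/text()').extract())  # [u'Hello'], with ScrapyDeprecationWarning entries collected in w
# hxs.css('p') raises RuntimeError and points users at scrapy.selector.Selector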

scrapy/selector/unified.py (new file, +170 lines)
View File

@ -0,0 +1,170 @@
"""
XPath selectors based on lxml
"""
from lxml import etree
from scrapy.utils.misc import extract_regex
from scrapy.utils.trackref import object_ref
from scrapy.utils.python import unicode_to_str, flatten
from scrapy.utils.decorator import deprecated
from scrapy.http import HtmlResponse, XmlResponse
from .lxmldocument import LxmlDocument
from .csstranslator import ScrapyHTMLTranslator, ScrapyGenericTranslator
__all__ = ['Selector', 'SelectorList']
_ctgroup = {
'html': {'_parser': etree.HTMLParser,
'_csstranslator': ScrapyHTMLTranslator(),
'_tostring_method': 'html'},
'xml': {'_parser': etree.XMLParser,
'_csstranslator': ScrapyGenericTranslator(),
'_tostring_method': 'xml'},
}
def _st(response, st):
if st is None:
return 'xml' if isinstance(response, XmlResponse) else 'html'
elif st in ('xml', 'html'):
return st
else:
raise ValueError('Invalid type: %s' % st)
def _response_from_text(text, st):
rt = XmlResponse if st == 'xml' else HtmlResponse
return rt(url='about:blank', encoding='utf-8',
body=unicode_to_str(text, 'utf-8'))
class Selector(object_ref):
__slots__ = ['response', 'text', 'namespaces', 'type', '_expr', '_root',
'__weakref__', '_parser', '_csstranslator', '_tostring_method']
_default_type = None
def __init__(self, response=None, text=None, type=None, namespaces=None,
_root=None, _expr=None):
self.type = st = _st(response, type or self._default_type)
self._parser = _ctgroup[st]['_parser']
self._csstranslator = _ctgroup[st]['_csstranslator']
self._tostring_method = _ctgroup[st]['_tostring_method']
if text is not None:
response = _response_from_text(text, st)
if response is not None:
_root = LxmlDocument(response, self._parser)
self.response = response
self.namespaces = namespaces
self._root = _root
self._expr = _expr
def xpath(self, query):
try:
xpathev = self._root.xpath
except AttributeError:
return SelectorList([])
try:
result = xpathev(query, namespaces=self.namespaces)
except etree.XPathError:
raise ValueError("Invalid XPath: %s" % query)
if type(result) is not list:
result = [result]
result = [self.__class__(_root=x, _expr=query,
namespaces=self.namespaces,
type=self.type)
for x in result]
return SelectorList(result)
def css(self, query):
return self.xpath(self._css2xpath(query))
def _css2xpath(self, query):
return self._csstranslator.css_to_xpath(query)
def re(self, regex):
return extract_regex(regex, self.extract())
def extract(self):
try:
return etree.tostring(self._root,
method=self._tostring_method,
encoding=unicode,
with_tail=False)
except (AttributeError, TypeError):
if self._root is True:
return u'1'
elif self._root is False:
return u'0'
else:
return unicode(self._root)
def register_namespace(self, prefix, uri):
if self.namespaces is None:
self.namespaces = {}
self.namespaces[prefix] = uri
def remove_namespaces(self):
for el in self._root.iter('*'):
if el.tag.startswith('{'):
el.tag = el.tag.split('}', 1)[1]
# loop on element attributes also
for an in el.attrib.keys():
if an.startswith('{'):
el.attrib[an.split('}', 1)[1]] = el.attrib.pop(an)
def __nonzero__(self):
return bool(self.extract())
def __str__(self):
data = repr(self.extract()[:40])
return "<%s xpath=%r data=%s>" % (type(self).__name__, self._expr, data)
__repr__ = __str__
# Deprecated api
@deprecated(use_instead='.xpath()')
def select(self, xpath):
return self.xpath(xpath)
@deprecated(use_instead='.extract()')
def extract_unquoted(self):
return self.extract()
class SelectorList(list):
def __getslice__(self, i, j):
return self.__class__(list.__getslice__(self, i, j))
def xpath(self, xpath):
return self.__class__(flatten([x.xpath(xpath) for x in self]))
def css(self, xpath):
return self.__class__(flatten([x.css(xpath) for x in self]))
def re(self, regex):
return flatten([x.re(regex) for x in self])
def extract(self):
return [x.extract() for x in self]
@deprecated(use_instead='.extract()')
def extract_unquoted(self):
return [x.extract_unquoted() for x in self]
@deprecated(use_instead='.xpath()')
def x(self, xpath):
return self.select(xpath)
@deprecated(use_instead='.xpath()')
def select(self, xpath):
return self.xpath(xpath)
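Putting the new class to work, a minimal usage sketch; the expressions are taken from the selector tests later in this diff, so the markup and expected results are illustrative:

from scrapy.selector import Selector

sel = Selector(text=u"<body><div id='1'>not<span>me</span></div></body>")
print(sel.xpath('//div[@id="1"]').css('span::text').extract())  # [u'me']
print(sel.css('#1').xpath('./span/text()').extract())           # [u'me']

# type='xml' (or passing an XmlResponse) switches to the XML parser flavor
xml = Selector(text=u'<root>lala</root>', type='xml')
print(xml.xpath('.').extract())  # [u'<root>lala</root>']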

View File

@ -11,20 +11,20 @@ from w3lib.url import any_to_uri
from scrapy.item import BaseItem
from scrapy.spider import BaseSpider
from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.utils.spider import create_spider_for_request
from scrapy.utils.misc import load_object
from scrapy.utils.response import open_in_browser
from scrapy.utils.console import start_python_console
from scrapy.settings import Settings
from scrapy.http import Request, Response, HtmlResponse, XmlResponse
from scrapy.http import Request, Response
from scrapy.exceptions import IgnoreRequest
class Shell(object):
relevant_classes = (BaseSpider, Request, Response, BaseItem,
XPathSelector, Settings)
Selector, Settings)
def __init__(self, crawler, update_vars=None, code=None):
self.crawler = crawler
@ -95,10 +95,7 @@ class Shell(object):
self.vars['spider'] = spider
self.vars['request'] = request
self.vars['response'] = response
self.vars['xxs'] = XmlXPathSelector(response) \
if isinstance(response, XmlResponse) else None
self.vars['hxs'] = HtmlXPathSelector(response) \
if isinstance(response, HtmlResponse) else None
self.vars['sel'] = Selector(response)
if self.inthread:
self.vars['fetch'] = self.fetch
self.vars['view'] = open_in_browser
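With this change the shell exposes a single `sel` variable instead of the old `hxs`/`xxs` pair, so a session looks roughly like this (hypothetical page, expressions only illustrative):

# inside `scrapy shell <url>`
>>> sel.xpath("//h1/text()").extract()
>>> sel.css("title::text").extract()
>>> fetch('http://example.com/other-page')  # re-binds sel to the newly fetched response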

View File

@ -31,7 +31,7 @@ class ShellTest(ProcessTest, SiteTest, unittest.TestCase):
@defer.inlineCallbacks
def test_response_selector_html(self):
xpath = 'hxs.select("//p[@class=\'one\']/text()").extract()[0]'
xpath = 'sel.xpath("//p[@class=\'one\']/text()").extract()[0]'
_, out, _ = yield self.execute([self.url('/html'), '-c', xpath])
self.assertEqual(out.strip(), 'Works')

View File

@ -4,7 +4,7 @@ from scrapy.contrib.loader import ItemLoader, XPathItemLoader
from scrapy.contrib.loader.processor import Join, Identity, TakeFirst, \
Compose, MapCompose
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
@ -379,7 +379,7 @@ class XPathItemLoaderTest(unittest.TestCase):
self.assertRaises(RuntimeError, XPathItemLoader)
def test_constructor_with_selector(self):
sel = HtmlXPathSelector(text=u"<html><body><div>marta</div></body></html>")
sel = Selector(text=u"<html><body><div>marta</div></body></html>")
l = TestXPathItemLoader(selector=sel)
self.assert_(l.selector is sel)
l.add_xpath('name', '//div/text()')

View File

@ -1,20 +0,0 @@
from twisted.trial import unittest
from scrapy.utils.test import libxml2debug
from scrapy import optional_features
class Libxml2Test(unittest.TestCase):
skip = 'libxml2' not in optional_features
@libxml2debug
def test_libxml2_bug_2_6_27(self):
# this test will fail in version 2.6.27 but passes on 2.6.29+
import libxml2
html = "<td>1<b>2</b>3</td>"
node = libxml2.htmlParseDoc(html, 'utf-8')
result = [str(r) for r in node.xpathEval('//text()')]
self.assertEquals(result, ['1', '2', '3'])
node.freeDoc()

View File

@ -1,87 +1,85 @@
"""
Selectors tests, common for all backends
"""
import re
import warnings
import weakref
from twisted.trial import unittest
from scrapy.exceptions import ScrapyDeprecationWarning
from scrapy.http import TextResponse, HtmlResponse, XmlResponse
from scrapy.selector import XmlXPathSelector, HtmlXPathSelector, \
XPathSelector
from scrapy.utils.test import libxml2debug
from scrapy.selector import Selector
from scrapy.selector.lxmlsel import XmlXPathSelector, HtmlXPathSelector, XPathSelector
class XPathSelectorTestCase(unittest.TestCase):
xs_cls = XPathSelector
hxs_cls = HtmlXPathSelector
xxs_cls = XmlXPathSelector
class SelectorTestCase(unittest.TestCase):
@libxml2debug
def test_selector_simple(self):
sscls = Selector
def test_simple_selection(self):
"""Simple selector tests"""
body = "<p><input name='a'value='1'/><input name='b'value='2'/></p>"
response = TextResponse(url="http://example.com", body=body)
xpath = self.hxs_cls(response)
sel = self.sscls(response)
xl = xpath.select('//input')
xl = sel.xpath('//input')
self.assertEqual(2, len(xl))
for x in xl:
assert isinstance(x, self.hxs_cls)
assert isinstance(x, self.sscls)
self.assertEqual(xpath.select('//input').extract(),
[x.extract() for x in xpath.select('//input')])
self.assertEqual(sel.xpath('//input').extract(),
[x.extract() for x in sel.xpath('//input')])
self.assertEqual([x.extract() for x in xpath.select("//input[@name='a']/@name")],
self.assertEqual([x.extract() for x in sel.xpath("//input[@name='a']/@name")],
[u'a'])
self.assertEqual([x.extract() for x in xpath.select("number(concat(//input[@name='a']/@value, //input[@name='b']/@value))")],
self.assertEqual([x.extract() for x in sel.xpath("number(concat(//input[@name='a']/@value, //input[@name='b']/@value))")],
[u'12.0'])
self.assertEqual(xpath.select("concat('xpath', 'rules')").extract(),
self.assertEqual(sel.xpath("concat('xpath', 'rules')").extract(),
[u'xpathrules'])
self.assertEqual([x.extract() for x in xpath.select("concat(//input[@name='a']/@value, //input[@name='b']/@value)")],
self.assertEqual([x.extract() for x in sel.xpath("concat(//input[@name='a']/@value, //input[@name='b']/@value)")],
[u'12'])
@libxml2debug
def test_selector_unicode_query(self):
def test_select_unicode_query(self):
body = u"<p><input name='\xa9' value='1'/></p>"
response = TextResponse(url="http://example.com", body=body, encoding='utf8')
xpath = self.hxs_cls(response)
self.assertEqual(xpath.select(u'//input[@name="\xa9"]/@value').extract(), [u'1'])
sel = self.sscls(response)
self.assertEqual(sel.xpath(u'//input[@name="\xa9"]/@value').extract(), [u'1'])
@libxml2debug
def test_selector_same_type(self):
"""Test XPathSelector returning the same type in x() method"""
def test_list_elements_type(self):
"""Test Selector returning the same type in selection methods"""
text = '<p>test<p>'
assert isinstance(self.xxs_cls(text=text).select("//p")[0],
self.xxs_cls)
assert isinstance(self.hxs_cls(text=text).select("//p")[0],
self.hxs_cls)
assert isinstance(self.sscls(text=text).xpath("//p")[0], self.sscls)
assert isinstance(self.sscls(text=text).css("p")[0], self.sscls)
@libxml2debug
def test_selector_boolean_result(self):
def test_boolean_result(self):
body = "<p><input name='a'value='1'/><input name='b'value='2'/></p>"
response = TextResponse(url="http://example.com", body=body)
xs = self.hxs_cls(response)
self.assertEquals(xs.select("//input[@name='a']/@name='a'").extract(), [u'1'])
self.assertEquals(xs.select("//input[@name='a']/@name='n'").extract(), [u'0'])
@libxml2debug
def test_selector_xml_html(self):
"""Test that XML and HTML XPathSelector's behave differently"""
xs = self.sscls(response)
self.assertEquals(xs.xpath("//input[@name='a']/@name='a'").extract(), [u'1'])
self.assertEquals(xs.xpath("//input[@name='a']/@name='n'").extract(), [u'0'])
def test_differences_parsing_xml_vs_html(self):
"""Test that XML and HTML Selector's behave differently"""
# some text which is parsed differently by XML and HTML flavors
text = '<div><img src="a.jpg"><p>Hello</div>'
self.assertEqual(self.xxs_cls(text=text).select("//div").extract(),
[u'<div><img src="a.jpg"><p>Hello</p></img></div>'])
self.assertEqual(self.hxs_cls(text=text).select("//div").extract(),
hs = self.sscls(text=text, type='html')
self.assertEqual(hs.xpath("//div").extract(),
[u'<div><img src="a.jpg"><p>Hello</p></div>'])
@libxml2debug
def test_selector_nested(self):
xs = self.sscls(text=text, type='xml')
self.assertEqual(xs.xpath("//div").extract(),
[u'<div><img src="a.jpg"><p>Hello</p></img></div>'])
def test_flavor_detection(self):
text = '<div><img src="a.jpg"><p>Hello</div>'
sel = self.sscls(XmlResponse('http://example.com', body=text))
self.assertEqual(sel.type, 'xml')
self.assertEqual(sel.xpath("//div").extract(),
[u'<div><img src="a.jpg"><p>Hello</p></img></div>'])
sel = self.sscls(HtmlResponse('http://example.com', body=text))
self.assertEqual(sel.type, 'html')
self.assertEqual(sel.xpath("//div").extract(),
[u'<div><img src="a.jpg"><p>Hello</p></div>'])
def test_nested_selectors(self):
"""Nested selector tests"""
body = """<body>
<div class='one'>
@ -97,26 +95,30 @@ class XPathSelectorTestCase(unittest.TestCase):
</body>"""
response = HtmlResponse(url="http://example.com", body=body)
x = self.hxs_cls(response)
divtwo = x.select('//div[@class="two"]')
self.assertEqual(map(unicode.strip, divtwo.select("//li").extract()),
x = self.sscls(response)
divtwo = x.xpath('//div[@class="two"]')
self.assertEqual(divtwo.xpath("//li").extract(),
["<li>one</li>", "<li>two</li>", "<li>four</li>", "<li>five</li>", "<li>six</li>"])
self.assertEqual(map(unicode.strip, divtwo.select("./ul/li").extract()),
self.assertEqual(divtwo.xpath("./ul/li").extract(),
["<li>four</li>", "<li>five</li>", "<li>six</li>"])
self.assertEqual(map(unicode.strip, divtwo.select(".//li").extract()),
self.assertEqual(divtwo.xpath(".//li").extract(),
["<li>four</li>", "<li>five</li>", "<li>six</li>"])
self.assertEqual(divtwo.select("./li").extract(),
[])
self.assertEqual(divtwo.xpath("./li").extract(), [])
def test_mixed_nested_selectors(self):
body = '''<body>
<div id=1>not<span>me</span></div>
<div class="dos"><p>text</p><a href='#'>foo</a></div>
</body>'''
sel = self.sscls(text=body)
self.assertEqual(sel.xpath('//div[@id="1"]').css('span::text').extract(), [u'me'])
self.assertEqual(sel.css('#1').xpath('./span/text()').extract(), [u'me'])
@libxml2debug
def test_dont_strip(self):
hxs = self.hxs_cls(text='<div>fff: <a href="#">zzz</a></div>')
self.assertEqual(hxs.select("//text()").extract(),
[u'fff: ', u'zzz'])
sel = self.sscls(text='<div>fff: <a href="#">zzz</a></div>')
self.assertEqual(sel.xpath("//text()").extract(), [u'fff: ', u'zzz'])
@libxml2debug
def test_selector_namespaces_simple(self):
def test_namespaces_simple(self):
body = """
<test xmlns:somens="http://scrapy.org">
<somens:a id="foo">take this</a>
@ -125,14 +127,13 @@ class XPathSelectorTestCase(unittest.TestCase):
"""
response = XmlResponse(url="http://example.com", body=body)
x = self.xxs_cls(response)
x = self.sscls(response)
x.register_namespace("somens", "http://scrapy.org")
self.assertEqual(x.select("//somens:a/text()").extract(),
self.assertEqual(x.xpath("//somens:a/text()").extract(),
[u'take this'])
@libxml2debug
def test_selector_namespaces_multiple(self):
def test_namespaces_multiple(self):
body = """<?xml version="1.0" encoding="UTF-8"?>
<BrowseNode xmlns="http://webservices.amazon.com/AWSECommerceService/2005-10-05"
xmlns:b="http://somens.com"
@ -143,20 +144,18 @@ class XPathSelectorTestCase(unittest.TestCase):
</BrowseNode>
"""
response = XmlResponse(url="http://example.com", body=body)
x = self.xxs_cls(response)
x = self.sscls(response)
x.register_namespace("xmlns", "http://webservices.amazon.com/AWSECommerceService/2005-10-05")
x.register_namespace("p", "http://www.scrapy.org/product")
x.register_namespace("b", "http://somens.com")
self.assertEqual(len(x.select("//xmlns:TestTag")), 1)
self.assertEqual(x.select("//b:Operation/text()").extract()[0], 'hello')
self.assertEqual(x.select("//xmlns:TestTag/@b:att").extract()[0], 'value')
self.assertEqual(x.select("//p:SecondTestTag/xmlns:price/text()").extract()[0], '90')
self.assertEqual(x.select("//p:SecondTestTag").select("./xmlns:price/text()")[0].extract(), '90')
self.assertEqual(x.select("//p:SecondTestTag/xmlns:material/text()").extract()[0], 'iron')
self.assertEqual(len(x.xpath("//xmlns:TestTag")), 1)
self.assertEqual(x.xpath("//b:Operation/text()").extract()[0], 'hello')
self.assertEqual(x.xpath("//xmlns:TestTag/@b:att").extract()[0], 'value')
self.assertEqual(x.xpath("//p:SecondTestTag/xmlns:price/text()").extract()[0], '90')
self.assertEqual(x.xpath("//p:SecondTestTag").xpath("./xmlns:price/text()")[0].extract(), '90')
self.assertEqual(x.xpath("//p:SecondTestTag/xmlns:material/text()").extract()[0], 'iron')
@libxml2debug
def test_selector_re(self):
def test_re(self):
body = """<div>Name: Mary
<ul>
<li>Name: John</li>
@ -165,47 +164,35 @@ class XPathSelectorTestCase(unittest.TestCase):
<li>Age: 20</li>
</ul>
Age: 20
</div>
"""
</div>"""
response = HtmlResponse(url="http://example.com", body=body)
x = self.hxs_cls(response)
x = self.sscls(response)
name_re = re.compile("Name: (\w+)")
self.assertEqual(x.select("//ul/li").re(name_re),
self.assertEqual(x.xpath("//ul/li").re(name_re),
["John", "Paul"])
self.assertEqual(x.select("//ul/li").re("Age: (\d+)"),
self.assertEqual(x.xpath("//ul/li").re("Age: (\d+)"),
["10", "20"])
@libxml2debug
def test_selector_re_intl(self):
def test_re_intl(self):
body = """<div>Evento: cumplea\xc3\xb1os</div>"""
response = HtmlResponse(url="http://example.com", body=body, encoding='utf-8')
x = self.hxs_cls(response)
self.assertEqual(x.select("//div").re("Evento: (\w+)"), [u'cumplea\xf1os'])
x = self.sscls(response)
self.assertEqual(x.xpath("//div").re("Evento: (\w+)"), [u'cumplea\xf1os'])
@libxml2debug
def test_selector_over_text(self):
hxs = self.hxs_cls(text='<root>lala</root>')
self.assertEqual(hxs.extract(),
u'<html><body><root>lala</root></body></html>')
hs = self.sscls(text='<root>lala</root>')
self.assertEqual(hs.extract(), u'<html><body><root>lala</root></body></html>')
xs = self.sscls(text='<root>lala</root>', type='xml')
self.assertEqual(xs.extract(), u'<root>lala</root>')
self.assertEqual(xs.xpath('.').extract(), [u'<root>lala</root>'])
xxs = self.xxs_cls(text='<root>lala</root>')
self.assertEqual(xxs.extract(),
u'<root>lala</root>')
xxs = self.xxs_cls(text='<root>lala</root>')
self.assertEqual(xxs.select('.').extract(),
[u'<root>lala</root>'])
@libxml2debug
def test_selector_invalid_xpath(self):
def test_invalid_xpath(self):
response = XmlResponse(url="http://example.com", body="<html></html>")
x = self.hxs_cls(response)
x = self.sscls(response)
xpath = "//test[@foo='bar]"
try:
x.select(xpath)
x.xpath(xpath)
except ValueError, e:
assert xpath in str(e), "Exception message does not contain invalid xpath"
except Exception:
@ -213,7 +200,6 @@ class XPathSelectorTestCase(unittest.TestCase):
else:
raise AssertionError("A invalid XPath does not raise an exception")
@libxml2debug
def test_http_header_encoding_precedence(self):
# u'\xa3' = pound symbol in unicode
# u'\xc2\xa3' = pound symbol in utf-8
@ -229,71 +215,121 @@ class XPathSelectorTestCase(unittest.TestCase):
headers = {'Content-Type': ['text/html; charset=utf-8']}
response = HtmlResponse(url="http://example.com", headers=headers, body=html_utf8)
x = self.hxs_cls(response)
self.assertEquals(x.select("//span[@id='blank']/text()").extract(),
x = self.sscls(response)
self.assertEquals(x.xpath("//span[@id='blank']/text()").extract(),
[u'\xa3'])
@libxml2debug
def test_empty_bodies(self):
# shouldn't raise errors
r1 = TextResponse('http://www.example.com', body='')
self.hxs_cls(r1).select('//text()').extract()
self.xxs_cls(r1).select('//text()').extract()
self.sscls(r1).xpath('//text()').extract()
@libxml2debug
def test_null_bytes(self):
# shouldn't raise errors
r1 = TextResponse('http://www.example.com', \
body='<root>pre\x00post</root>', \
encoding='utf-8')
self.hxs_cls(r1).select('//text()').extract()
self.xxs_cls(r1).select('//text()').extract()
self.sscls(r1).xpath('//text()').extract()
@libxml2debug
def test_badly_encoded_body(self):
# \xe9 alone isn't valid utf8 sequence
r1 = TextResponse('http://www.example.com', \
body='<html><p>an Jos\xe9 de</p><html>', \
encoding='utf-8')
self.hxs_cls(r1).select('//text()').extract()
self.xxs_cls(r1).select('//text()').extract()
self.sscls(r1).xpath('//text()').extract()
@libxml2debug
def test_select_on_unevaluable_nodes(self):
r = self.hxs_cls(text=u'<span class="big">some text</span>')
r = self.sscls(text=u'<span class="big">some text</span>')
# Text node
x1 = r.select('//text()')
x1 = r.xpath('//text()')
self.assertEquals(x1.extract(), [u'some text'])
self.assertEquals(x1.select('.//b').extract(), [])
self.assertEquals(x1.xpath('.//b').extract(), [])
# Tag attribute
x1 = r.select('//span/@class')
x1 = r.xpath('//span/@class')
self.assertEquals(x1.extract(), [u'big'])
self.assertEquals(x1.select('.//text()').extract(), [])
self.assertEquals(x1.xpath('.//text()').extract(), [])
@libxml2debug
def test_select_on_text_nodes(self):
r = self.hxs_cls(text=u'<div><b>Options:</b>opt1</div><div><b>Other</b>opt2</div>')
x1 = r.select("//div/descendant::text()[preceding-sibling::b[contains(text(), 'Options')]]")
r = self.sscls(text=u'<div><b>Options:</b>opt1</div><div><b>Other</b>opt2</div>')
x1 = r.xpath("//div/descendant::text()[preceding-sibling::b[contains(text(), 'Options')]]")
self.assertEquals(x1.extract(), [u'opt1'])
x1 = r.select("//div/descendant::text()/preceding-sibling::b[contains(text(), 'Options')]")
x1 = r.xpath("//div/descendant::text()/preceding-sibling::b[contains(text(), 'Options')]")
self.assertEquals(x1.extract(), [u'<b>Options:</b>'])
@libxml2debug
def test_nested_select_on_text_nodes(self):
# FIXME: does not work with lxml backend [upstream]
r = self.hxs_cls(text=u'<div><b>Options:</b>opt1</div><div><b>Other</b>opt2</div>')
x1 = r.select("//div/descendant::text()")
x2 = x1.select("./preceding-sibling::b[contains(text(), 'Options')]")
r = self.sscls(text=u'<div><b>Options:</b>opt1</div><div><b>Other</b>opt2</div>')
x1 = r.xpath("//div/descendant::text()")
x2 = x1.xpath("./preceding-sibling::b[contains(text(), 'Options')]")
self.assertEquals(x2.extract(), [u'<b>Options:</b>'])
test_nested_select_on_text_nodes.skip = True
test_nested_select_on_text_nodes.skip = "Text nodes lost parent node reference in lxml"
@libxml2debug
def test_weakref_slots(self):
"""Check that classes are using slots and are weak-referenceable"""
for cls in [self.xs_cls, self.hxs_cls, self.xxs_cls]:
x = cls()
weakref.ref(x)
assert not hasattr(x, '__dict__'), "%s does not use __slots__" % \
x.__class__.__name__
x = self.sscls()
weakref.ref(x)
assert not hasattr(x, '__dict__'), "%s does not use __slots__" % \
x.__class__.__name__
def test_remove_namespaces(self):
xml = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-US" xmlns:media="http://search.yahoo.com/mrss/">
<link type="text/html">
<link type="application/atom+xml">
</feed>
"""
sel = self.sscls(XmlResponse("http://example.com/feed.atom", body=xml))
self.assertEqual(len(sel.xpath("//link")), 0)
sel.remove_namespaces()
self.assertEqual(len(sel.xpath("//link")), 2)
def test_remove_attributes_namespaces(self):
xml = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns:atom="http://www.w3.org/2005/Atom" xml:lang="en-US" xmlns:media="http://search.yahoo.com/mrss/">
<link atom:type="text/html">
<link atom:type="application/atom+xml">
</feed>
"""
sel = self.sscls(XmlResponse("http://example.com/feed.atom", body=xml))
self.assertEqual(len(sel.xpath("//link/@type")), 0)
sel.remove_namespaces()
self.assertEqual(len(sel.xpath("//link/@type")), 2)
class DeprecatedXpathSelectorTest(unittest.TestCase):
text = '<div><img src="a.jpg"><p>Hello</div>'
def test_warnings(self):
for cls in XPathSelector, HtmlXPathSelector, XmlXPathSelector:
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter('always')
hs = cls(text=self.text)
assert len(w) == 1, w
assert issubclass(w[0].category, ScrapyDeprecationWarning)
assert 'deprecated' in str(w[-1].message)
hs.select("//div").extract()
assert issubclass(w[1].category, ScrapyDeprecationWarning)
assert 'deprecated' in str(w[-1].message)
def test_xpathselector(self):
with warnings.catch_warnings(record=True):
hs = XPathSelector(text=self.text)
self.assertEqual(hs.select("//div").extract(),
[u'<div><img src="a.jpg"><p>Hello</p></div>'])
self.assertRaises(RuntimeError, hs.css, 'div')
def test_htmlxpathselector(self):
with warnings.catch_warnings(record=True):
hs = HtmlXPathSelector(text=self.text)
self.assertEqual(hs.select("//div").extract(),
[u'<div><img src="a.jpg"><p>Hello</p></div>'])
self.assertRaises(RuntimeError, hs.css, 'div')
def test_xmlxpathselector(self):
with warnings.catch_warnings(record=True):
xs = XmlXPathSelector(text=self.text)
self.assertEqual(xs.select("//div").extract(),
[u'<div><img src="a.jpg"><p>Hello</p></img></div>'])
self.assertRaises(RuntimeError, xs.css, 'div')

View File

@ -0,0 +1,153 @@
"""
Selector tests for cssselect backend
"""
from twisted.trial import unittest
from scrapy.http import HtmlResponse
from scrapy.selector.csstranslator import ScrapyHTMLTranslator
from scrapy.selector import Selector
from cssselect.parser import SelectorSyntaxError
from cssselect.xpath import ExpressionError
HTMLBODY = '''
<html>
<body>
<div>
<a id="name-anchor" name="foo"></a>
<a id="tag-anchor" rel="tag" href="http://localhost/foo">link</a>
<a id="nofollow-anchor" rel="nofollow" href="https://example.org"> link</a>
<p id="paragraph">
lorem ipsum text
<b id="p-b">hi</b> <em id="p-em">there</em>
<b id="p-b2">guy</b>
<input type="checkbox" id="checkbox-unchecked" />
<input type="checkbox" id="checkbox-disabled" disabled="" />
<input type="text" id="text-checked" checked="checked" />
<input type="hidden" />
<input type="hidden" disabled="disabled" />
<input type="checkbox" id="checkbox-checked" checked="checked" />
<input type="checkbox" id="checkbox-disabled-checked"
disabled="disabled" checked="checked" />
<fieldset id="fieldset" disabled="disabled">
<input type="checkbox" id="checkbox-fieldset-disabled" />
<input type="hidden" />
</fieldset>
</p>
<map name="dummymap">
<area shape="circle" coords="200,250,25" href="foo.html" id="area-href" />
<area shape="default" id="area-nohref" />
</map>
</div>
<div class="cool-footer" id="foobar-div" foobar="ab bc cde">
<span id="foobar-span">foo ter</span>
</div>
</body></html>
'''
class TranslatorMixinTest(unittest.TestCase):
tr_cls = ScrapyHTMLTranslator
def setUp(self):
self.tr = self.tr_cls()
self.c2x = self.tr.css_to_xpath
def test_attr_function(self):
cases = [
('::attr(name)', u'descendant-or-self::*/@name'),
('a::attr(href)', u'descendant-or-self::a/@href'),
('a ::attr(img)', u'descendant-or-self::a/descendant-or-self::*/@img'),
('a > ::attr(class)', u'descendant-or-self::a/*/@class'),
]
for css, xpath in cases:
self.assertEqual(self.c2x(css), xpath, css)
def test_attr_function_exception(self):
cases = [
('::attr(12)', ExpressionError),
('::attr(34test)', ExpressionError),
('::attr(@href)', SelectorSyntaxError),
]
for css, exc in cases:
self.assertRaises(exc, self.c2x, css)
def test_text_pseudo_element(self):
cases = [
('::text', u'descendant-or-self::text()'),
('p::text', u'descendant-or-self::p/text()'),
('p ::text', u'descendant-or-self::p/descendant-or-self::text()'),
('#id::text', u"descendant-or-self::*[@id = 'id']/text()"),
('p#id::text', u"descendant-or-self::p[@id = 'id']/text()"),
('p#id ::text', u"descendant-or-self::p[@id = 'id']/descendant-or-self::text()"),
('p#id > ::text', u"descendant-or-self::p[@id = 'id']/*/text()"),
('p#id ~ ::text', u"descendant-or-self::p[@id = 'id']/following-sibling::*/text()"),
('a[href]::text', u'descendant-or-self::a[@href]/text()'),
('a[href] ::text', u'descendant-or-self::a[@href]/descendant-or-self::text()'),
('p::text, a::text', u"descendant-or-self::p/text() | descendant-or-self::a/text()"),
]
for css, xpath in cases:
self.assertEqual(self.c2x(css), xpath, css)
def test_pseudo_function_exception(self):
cases = [
('::attribute(12)', ExpressionError),
('::text()', ExpressionError),
('::attr(@href)', SelectorSyntaxError),
]
for css, exc in cases:
self.assertRaises(exc, self.c2x, css)
def test_unknown_pseudo_element(self):
cases = [
('::text-node', ExpressionError),
]
for css, exc in cases:
self.assertRaises(exc, self.c2x, css)
def test_unknown_pseudo_class(self):
cases = [
(':text', ExpressionError),
(':attribute(name)', ExpressionError),
]
for css, exc in cases:
self.assertRaises(exc, self.c2x, css)
class CSSSelectorTest(unittest.TestCase):
sscls = Selector
def setUp(self):
self.htmlresponse = HtmlResponse('http://example.com', body=HTMLBODY)
self.sel = self.sscls(self.htmlresponse)
def x(self, *a, **kw):
return [v.strip() for v in self.sel.css(*a, **kw).extract() if v.strip()]
def test_selector_simple(self):
for x in self.sel.css('input'):
self.assertTrue(isinstance(x, self.sel.__class__), x)
self.assertEqual(self.sel.css('input').extract(),
[x.extract() for x in self.sel.css('input')])
def test_text_pseudo_element(self):
self.assertEqual(self.x('#p-b2'), [u'<b id="p-b2">guy</b>'])
self.assertEqual(self.x('#p-b2::text'), [u'guy'])
self.assertEqual(self.x('#p-b2 ::text'), [u'guy'])
self.assertEqual(self.x('#paragraph::text'), [u'lorem ipsum text'])
self.assertEqual(self.x('#paragraph ::text'), [u'lorem ipsum text', u'hi', u'there', u'guy'])
self.assertEqual(self.x('p::text'), [u'lorem ipsum text'])
self.assertEqual(self.x('p ::text'), [u'lorem ipsum text', u'hi', u'there', u'guy'])
def test_attribute_function(self):
self.assertEqual(self.x('#p-b2::attr(id)'), [u'p-b2'])
self.assertEqual(self.x('.cool-footer::attr(class)'), [u'cool-footer'])
self.assertEqual(self.x('.cool-footer ::attr(id)'), [u'foobar-div', u'foobar-span'])
self.assertEqual(self.x('map[name="dummymap"] ::attr(shape)'), [u'circle', u'default'])
def test_nested_selector(self):
self.assertEqual(self.sel.css('p').css('b::text').extract(),
[u'hi', u'guy'])
self.assertEqual(self.sel.css('div').css('area:last-child').extract(),
[u'<area shape="default" id="area-nohref">'])

View File

@ -1,98 +0,0 @@
"""
Selectors tests, specific for libxml2 backend
"""
from twisted.trial import unittest
from scrapy import optional_features
from scrapy.http import TextResponse, HtmlResponse, XmlResponse
from scrapy.selector.libxml2sel import XmlXPathSelector, HtmlXPathSelector, \
XPathSelector
from scrapy.selector.libxml2document import Libxml2Document
from scrapy.utils.test import libxml2debug
from scrapy.tests import test_selector
class Libxml2XPathSelectorTestCase(test_selector.XPathSelectorTestCase):
xs_cls = XPathSelector
hxs_cls = HtmlXPathSelector
xxs_cls = XmlXPathSelector
skip = 'libxml2' not in optional_features
@libxml2debug
def test_null_bytes(self):
hxs = HtmlXPathSelector(text='<root>la\x00la</root>')
self.assertEqual(hxs.extract(),
u'<html><body><root>lala</root></body></html>')
xxs = XmlXPathSelector(text='<root>la\x00la</root>')
self.assertEqual(xxs.extract(),
u'<root>lala</root>')
@libxml2debug
def test_unquote(self):
xmldoc = '\n'.join((
'<root>',
' lala',
' <node>',
' blabla&amp;more<!--comment-->a<b>test</b>oh',
' <![CDATA[lalalal&ppppp<b>PPPP</b>ppp&amp;la]]>',
' </node>',
' pff',
'</root>'))
xxs = XmlXPathSelector(text=xmldoc)
self.assertEqual(xxs.extract_unquoted(), u'')
self.assertEqual(xxs.select('/root').extract_unquoted(), [u''])
self.assertEqual(xxs.select('/root/text()').extract_unquoted(), [
u'\n lala\n ',
u'\n pff\n'])
self.assertEqual(xxs.select('//*').extract_unquoted(), [u'', u'', u''])
self.assertEqual(xxs.select('//text()').extract_unquoted(), [
u'\n lala\n ',
u'\n blabla&more',
u'a',
u'test',
u'oh\n ',
u'lalalal&ppppp<b>PPPP</b>ppp&amp;la',
u'\n ',
u'\n pff\n'])
class Libxml2DocumentTest(unittest.TestCase):
skip = 'libxml2' not in optional_features
@libxml2debug
def test_response_libxml2_caching(self):
r1 = HtmlResponse('http://www.example.com', body='<html><head></head><body></body></html>')
r2 = r1.copy()
doc1 = Libxml2Document(r1)
doc2 = Libxml2Document(r1)
doc3 = Libxml2Document(r2)
# make sure it's cached
assert doc1 is doc2
assert doc1.xmlDoc is doc2.xmlDoc
assert doc1 is not doc3
assert doc1.xmlDoc is not doc3.xmlDoc
# don't leave libxml2 documents in memory to avoid wrong libxml2 leak reports
del doc1, doc2, doc3
@libxml2debug
def test_null_char(self):
# make sure bodies with null char ('\x00') don't raise a TypeError exception
self.body_content = 'test problematic \x00 body'
response = TextResponse('http://example.com/catalog/product/blabla-123',
headers={'Content-Type': 'text/plain; charset=utf-8'}, body=self.body_content)
Libxml2Document(response)
if __name__ == "__main__":
unittest.main()

View File

@ -1,64 +0,0 @@
"""
Selectors tests, specific for lxml backend
"""
import unittest
from scrapy.tests import test_selector
from scrapy.http import TextResponse, HtmlResponse, XmlResponse
from scrapy.selector.lxmldocument import LxmlDocument
from scrapy.selector.lxmlsel import XmlXPathSelector, HtmlXPathSelector, XPathSelector
class LxmlXPathSelectorTestCase(test_selector.XPathSelectorTestCase):
xs_cls = XPathSelector
hxs_cls = HtmlXPathSelector
xxs_cls = XmlXPathSelector
def test_remove_namespaces(self):
xml = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-US" xmlns:media="http://search.yahoo.com/mrss/">
<link type="text/html">
<link type="application/atom+xml">
</feed>
"""
xxs = XmlXPathSelector(XmlResponse("http://example.com/feed.atom", body=xml))
self.assertEqual(len(xxs.select("//link")), 0)
xxs.remove_namespaces()
self.assertEqual(len(xxs.select("//link")), 2)
def test_remove_attributes_namespaces(self):
xml = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns:atom="http://www.w3.org/2005/Atom" xml:lang="en-US" xmlns:media="http://search.yahoo.com/mrss/">
<link atom:type="text/html">
<link atom:type="application/atom+xml">
</feed>
"""
xxs = XmlXPathSelector(XmlResponse("http://example.com/feed.atom", body=xml))
self.assertEqual(len(xxs.select("//link/@type")), 0)
xxs.remove_namespaces()
self.assertEqual(len(xxs.select("//link/@type")), 2)
class Libxml2DocumentTest(unittest.TestCase):
def test_caching(self):
r1 = HtmlResponse('http://www.example.com', body='<html><head></head><body></body></html>')
r2 = r1.copy()
doc1 = LxmlDocument(r1)
doc2 = LxmlDocument(r1)
doc3 = LxmlDocument(r2)
# make sure it's cached
assert doc1 is doc2
assert doc1 is not doc3
# don't leave documents in memory to avoid wrong libxml2 leak reports
del doc1, doc2, doc3
def test_null_char(self):
# make sure bodies with null char ('\x00') don't raise a TypeError exception
self.body_content = 'test problematic \x00 body'
response = TextResponse('http://example.com/catalog/product/blabla-123',
headers={'Content-Type': 'text/plain; charset=utf-8'}, body=self.body_content)
LxmlDocument(response)

View File

@ -0,0 +1,26 @@
import unittest
from scrapy.selector.lxmldocument import LxmlDocument
from scrapy.http import TextResponse, HtmlResponse
class LxmlDocumentTest(unittest.TestCase):
def test_caching(self):
r1 = HtmlResponse('http://www.example.com', body='<html><head></head><body></body></html>')
r2 = r1.copy()
doc1 = LxmlDocument(r1)
doc2 = LxmlDocument(r1)
doc3 = LxmlDocument(r2)
# make sure it's cached
assert doc1 is doc2
assert doc1 is not doc3
def test_null_char(self):
# make sure bodies with null char ('\x00') don't raise a TypeError exception
body = 'test problematic \x00 body'
response = TextResponse('http://example.com/catalog/product/blabla-123',
headers={'Content-Type': 'text/plain; charset=utf-8'},
body=body)
LxmlDocument(response)

View File

@ -70,10 +70,10 @@ class XMLFeedSpiderTest(BaseSpiderTest):
def parse_node(self, response, selector):
yield {
'loc': selector.select('a:loc/text()').extract(),
'updated': selector.select('b:updated/text()').extract(),
'other': selector.select('other/@value').extract(),
'custom': selector.select('other/@b:custom').extract(),
'loc': selector.xpath('a:loc/text()').extract(),
'updated': selector.xpath('b:updated/text()').extract(),
'other': selector.xpath('other/@value').extract(),
'custom': selector.xpath('other/@b:custom').extract(),
}
for iterator in ('iternodes', 'xml'):

View File

@ -28,7 +28,7 @@ class XmliterTestCase(unittest.TestCase):
response = XmlResponse(url="http://example.com", body=body)
attrs = []
for x in self.xmliter(response, 'product'):
attrs.append((x.select("@id").extract(), x.select("name/text()").extract(), x.select("./type/text()").extract()))
attrs.append((x.xpath("@id").extract(), x.xpath("name/text()").extract(), x.xpath("./type/text()").extract()))
self.assertEqual(attrs,
[(['001'], ['Name 1'], ['Type 1']), (['002'], ['Name 2'], ['Type 2'])])
@ -36,7 +36,7 @@ class XmliterTestCase(unittest.TestCase):
def test_xmliter_text(self):
body = u"""<?xml version="1.0" encoding="UTF-8"?><products><product>one</product><product>two</product></products>"""
self.assertEqual([x.select("text()").extract() for x in self.xmliter(body, 'product')],
self.assertEqual([x.xpath("text()").extract() for x in self.xmliter(body, 'product')],
[[u'one'], [u'two']])
def test_xmliter_namespaces(self):
@ -63,15 +63,15 @@ class XmliterTestCase(unittest.TestCase):
node = my_iter.next()
node.register_namespace('g', 'http://base.google.com/ns/1.0')
self.assertEqual(node.select('title/text()').extract(), ['Item 1'])
self.assertEqual(node.select('description/text()').extract(), ['This is item 1'])
self.assertEqual(node.select('link/text()').extract(), ['http://www.mydummycompany.com/items/1'])
self.assertEqual(node.select('g:image_link/text()').extract(), ['http://www.mydummycompany.com/images/item1.jpg'])
self.assertEqual(node.select('g:id/text()').extract(), ['ITEM_1'])
self.assertEqual(node.select('g:price/text()').extract(), ['400'])
self.assertEqual(node.select('image_link/text()').extract(), [])
self.assertEqual(node.select('id/text()').extract(), [])
self.assertEqual(node.select('price/text()').extract(), [])
self.assertEqual(node.xpath('title/text()').extract(), ['Item 1'])
self.assertEqual(node.xpath('description/text()').extract(), ['This is item 1'])
self.assertEqual(node.xpath('link/text()').extract(), ['http://www.mydummycompany.com/items/1'])
self.assertEqual(node.xpath('g:image_link/text()').extract(), ['http://www.mydummycompany.com/images/item1.jpg'])
self.assertEqual(node.xpath('g:id/text()').extract(), ['ITEM_1'])
self.assertEqual(node.xpath('g:price/text()').extract(), ['400'])
self.assertEqual(node.xpath('image_link/text()').extract(), [])
self.assertEqual(node.xpath('id/text()').extract(), [])
self.assertEqual(node.xpath('price/text()').extract(), [])
def test_xmliter_exception(self):
body = u"""<?xml version="1.0" encoding="UTF-8"?><products><product>one</product><product>two</product></products>"""
@ -123,9 +123,9 @@ class LxmlXmliterTestCase(XmliterTestCase):
namespace_iter = self.xmliter(response, 'image_link', 'http://base.google.com/ns/1.0')
node = namespace_iter.next()
self.assertEqual(node.select('text()').extract(), ['http://www.mydummycompany.com/images/item1.jpg'])
self.assertEqual(node.xpath('text()').extract(), ['http://www.mydummycompany.com/images/item1.jpg'])
node = namespace_iter.next()
self.assertEqual(node.select('text()').extract(), ['http://www.mydummycompany.com/images/item2.jpg'])
self.assertEqual(node.xpath('text()').extract(), ['http://www.mydummycompany.com/images/item2.jpg'])
class UtilsCsvTestCase(unittest.TestCase):

View File

@ -2,14 +2,14 @@ import re, csv
from cStringIO import StringIO
from scrapy.http import TextResponse
from scrapy.selector import XmlXPathSelector
from scrapy.selector import Selector
from scrapy import log
from scrapy.utils.python import re_rsearch, str_to_unicode
from scrapy.utils.response import body_or_str
def xmliter(obj, nodename):
"""Return a iterator of XPathSelector's over all nodes of a XML document,
"""Return a iterator of Selector's over all nodes of a XML document,
given tha name of the node to iterate. Useful for parsing XML feeds.
obj can be:
@ -29,7 +29,7 @@ def xmliter(obj, nodename):
r = re.compile(r"<%s[\s>].*?</%s>" % (nodename, nodename), re.DOTALL)
for match in r.finditer(text):
nodetext = header_start + match.group() + header_end
yield XmlXPathSelector(text=nodetext).select('//' + nodename)[0]
yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]
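# Usage sketch (assumed; mirrors the xmliter tests earlier in this diff):
# each yielded item is a Selector positioned on one matching node.
#
#   body = '<products><product>one</product><product>two</product></products>'
#   [node.xpath('text()').extract() for node in xmliter(body, 'product')]
#   # -> [[u'one'], [u'two']]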
def csviter(obj, delimiter=None, headers=None, encoding=None):

View File

@ -6,30 +6,6 @@ import os
from twisted.trial.unittest import SkipTest
def libxml2debug(testfunction):
"""Decorator for debugging libxml2 memory leaks inside a function.
We've found libxml2 memory leaks to be quite erratic; they can happen
depending on the order in which tests are run. So this decorator
enables libxml2 memory leak debugging only when the environment variable
LIBXML2_DEBUGLEAKS is set.
"""
try:
import libxml2
except ImportError:
return testfunction
def newfunc(*args, **kwargs):
libxml2.debugMemory(1)
testfunction(*args, **kwargs)
libxml2.cleanupParser()
leaked_bytes = libxml2.debugMemory(0)
assert leaked_bytes == 0, "libxml2 memory leak detected: %d bytes" % leaked_bytes
if 'LIBXML2_DEBUGLEAKS' in os.environ:
return newfunc
else:
return testfunction
def assert_aws_environ():
"""Asserts the current environment is suitable for running AWS testsi.

View File

@ -91,9 +91,9 @@ This is a working code sample that covers just the basics.
""" Pull the text label out of selected markup
:param entity: Found markup
:type entity: HtmlXPathSelector
:type entity: Selector
"""
label = ' '.join(entity.select('.//text()').extract())
label = ' '.join(entity.xpath('.//text()').extract())
label = label.encode('ascii', 'xmlcharrefreplace') if label else ''
label = label.strip('&#160;') if '&#160;' in label else label
label = label.strip(':') if ':' in label else label
@ -108,7 +108,7 @@ This is a working code sample that covers just the basics.
:return: The list of selectors
:rtype: list
"""
return self.selector.select(self.base_xpath + xpath)
return self.selector.xpath(self.base_xpath + xpath)
def parse_dl(self, xpath=u'//dl'):
""" Look for the specified definition list pattern and store all found
@ -120,7 +120,7 @@ This is a working code sample that covers just the basics.
for term in self._get_entities(xpath + '/dt'):
label = self._get_label(term)
if label and label not in self.ignore:
value = term.select('following-sibling::dd[1]//text()')
value = term.xpath('following-sibling::dd[1]//text()')
if value:
self.add_value(label, value.extract(),
MapCompose(lambda v: v.strip()))

View File

@ -122,6 +122,6 @@ try:
except ImportError:
from distutils.core import setup
else:
setup_args['install_requires'] = ['Twisted>=10.0.0', 'w3lib>=1.2', 'queuelib', 'lxml', 'pyOpenSSL']
setup_args['install_requires'] = ['Twisted>=10.0.0', 'w3lib>=1.2', 'queuelib', 'lxml', 'pyOpenSSL', 'cssselect>0.8']
setup(**setup_args)
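For existing development checkouts this adds one new dependency; it can typically be installed with pip, matching the version pin above:

pip install "cssselect>0.8"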