Mirror of https://github.com/scrapy/scrapy.git, synced 2025-02-25 11:24:24 +00:00

Merge pull request #426 from scrapy/selectors-unified

[MRG] Selectors unified API

This commit is contained in: commit 289688e39e
@@ -143,13 +143,12 @@ Finally, here's the spider code::

        rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

        def parse_torrent(self, response):
-           x = HtmlXPathSelector(response)
+           sel = Selector(response)
            torrent = TorrentItem()
            torrent['url'] = response.url
-           torrent['name'] = x.select("//h1/text()").extract()
-           torrent['description'] = x.select("//div[@id='description']").extract()
-           torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
+           torrent['name'] = sel.xpath("//h1/text()").extract()
+           torrent['description'] = sel.xpath("//div[@id='description']").extract()
+           torrent['size'] = sel.xpath("//div[@id='info-left']/p[2]/text()[2]").extract()
            return torrent

 For brevity's sake, we intentionally left out the import statements. The
@@ -183,11 +183,12 @@ Introduction to Selectors
 ^^^^^^^^^^^^^^^^^^^^^^^^^

 There are several ways to extract data from web pages. Scrapy uses a mechanism
-based on `XPath`_ expressions called :ref:`XPath selectors <topics-selectors>`.
-For more information about selectors and other extraction mechanisms see the
-:ref:`XPath selectors documentation <topics-selectors>`.
+based on `XPath`_ or `CSS`_ expressions called :ref:`Scrapy Selectors
+<topics-selectors>`. For more information about selectors and other extraction
+mechanisms see the :ref:`Selectors documentation <topics-selectors>`.

 .. _XPath: http://www.w3.org/TR/xpath
+.. _CSS: http://www.w3.org/TR/selectors

 Here are some examples of XPath expressions and their meanings:
@@ -206,27 +207,28 @@ These are just a couple of simple examples of what you can do with XPath, but
 XPath expressions are indeed much more powerful. To learn more about XPath we
 recommend `this XPath tutorial <http://www.w3schools.com/XPath/default.asp>`_.

-For working with XPaths, Scrapy provides a :class:`~scrapy.selector.XPathSelector`
-class, which comes in two flavours, :class:`~scrapy.selector.HtmlXPathSelector`
-(for HTML data) and :class:`~scrapy.selector.XmlXPathSelector` (for XML data). In
-order to use them you must instantiate the desired class with a
-:class:`~scrapy.http.Response` object.
+For working with XPaths, Scrapy provides a :class:`~scrapy.selector.Selector`
+class, which is instantiated with an :class:`~scrapy.http.HtmlResponse` or
+:class:`~scrapy.http.XmlResponse` object as its first argument.

 You can see selectors as objects that represent nodes in the document
 structure. So, the first instantiated selectors are associated to the root
 node, or the entire document.

-Selectors have three methods (click on the method to see the complete API
+Selectors have four basic methods (click on the method to see the complete API
 documentation).

-* :meth:`~scrapy.selector.XPathSelector.select`: returns a list of selectors, each of
-  them representing the nodes selected by the xpath expression given as
-  argument.
+* :meth:`~scrapy.selector.Selector.xpath`: returns a list of selectors, each of
+  them representing the nodes selected by the XPath expression given as
+  argument.

-* :meth:`~scrapy.selector.XPathSelector.extract`: returns a unicode string with
-  the data selected by the XPath selector.
+* :meth:`~scrapy.selector.Selector.css`: returns a list of selectors, each of
+  them representing the nodes selected by the CSS expression given as argument.

-* :meth:`~scrapy.selector.XPathSelector.re`: returns a list of unicode strings
-  extracted by applying the regular expression given as argument.
+* :meth:`~scrapy.selector.Selector.extract`: returns a unicode string with the
+  selected data.
+
+* :meth:`~scrapy.selector.Selector.re`: returns a list of unicode strings
+  extracted by applying the regular expression given as argument.
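To make the change concrete, here is a minimal, self-contained sketch of the four
methods listed above (the HTML snippet and variable names are invented for
illustration; ``Selector(text=...)`` is the constructor form documented later in
this pull request)::

   from scrapy.selector import Selector

   # assumed sample markup, just for illustration
   html = u'<html><body><h1>Scrapy: 0.20</h1><p class="intro">Hello</p></body></html>'
   sel = Selector(text=html)

   sel.xpath('//h1')                        # list of selectors matching the XPath
   sel.css('p.intro')                       # list of selectors matching the CSS query
   sel.xpath('//h1/text()').extract()       # [u'Scrapy: 0.20']
   sel.xpath('//h1/text()').re(r'(\w+):')   # [u'Scrapy']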
@@ -253,12 +255,11 @@ This is what the shell looks like::

    [s] Available Scrapy objects:
    [s] 2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)
-   [s] hxs <HtmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
+   [s] sel <Selector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
    [s] item Item()
    [s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    [s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    [s] spider <BaseSpider 'default' at 0x1b6c2d0>
-   [s] xxs <XmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
    [s] Useful shortcuts:
    [s] shelp() Print this help
    [s] fetch(req_or_url) Fetch a new request or URL and update shell objects
@@ -270,23 +271,25 @@ After the shell loads, you will have the response fetched in a local
 ``response`` variable, so if you type ``response.body`` you will see the body
 of the response, or you can type ``response.headers`` to see its headers.

-The shell also instantiates two selectors, one for HTML (in the ``hxs``
-variable) and one for XML (in the ``xxs`` variable) with this response. So let's
-try them::
+The shell also pre-instantiates a selector for this response in the ``sel``
+variable; the selector automatically chooses the best parsing rules (XML vs
+HTML) based on the response's type.
+
+So let's try it::

-   In [1]: hxs.select('//title')
-   Out[1]: [<HtmlXPathSelector (title) xpath=//title>]
-
-   In [2]: hxs.select('//title').extract()
+   In [1]: sel.xpath('//title')
+   Out[1]: [<Selector (title) xpath=//title>]
+
+   In [2]: sel.xpath('//title').extract()
    Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

-   In [3]: hxs.select('//title/text()')
-   Out[3]: [<HtmlXPathSelector (text) xpath=//title/text()>]
+   In [3]: sel.xpath('//title/text()')
+   Out[3]: [<Selector (text) xpath=//title/text()>]

-   In [4]: hxs.select('//title/text()').extract()
+   In [4]: sel.xpath('//title/text()').extract()
    Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

-   In [5]: hxs.select('//title/text()').re('(\w+):')
+   In [5]: sel.xpath('//title/text()').re('(\w+):')
    Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']

 Extracting the data
@@ -306,29 +309,29 @@ is inside a ``<ul>`` element, in fact the *second* ``<ul>`` element.
 So we can select each ``<li>`` element belonging to the sites list with this
 code::

-   hxs.select('//ul/li')
+   sel.xpath('//ul/li')

 And from them, the sites descriptions::

-   hxs.select('//ul/li/text()').extract()
+   sel.xpath('//ul/li/text()').extract()

 The sites titles::

-   hxs.select('//ul/li/a/text()').extract()
+   sel.xpath('//ul/li/a/text()').extract()

 And the sites links::

-   hxs.select('//ul/li/a/@href').extract()
+   sel.xpath('//ul/li/a/@href').extract()

-As we said before, each ``select()`` call returns a list of selectors, so we can
-concatenate further ``select()`` calls to dig deeper into a node. We are going to use
+As we said before, each ``.xpath()`` call returns a list of selectors, so we can
+concatenate further ``.xpath()`` calls to dig deeper into a node. We are going to use
 that property here, so::

-   sites = hxs.select('//ul/li')
+   sites = sel.xpath('//ul/li')
    for site in sites:
-       title = site.select('a/text()').extract()
-       link = site.select('a/@href').extract()
-       desc = site.select('text()').extract()
+       title = site.xpath('a/text()').extract()
+       link = site.xpath('a/@href').extract()
+       desc = site.xpath('text()').extract()
        print title, link, desc

 .. note::
@@ -341,7 +344,7 @@ that property here, so::
 Let's add this code to our spider::

    from scrapy.spider import BaseSpider
-   from scrapy.selector import HtmlXPathSelector
+   from scrapy.selector import Selector

    class DmozSpider(BaseSpider):
        name = "dmoz"
@@ -352,12 +355,12 @@ Let's add this code to our spider::
        ]

        def parse(self, response):
-           hxs = HtmlXPathSelector(response)
-           sites = hxs.select('//ul/li')
+           sel = Selector(response)
+           sites = sel.xpath('//ul/li')
            for site in sites:
-               title = site.select('a/text()').extract()
-               link = site.select('a/@href').extract()
-               desc = site.select('text()').extract()
+               title = site.xpath('a/text()').extract()
+               link = site.xpath('a/@href').extract()
+               desc = site.xpath('text()').extract()
                print title, link, desc

 Now try crawling the dmoz.org domain again and you'll see sites being printed
@@ -382,7 +385,7 @@ Spiders are expected to return their scraped data inside
 scraped so far, the final code for our Spider would be like this::

    from scrapy.spider import BaseSpider
-   from scrapy.selector import HtmlXPathSelector
+   from scrapy.selector import Selector

    from tutorial.items import DmozItem

@@ -395,14 +398,14 @@ scraped so far, the final code for our Spider would be like this::
        ]

        def parse(self, response):
-           hxs = HtmlXPathSelector(response)
-           sites = hxs.select('//ul/li')
+           sel = Selector(response)
+           sites = sel.xpath('//ul/li')
            items = []
            for site in sites:
                item = DmozItem()
-               item['title'] = site.select('a/text()').extract()
-               item['link'] = site.select('a/@href').extract()
-               item['desc'] = site.select('text()').extract()
+               item['title'] = site.xpath('a/text()').extract()
+               item['link'] = site.xpath('a/@href').extract()
+               item['desc'] = site.xpath('text()').extract()
                items.append(item)
            return items
@@ -9,6 +9,9 @@ Release notes
 - Request/Response url/body attributes are now immutable (modifying them had
   been deprecated for a long time)
 - :setting:`ITEM_PIPELINES` is now defined as a dict (instead of a list)
+- Dropped libxml2 selectors backend
+- Dropped support for multiple selectors backends, sticking to lxml only
+- Selector Unified API with support for CSS expressions (:issue:`395` and :issue:`426`)
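As a hedged illustration of the ``ITEM_PIPELINES`` change (the pipeline module
paths below are made up), the setting now maps pipeline classes to their order
values instead of listing them::

   # settings.py -- hypothetical project paths; the values define the pipeline order
   ITEM_PIPELINES = {
       'myproject.pipelines.PricePipeline': 300,
       'myproject.pipelines.JsonWriterPipeline': 800,
   }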

0.18.4 (released 2013-10-10)
----------------------------
@@ -248,7 +248,6 @@ Memory debugger extension
 An extension for debugging memory usage. It collects information about:

 * objects uncollected by the Python garbage collector
-* libxml2 memory leaks
 * objects left alive that shouldn't. For more info, see :ref:`topics-leaks-trackrefs`

 To enable this extension, turn on the :setting:`MEMDEBUG_ENABLED` setting. The
@@ -107,7 +107,7 @@ Now we're going to write the code to extract data from those pages.

 With the help of Firebug, we'll take a look at some page containing links to
 websites (say http://directory.google.com/Top/Arts/Awards/) and find out how we can
-extract those links using :ref:`XPath selectors <topics-selectors>`. We'll also
+extract those links using :ref:`Selectors <topics-selectors>`. We'll also
 use the :ref:`Scrapy shell <topics-shell>` to test those XPaths and make sure
 they work as we expect.
@@ -146,16 +146,16 @@ that have that grey colour of the links,
 Finally, we can write our ``parse_category()`` method::

    def parse_category(self, response):
-       hxs = HtmlXPathSelector(response)
+       sel = Selector(response)

        # The path to website links in directory page
-       links = hxs.select('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')
+       links = sel.xpath('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')

        for link in links:
            item = DirectoryItem()
-           item['name'] = link.select('a/text()').extract()
-           item['url'] = link.select('a/@href').extract()
-           item['description'] = link.select('font[2]/text()').extract()
+           item['name'] = link.xpath('a/text()').extract()
+           item['url'] = link.xpath('a/@href').extract()
+           item['description'] = link.xpath('font[2]/text()').extract()
            yield item
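For completeness, a minimal sketch of the ``DirectoryItem`` assumed by
``parse_category()`` above (the field names are inferred from the fields
populated there, not taken from the project code)::

   from scrapy.item import Item, Field

   class DirectoryItem(Item):
       name = Field()
       url = Field()
       description = Field()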
@@ -67,7 +67,7 @@ alias to the :func:`~scrapy.utils.trackref.print_live_refs` function::

    ExampleSpider 1 oldest: 15s ago
    HtmlResponse 10 oldest: 1s ago
-   XPathSelector 2 oldest: 0s ago
+   Selector 2 oldest: 0s ago
    FormRequest 878 oldest: 7s ago

 As you can see, that report also shows the "age" of the oldest object in each
@@ -87,9 +87,8 @@ subclasses):
 * ``scrapy.http.Request``
 * ``scrapy.http.Response``
 * ``scrapy.item.Item``
-* ``scrapy.selector.XPathSelector``
+* ``scrapy.selector.Selector``
 * ``scrapy.spider.BaseSpider``
-* ``scrapy.selector.document.Libxml2Document``

 A real example
 --------------
@@ -117,7 +116,7 @@ references::

    SomenastySpider 1 oldest: 15s ago
    HtmlResponse 3890 oldest: 265s ago
-   XPathSelector 2 oldest: 0s ago
+   Selector 2 oldest: 0s ago
    Request 3878 oldest: 250s ago

 The fact that there are so many live responses (and that they're so old) is
@@ -31,7 +31,7 @@ using the Item class specified in the :attr:`ItemLoader.default_item_class`
 attribute.

 Then, you start collecting values into the Item Loader, typically using
-:ref:`XPath Selectors <topics-selectors>`. You can add more than one value to
+:ref:`Selectors <topics-selectors>`. You can add more than one value to
 the same item field; the Item Loader will know how to "join" those values later
 using a proper processing function.
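A minimal sketch of collecting values this way with an :class:`XPathItemLoader`
(the item class and the XPaths are hypothetical, for illustration only)::

   from scrapy.contrib.loader import XPathItemLoader
   from myproject.items import ProductItem  # hypothetical item

   def parse(self, response):
       l = XPathItemLoader(item=ProductItem(), response=response)
       l.add_xpath('name', '//div[@class="product_name"]/text()')
       l.add_xpath('price', '//p[@id="price"]/text()')
       return l.load_item()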
@@ -352,14 +352,14 @@ ItemLoader objects

 The :class:`XPathItemLoader` class extends the :class:`ItemLoader` class
 providing more convenient mechanisms for extracting data from web pages
-using :ref:`XPath selectors <topics-selectors>`.
+using :ref:`selectors <topics-selectors>`.

 :class:`XPathItemLoader` objects accept two additional parameters in
 their constructors:

 :param selector: The selector to extract data from, when using the
    :meth:`add_xpath` or :meth:`replace_xpath` method.
-:type selector: :class:`~scrapy.selector.XPathSelector` object
+:type selector: :class:`~scrapy.selector.Selector` object

 :param response: The response used to construct the selector using the
    :attr:`default_selector_class`, unless the selector argument is given,
@@ -418,7 +418,7 @@ ItemLoader objects

 .. attribute:: selector

-   The :class:`~scrapy.selector.XPathSelector` object to extract data from.
+   The :class:`~scrapy.selector.Selector` object to extract data from.
    It's either the selector given in the constructor or one created from
    the response given in the constructor using the
    :attr:`default_selector_class`. This attribute is meant to be
@@ -592,7 +592,7 @@ Here is a list of all built-in processors:
 work with single values (instead of iterables). For this reason the
 :class:`MapCompose` processor is typically used as input processor, since
 data is often extracted using the
-:meth:`~scrapy.selector.XPathSelector.extract` method of :ref:`selectors
+:meth:`~scrapy.selector.Selector.extract` method of :ref:`selectors
 <topics-selectors>`, which returns a list of unicode strings.

 The example below should clarify how it works::
@@ -6,39 +6,43 @@ Selectors

 When you're scraping web pages, the most common task you need to perform is
 to extract data from the HTML source. There are several libraries available to
 achieve this:

 * `BeautifulSoup`_ is a very popular screen scraping library among Python
-  programmers which constructs a Python object based on the
-  structure of the HTML code and also deals with bad markup reasonably well,
-  but it has one drawback: it's slow.
+  programmers which constructs a Python object based on the structure of the
+  HTML code and also deals with bad markup reasonably well, but it has one
+  drawback: it's slow.

 * `lxml`_ is an XML parsing library (which also parses HTML) with a pythonic
   API based on `ElementTree`_ (which is not part of the Python standard
   library).

-Scrapy comes with its own mechanism for extracting data. They're called XPath
-selectors (or just "selectors", for short) because they "select" certain parts
-of the HTML document specified by `XPath`_ expressions.
+Scrapy comes with its own mechanism for extracting data. They're called
+selectors because they "select" certain parts of the HTML document specified
+either by `XPath`_ or `CSS`_ expressions.

-`XPath`_ is a language for selecting nodes in XML documents, which can also be used with HTML.
+`XPath`_ is a language for selecting nodes in XML documents, which can also be
+used with HTML. `CSS`_ is a language for applying styles to HTML documents. It
+defines selectors to associate those styles with specific HTML elements.

-Both `lxml`_ and Scrapy Selectors are built over the `libxml2`_ library, which
-means they're very similar in speed and parsing accuracy.
+Scrapy selectors are built over the `lxml`_ library, which means they're very
+similar in speed and parsing accuracy.

 This page explains how selectors work and describes their API which is very
 small and simple, unlike the `lxml`_ API which is much bigger because the
 `lxml`_ library can be used for many other tasks, besides selecting markup
 documents.

-For a complete reference of the selectors API see the :ref:`XPath selector
-reference <topics-selectors-ref>`.
+For a complete reference of the selectors API see the
+:ref:`Selector reference <topics-selectors-ref>`.

 .. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
 .. _lxml: http://codespeak.net/lxml/
 .. _ElementTree: http://docs.python.org/library/xml.etree.elementtree.html
-.. _libxml2: http://xmlsoft.org/
+.. _cssselect: https://pypi.python.org/pypi/cssselect/
 .. _XPath: http://www.w3.org/TR/xpath
+.. _CSS: http://www.w3.org/TR/selectors
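As a quick illustration of the XPath/CSS duality described above, the following
two queries select the same attribute, one through XPath and one through CSS (a
sketch with assumed sample markup)::

   from scrapy.selector import Selector

   sel = Selector(text=u'<div id="images"><a href="image1.html">Name: My image 1</a></div>')

   sel.xpath('//div[@id="images"]/a/@href').extract()   # [u'image1.html']
   sel.css('div#images a::attr(href)').extract()        # [u'image1.html']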

Using selectors
===============
@@ -46,24 +50,29 @@ Using selectors
 Constructing selectors
 ----------------------

-There are two types of selectors bundled with Scrapy. Those are:
-
-* :class:`~scrapy.selector.HtmlXPathSelector` - for working with HTML documents
-
-* :class:`~scrapy.selector.XmlXPathSelector` - for working with XML documents
-
 .. highlight:: python

-Both share the same selector API, and are constructed with a Response object as
-their first parameter. This is the Response they're going to be "selecting".
-
-Example::
+Scrapy selectors are instances of the :class:`~scrapy.selector.Selector` class,
+constructed by passing a `Response` object as the first argument; the response's
+body is what they're going to be "selecting"::

-   hxs = HtmlXPathSelector(response) # a HTML selector
-   xxs = XmlXPathSelector(response)  # a XML selector
+   from scrapy.spider import BaseSpider
+   from scrapy.selector import Selector
+
+   class MySpider(BaseSpider):
+       # ...
+       def parse(self, response):
+           sel = Selector(response)
+           # Using XPath query
+           print sel.xpath('//p')
+           # Using CSS query
+           print sel.css('p')
+           # Nesting queries
+           print sel.xpath('//div[@foo="bar"]').css('span#bold')

-Using selectors with XPaths
----------------------------
+Using selectors
+---------------

 To explain how to use the selectors we'll use the `Scrapy shell` (which
 provides interactive testing) and an example page located in the Scrapy
|
||||
|
||||
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
|
||||
|
||||
Then, after the shell loads, you'll have some selectors already instantiated and
|
||||
ready to use.
|
||||
Then, after the shell loads, you'll have a selector already instantiated and
|
||||
ready to use in ``sel`` shell variable.
|
||||
|
||||
Since we're dealing with HTML, we'll be using the
|
||||
:class:`~scrapy.selector.HtmlXPathSelector` object which is found, by default, in
|
||||
the ``hxs`` shell variable.
|
||||
Since we're dealing with HTML, the selector will automatically use an HTML parser.
|
||||
|
||||
.. highlight:: python
|
||||
|
||||
So, by looking at the :ref:`HTML code <topics-selectors-htmlcode>` of that page,
|
||||
let's construct an XPath (using an HTML selector) for selecting the text inside
|
||||
the title tag::
|
||||
So, by looking at the :ref:`HTML code <topics-selectors-htmlcode>` of that
|
||||
page, let's construct an XPath (using an HTML selector) for selecting the text
|
||||
inside the title tag::
|
||||
|
||||
>>> hxs.select('//title/text()')
|
||||
[<HtmlXPathSelector (text) xpath=//title/text()>]
|
||||
>>> sel.xpath('//title/text()')
|
||||
[<Selector (text) xpath=//title/text()>]
|
||||
|
||||
As you can see, the select() method returns an XPathSelectorList, which is a list of
|
||||
new selectors. This API can be used quickly for extracting nested data.
|
||||
As you can see, the ``.xpath()`` method returns an
|
||||
:class:`~scrapy.selector.SelectorList` instance, which is a list of new
|
||||
selectors. This API can be used quickly for extracting nested data.
|
||||
|
||||
To actually extract the textual data, you must call the selector ``extract()``
|
||||
To actually extract the textual data, you must call the selector ``.extract()``
|
||||
method, as follows::
|
||||
|
||||
>>> hxs.select('//title/text()').extract()
|
||||
>>> sel.xpath('//title/text()').extract()
|
||||
[u'Example website']
|
||||
|
||||
Notice that CSS selectors can select text or attribute nodes using CSS3
|
||||
pseudo-elements::
|
||||
|
||||
>>> sel.css('title::text').extract()
|
||||
[u'Example website']
|
||||
|
||||
Now we're going to get the base URL and some image links::
|
||||
|
||||
>>> hxs.select('//base/@href').extract()
|
||||
>>> sel.xpath('//base/@href').extract()
|
||||
[u'http://example.com/']
|
||||
|
||||
>>> hxs.select('//a[contains(@href, "image")]/@href').extract()
|
||||
>>> sel.css('base::attr(href)').extract()
|
||||
[u'http://example.com/']
|
||||
|
||||
>>> sel.xpath('//a[contains(@href, "image")]/@href').extract()
|
||||
[u'image1.html',
|
||||
u'image2.html',
|
||||
u'image3.html',
|
||||
u'image4.html',
|
||||
u'image5.html']
|
||||
|
||||
>>> hxs.select('//a[contains(@href, "image")]/img/@src').extract()
|
||||
>>> sel.css('a[href*=image]::attr(href)').extract()
|
||||
[u'image1.html',
|
||||
u'image2.html',
|
||||
u'image3.html',
|
||||
u'image4.html',
|
||||
u'image5.html']
|
||||
|
||||
>>> sel.xpath('//a[contains(@href, "image")]/img/@src').extract()
|
||||
[u'image1_thumb.jpg',
|
||||
u'image2_thumb.jpg',
|
||||
u'image3_thumb.jpg',
|
||||
u'image4_thumb.jpg',
|
||||
u'image5_thumb.jpg']
|
||||
|
||||
|
||||
Using selectors with regular expressions
|
||||
----------------------------------------
|
||||
|
||||
Selectors also have a ``re()`` method for extracting data using regular
|
||||
expressions. However, unlike using the ``select()`` method, the ``re()`` method
|
||||
does not return a list of :class:`~scrapy.selector.XPathSelector` objects, so you
|
||||
can't construct nested ``.re()`` calls.
|
||||
|
||||
Here's an example used to extract images names from the :ref:`HTML code
|
||||
<topics-selectors-htmlcode>` above::
|
||||
|
||||
>>> hxs.select('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
|
||||
[u'My image 1',
|
||||
u'My image 2',
|
||||
u'My image 3',
|
||||
u'My image 4',
|
||||
u'My image 5']
|
||||
>>> sel.css('a[href*=image] img::attr(src)').extract()
|
||||
[u'image1_thumb.jpg',
|
||||
u'image2_thumb.jpg',
|
||||
u'image3_thumb.jpg',
|
||||
u'image4_thumb.jpg',
|
||||
u'image5_thumb.jpg']
|
||||
|
||||
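Since ``.xpath()`` and ``.css()`` both return a :class:`SelectorList`, the two
query styles can also be mixed freely; a small sketch (sample markup assumed)::

   from scrapy.selector import Selector

   sel = Selector(text=u'<a href="image1.html"><img src="image1_thumb.jpg"></a>')

   # select the <a> with CSS, then drill into it with XPath
   sel.css('a[href*=image]').xpath('img/@src').extract()   # [u'image1_thumb.jpg']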
 .. _topics-selectors-nesting-selectors:

 Nesting selectors
 -----------------

-The ``select()`` selector method returns a list of selectors, so you can call the
-``select()`` for those selectors too. Here's an example::
+The selection methods (``.xpath()`` or ``.css()``) return a list of selectors
+of the same type, so you can call the selection methods for those selectors
+too. Here's an example::

-   >>> links = hxs.select('//a[contains(@href, "image")]')
+   >>> links = sel.xpath('//a[contains(@href, "image")]')
    >>> links.extract()
    [u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
     u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
@@ -164,7 +177,7 @@ The ``select()`` selector method returns a list of selectors, so you can call th
     u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

    >>> for index, link in enumerate(links):
-           args = (index, link.select('@href').extract(), link.select('img/@src').extract())
+           args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
            print 'Link number %d points to url %s and image %s' % args

    Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
@@ -173,35 +186,53 @@ The ``select()`` selector method returns a list of selectors, so you can call th
    Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
    Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']

+Using selectors with regular expressions
+----------------------------------------
+
+:class:`~scrapy.selector.Selector` also has a ``.re()`` method for extracting
+data using regular expressions. However, unlike the ``.xpath()`` or
+``.css()`` methods, ``.re()`` returns a list of unicode strings, so you
+can't construct nested ``.re()`` calls.
+
+Here's an example used to extract image names from the :ref:`HTML code
+<topics-selectors-htmlcode>` above::
+
+   >>> sel.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
+   [u'My image 1',
+    u'My image 2',
+    u'My image 3',
+    u'My image 4',
+    u'My image 5']
+
 .. _topics-selectors-relative-xpaths:

 Working with relative XPaths
 ----------------------------

-Keep in mind that if you are nesting XPathSelectors and use an XPath that
-starts with ``/``, that XPath will be absolute to the document and not relative
-to the ``XPathSelector`` you're calling it from.
+Keep in mind that if you are nesting selectors and use an XPath that starts
+with ``/``, that XPath will be absolute to the document and not relative to the
+``Selector`` you're calling it from.

 For example, suppose you want to extract all ``<p>`` elements inside ``<div>``
 elements. First, you would get all ``<div>`` elements::

-   >>> divs = hxs.select('//div')
+   >>> divs = sel.xpath('//div')

 At first, you may be tempted to use the following approach, which is wrong, as
 it actually extracts all ``<p>`` elements from the document, not only those
 inside ``<div>`` elements::

-   >>> for p in divs.select('//p')  # this is wrong - gets all <p> from the whole document
+   >>> for p in divs.xpath('//p')  # this is wrong - gets all <p> from the whole document
    >>> print p.extract()

 This is the proper way to do it (note the dot prefixing the ``.//p`` XPath)::

-   >>> for p in divs.select('.//p')  # extracts all <p> inside
+   >>> for p in divs.xpath('.//p')  # extracts all <p> inside
    >>> print p.extract()

 Another common case would be to extract all direct ``<p>`` children::

-   >>> for p in divs.select('p')
+   >>> for p in divs.xpath('p')
    >>> print p.extract()

 For more details about relative XPaths see the `Location Paths`_ section in the
@@ -212,175 +243,170 @@ XPath specification.

 .. _topics-selectors-ref:

-Built-in XPath Selectors reference
-==================================
+Built-in Selectors reference
+============================

 .. module:: scrapy.selector
-   :synopsis: XPath selectors classes
+   :synopsis: Selector class

-There are two types of selectors bundled with Scrapy:
-:class:`HtmlXPathSelector` and :class:`XmlXPathSelector`. Both of them
-implement the same :class:`XPathSelector` interface. The only different is that
-one is used to process HTML data and the other XML data.
+.. class:: Selector(response=None, text=None, type=None)

-XPathSelector objects
----------------------
+   An instance of :class:`Selector` is a wrapper over response to select
+   certain parts of its content.

-.. class:: XPathSelector(response)
+   ``response`` is a :class:`~scrapy.http.HtmlResponse` or
+   :class:`~scrapy.http.XmlResponse` object that will be used for selecting and
+   extracting data.

-   A :class:`XPathSelector` object is a wrapper over response to select
-   certain parts of its content.
+   ``text`` is a unicode string or utf-8 encoded text for cases when a
+   ``response`` isn't available. Using ``text`` and ``response`` together is
+   undefined behavior.

-   ``response`` is a :class:`~scrapy.http.Response` object that will be used
-   for selecting and extracting data
+   ``type`` defines the selector type; it can be ``"html"``, ``"xml"`` or ``None`` (default).

-   .. method:: select(xpath)
+   If ``type`` is ``None``, the selector automatically chooses the best type
+   based on the ``response`` type (see below), or defaults to ``"html"`` in case
+   it is used together with ``text``.

-       Apply the given XPath relative to this XPathSelector and return a list
-       of :class:`XPathSelector` objects (ie. a :class:`XPathSelectorList`) with
-       the result.
+   If ``type`` is ``None`` and a ``response`` is passed, the selector type is
+   inferred from the response type as follows:

-       ``xpath`` is a string containing the XPath to apply
+   * ``"html"`` for :class:`~scrapy.http.HtmlResponse` type
+   * ``"xml"`` for :class:`~scrapy.http.XmlResponse` type
+   * ``"html"`` for anything else

-   .. method:: re(regex)
+   Otherwise, if ``type`` is set, the selector type will be forced and no
+   detection will occur.

-       Apply the given regex and return a list of unicode strings with the
-       matches.
+   .. method:: xpath(query)

-       ``regex`` can be either a compiled regular expression or a string which
-       will be compiled to a regular expression using ``re.compile(regex)``
+       Find nodes matching the xpath ``query`` and return the result as a
+       :class:`SelectorList` instance with all elements flattened. List
+       elements implement the :class:`Selector` interface too.
+
+       ``query`` is a string containing the XPath query to apply.
+
+   .. method:: css(query)
+
+       Apply the given CSS selector and return a :class:`SelectorList` instance.
+
+       ``query`` is a string containing the CSS selector to apply.
+
+       In the background, CSS queries are translated into XPath queries using
+       the `cssselect`_ library and run through the ``.xpath()`` method.
+
+   .. method:: extract()
+
+       Serialize and return the matched nodes as a list of unicode strings.
+       Percent encoded content is unquoted.
+
+   .. method:: re(regex)
+
+       Apply the given regex and return a list of unicode strings with the
+       matches.
+
+       ``regex`` can be either a compiled regular expression or a string which
+       will be compiled to a regular expression using ``re.compile(regex)``
+
+   .. method:: register_namespace(prefix, uri)
+
+       Register the given namespace to be used in this :class:`Selector`.
+       Without registering namespaces you can't select or extract data from
+       non-standard namespaces. See examples below.
+
+   .. method:: remove_namespaces()
+
+       Remove all namespaces, allowing to traverse the document using
+       namespace-less xpaths. See example below.
+
+   .. method:: __nonzero__()
+
+       Returns ``True`` if there is any real content selected or ``False``
+       otherwise. In other words, the boolean value of a :class:`Selector` is
+       given by the contents it selects.
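A short sketch of the ``type`` handling described above (the URLs and bodies are
placeholders, not part of the original documentation)::

   from scrapy.selector import Selector
   from scrapy.http import HtmlResponse, XmlResponse

   html = HtmlResponse(url='http://example.com', body='<html><body><span>good</span></body></html>')
   xml = XmlResponse(url='http://example.com', body='<root><item>good</item></root>')

   Selector(html).xpath('//span/text()').extract()    # type inferred as "html"
   Selector(xml).xpath('//item/text()').extract()     # type inferred as "xml"

   # with text only, the default is "html" unless the type is forced
   Selector(text='<root><item>good</item></root>', type='xml').xpath('//item/text()').extract()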
+SelectorList objects
+--------------------
+
+.. class:: SelectorList
+
+   The :class:`SelectorList` class is a subclass of the builtin ``list``
+   class, which provides a few additional methods.
+
+   .. method:: xpath(query)
+
+       Call the ``.xpath()`` method for each element in this list and return
+       their results flattened as another :class:`SelectorList`.
+
+       ``query`` is the same argument as the one in :meth:`Selector.xpath`
+
+   .. method:: css(query)
+
+       Call the ``.css()`` method for each element in this list and return
+       their results flattened as another :class:`SelectorList`.
+
+       ``query`` is the same argument as the one in :meth:`Selector.css`
+
+   .. method:: extract()

-       Return a unicode string with the content of this :class:`XPathSelector`
-       object.
+       Call the ``.extract()`` method for each element in this list and return
+       their results flattened, as a list of unicode strings.

-   .. method:: register_namespace(prefix, uri)
+   .. method:: re()

-       Register the given namespace to be used in this :class:`XPathSelector`.
-       Without registering namespaces you can't select or extract data from
-       non-standard namespaces. See examples below.
-
-   .. method:: remove_namespaces()
-
-       Remove all namespaces, allowing to traverse the document using
-       namespace-less xpaths. See example below.
+       Call the ``.re()`` method for each element in this list and return
+       their results flattened, as a list of unicode strings.

    .. method:: __nonzero__()

-       Returns ``True`` if there is any real content selected by this
-       :class:`XPathSelector` or ``False`` otherwise. In other words, the boolean
-       value of an XPathSelector is given by the contents it selects.
+       Returns ``True`` if the list is not empty, ``False`` otherwise.
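A quick sketch of how the list methods flatten their results (the markup is an
assumed sample)::

   from scrapy.selector import Selector

   sel = Selector(text=u'<ul><li>one</li><li>two</li></ul>')
   lis = sel.xpath('//li')            # a SelectorList with two Selector elements

   lis.xpath('text()').extract()      # .xpath() applied to every element: [u'one', u'two']
   lis.re(r'(o\w+)')                  # regex applied to every element: [u'one']
   bool(lis)                          # True, because the list is not empty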
-XPathSelectorList objects
--------------------------
-
-.. class:: XPathSelectorList
-
-   The :class:`XPathSelectorList` class is subclass of the builtin ``list``
-   class, which provides a few additional methods.
-
-   .. method:: select(xpath)
-
-       Call the :meth:`XPathSelector.select` method for all :class:`XPathSelector`
-       objects in this list and return their results flattened, as a new
-       :class:`XPathSelectorList`.
-
-       ``xpath`` is the same argument as the one in :meth:`XPathSelector.select`
-
-   .. method:: re(regex)
-
-       Call the :meth:`XPathSelector.re` method for all :class:`XPathSelector`
-       objects in this list and return their results flattened, as a list of
-       unicode strings.
-
-       ``regex`` is the same argument as the one in :meth:`XPathSelector.re`
-
-   .. method:: extract()
-
-       Call the :meth:`XPathSelector.extract` method for all :class:`XPathSelector`
-       objects in this list and return their results flattened, as a list of
-       unicode strings.
-
-   .. method:: extract_unquoted()
-
-       Call the :meth:`XPathSelector.extract_unoquoted` method for all
-       :class:`XPathSelector` objects in this list and return their results
-       flattened, as a list of unicode strings. This method should not be applied
-       to all kinds of XPathSelectors. For more info see
-       :meth:`XPathSelector.extract_unoquoted`.
-
-HtmlXPathSelector objects
--------------------------
-
-.. class:: HtmlXPathSelector(response)
-
-   A subclass of :class:`XPathSelector` for working with HTML content. It uses
-   the `libxml2`_ HTML parser. See the :class:`XPathSelector` API for more info.
-
-.. _libxml2: http://xmlsoft.org/
-
-HtmlXPathSelector examples
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Here's a couple of :class:`HtmlXPathSelector` examples to illustrate several
-concepts. In all cases, we assume there is already an :class:`HtmlPathSelector`
-instantiated with a :class:`~scrapy.http.Response` object like this::
+Selector examples on HTML response
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here's a couple of :class:`Selector` examples to illustrate several concepts.
+In all cases, we assume there is already a :class:`Selector` instantiated with
+a :class:`~scrapy.http.HtmlResponse` object like this::

-   x = HtmlXPathSelector(html_response)
+   x = Selector(html_response)

 1. Select all ``<h1>`` elements from a HTML response body, returning a list of
-   :class:`XPathSelector` objects (ie. a :class:`XPathSelectorList` object)::
+   :class:`Selector` objects (ie. a :class:`SelectorList` object)::

-      x.select("//h1")
+      x.xpath("//h1")

 2. Extract the text of all ``<h1>`` elements from a HTML response body,
    returning a list of unicode strings::

-      x.select("//h1").extract()         # this includes the h1 tag
-      x.select("//h1/text()").extract()  # this excludes the h1 tag
+      x.xpath("//h1").extract()         # this includes the h1 tag
+      x.xpath("//h1/text()").extract()  # this excludes the h1 tag

 3. Iterate over all ``<p>`` tags and print their class attribute::

-      for node in x.select("//p"):
-          print node.select("@href")
+      for node in x.xpath("//p"):
+          print node.xpath("@class").extract()

-4. Extract textual data from all ``<p>`` tags without entities, as a list of
-   unicode strings::
-
-      x.select("//p/text()").extract_unquoted()
-
-      # the following line is wrong. extract_unquoted() should only be used
-      # with textual XPathSelectors
-      x.select("//p").extract_unquoted() # it may work but output is unpredictable
-
-XmlXPathSelector objects
-------------------------
-
-.. class:: XmlXPathSelector(response)
-
-   A subclass of :class:`XPathSelector` for working with XML content. It uses
-   the `libxml2`_ XML parser. See the :class:`XPathSelector` API for more info.
-
-XmlXPathSelector examples
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Here's a couple of :class:`XmlXPathSelector` examples to illustrate several
-concepts. In both cases we assume there is already an :class:`XmlXPathSelector`
-instantiated with a :class:`~scrapy.http.Response` object like this::
-
-   x = XmlXPathSelector(xml_response)
-
-1. Select all ``<product>`` elements from a XML response body, returning a list of
-   :class:`XPathSelector` objects (ie. a :class:`XPathSelectorList` object)::
+Selector examples on XML response
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here's a couple of examples to illustrate several concepts. In both cases we
+assume there is already a :class:`Selector` instantiated with an
+:class:`~scrapy.http.XmlResponse` object like this::
+
+   x = Selector(xml_response)
+
+1. Select all ``<product>`` elements from a XML response body, returning a list
+   of :class:`Selector` objects (ie. a :class:`SelectorList` object)::

-      x.select("//product")
+      x.xpath("//product")

 2. Extract all prices from a `Google Base XML feed`_ which requires registering
    a namespace::

       x.register_namespace("g", "http://base.google.com/ns/1.0")
-      x.select("//g:price").extract()
+      x.xpath("//g:price").extract()

 .. _removing-namespaces:
@@ -390,7 +416,7 @@ Removing namespaces
 When dealing with scraping projects, it is often quite convenient to get rid of
 namespaces altogether and just work with element names, to write more
 simple/convenient XPaths. You can use the
-:meth:`XPathSelector.remove_namespaces` method for that.
+:meth:`Selector.remove_namespaces` method for that.

 Let's show an example that illustrates this with the Github blog atom feed.

@@ -401,27 +427,27 @@ First, we open the shell with the url we want to scrape::
 Once in the shell we can try selecting all ``<link>`` objects and see that it
 doesn't work (because the Atom XML namespace is obfuscating those nodes)::

-   >>> xxs.select("//link")
+   >>> xxs.xpath("//link")
    []

-But once we call the :meth:`XPathSelector.remove_namespaces` method, all
+But once we call the :meth:`Selector.remove_namespaces` method, all
 nodes can be accessed directly by their names::

    >>> xxs.remove_namespaces()
-   >>> xxs.select("//link")
-   [<XmlXPathSelector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
-    <XmlXPathSelector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
+   >>> xxs.xpath("//link")
+   [<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
+    <Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
    ...

 If you wonder why the namespace removal procedure is not always called by
 default, instead of having to call it manually, it is because of two reasons
 which, in order of relevance, are:

-1. removing namespaces requires to iterate and modify all nodes in the
+1. Removing namespaces requires iterating over and modifying all nodes in the
    document, which is a reasonably expensive operation to perform for all
    documents crawled by Scrapy

-2. there could be some cases where using namespaces is actually required, in
+2. There could be some cases where using namespaces is actually required, in
    case some element names clash between namespaces. These cases are very rare
    though.
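For reference, the namespace handling above can also be reproduced outside the
shell; a minimal sketch with an inline Atom-like document (the feed content is
made up for illustration)::

   from scrapy.selector import Selector

   doc = u'<feed xmlns="http://www.w3.org/2005/Atom"><link href="http://example.com/"/></feed>'
   xxs = Selector(text=doc, type='xml')

   xxs.xpath('//link')                     # [] -- the Atom namespace hides the nodes
   xxs.remove_namespaces()
   xxs.xpath('//link/@href').extract()     # [u'http://example.com/']

   # alternatively, keep the namespace and register a prefix for it
   xxs2 = Selector(text=doc, type='xml')
   xxs2.register_namespace('atom', 'http://www.w3.org/2005/Atom')
   xxs2.xpath('//atom:link/@href').extract()  # [u'http://example.com/']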
|
@ -9,10 +9,10 @@ scraping code very quickly, without having to run the spider. It's meant to be
|
||||
used for testing data extraction code, but you can actually use it for testing
|
||||
any kind of code as it is also a regular Python shell.
|
||||
|
||||
The shell is used for testing XPath expressions and see how they work and what
|
||||
data they extract from the web pages you're trying to scrape. It allows you to
|
||||
interactively test your XPaths while you're writing your spider, without having
|
||||
to run the spider to test every change.
|
||||
The shell is used for testing XPath or CSS expressions and seeing how they work
|
||||
and what data they extract from the web pages you're trying to scrape. It
|
||||
allows you to interactively test your expressions while you're writing your
|
||||
spider, without having to run the spider to test every change.
|
||||
|
||||
Once you get familiarized with the Scrapy shell, you'll see that it's an
|
||||
invaluable tool for developing and debugging your spiders.
|
||||
@ -66,7 +66,7 @@ Available Scrapy objects
|
||||
|
||||
The Scrapy shell automatically creates some convenient objects from the
|
||||
downloaded page, like the :class:`~scrapy.http.Response` object and the
|
||||
:class:`~scrapy.selector.XPathSelector` objects (for both HTML and XML
|
||||
:class:`~scrapy.selector.Selector` objects (for both HTML and XML
|
||||
content).
|
||||
|
||||
Those objects are:
|
||||
@ -83,10 +83,7 @@ Those objects are:
|
||||
* ``response`` - a :class:`~scrapy.http.Response` object containing the last
|
||||
fetched page
|
||||
|
||||
* ``hxs`` - a :class:`~scrapy.selector.HtmlXPathSelector` object constructed
|
||||
with the last response fetched
|
||||
|
||||
* ``xxs`` - a :class:`~scrapy.selector.XmlXPathSelector` object constructed
|
||||
* ``sel`` - a :class:`~scrapy.selector.Selector` object constructed
|
||||
with the last response fetched
|
||||
|
||||
* ``settings`` - the current :ref:`Scrapy settings <topics-settings>`
|
||||
@ -114,13 +111,12 @@ list of available objects and useful shortcuts (you'll notice that these lines
|
||||
all start with the ``[s]`` prefix)::
|
||||
|
||||
[s] Available objects
|
||||
[s] hxs <HtmlXPathSelector (http://scrapy.org) xpath=None>
|
||||
[s] sel <Selector (http://scrapy.org) xpath=None>
|
||||
[s] item Item()
|
||||
[s] request <http://scrapy.org>
|
||||
[s] response <http://scrapy.org>
|
||||
[s] settings <Settings 'mybot.settings'>
|
||||
[s] spider <scrapy.spider.models.BaseSpider object at 0x2bed9d0>
|
||||
[s] xxs <XmlXPathSelector (http://scrapy.org) xpath=None>
|
||||
[s] Useful shortcuts:
|
||||
[s] shelp() Prints this help.
|
||||
[s] fetch(req_or_url) Fetch a new request or URL and update objects
|
||||
@ -130,24 +126,23 @@ all start with the ``[s]`` prefix)::
|
||||
|
||||
After that, we can start playing with the objects::
|
||||
|
||||
>>> hxs.select("//h2/text()").extract()[0]
|
||||
>>> sel.xpath("//h2/text()").extract()[0]
|
||||
u'Welcome to Scrapy'
|
||||
|
||||
>>> fetch("http://slashdot.org")
|
||||
[s] Available Scrapy objects:
|
||||
[s] hxs <HtmlXPathSelector (http://slashdot.org) xpath=None>
|
||||
[s] sel <Selector (http://slashdot.org) xpath=None>
|
||||
[s] item JobItem()
|
||||
[s] request <GET http://slashdot.org>
|
||||
[s] response <200 http://slashdot.org>
|
||||
[s] settings <Settings 'jobsbot.settings'>
|
||||
[s] spider <BaseSpider 'default' at 0x3c44a10>
|
||||
[s] xxs <XmlXPathSelector (http://slashdot.org) xpath=None>
|
||||
[s] Useful shortcuts:
|
||||
[s] shelp() Shell help (print this help)
|
||||
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
|
||||
[s] view(response) View response in a browser
|
||||
|
||||
>>> hxs.select("//h2/text()").extract()
|
||||
>>> sel.xpath("//h2/text()").extract()
|
||||
[u'News for nerds, stuff that matters']
|
||||
|
||||
>>> request = request.replace(method="POST")
|
||||
@ -185,7 +180,7 @@ When you run the spider, you will get something similar to this::
|
||||
2009-08-27 19:15:25-0300 [example.com] DEBUG: Crawled <http://www.example.com/> (referer: <None>)
|
||||
2009-08-27 19:15:26-0300 [example.com] DEBUG: Crawled <http://www.example.com/products.php> (referer: <http://www.example.com/>)
|
||||
[s] Available objects
|
||||
[s] hxs <HtmlXPathSelector (http://www.example.com/products.php) xpath=None>
|
||||
[s] sel <Selector (http://www.example.com/products.php) xpath=None>
|
||||
...
|
||||
|
||||
>>> response.url
|
||||
@ -193,7 +188,7 @@ When you run the spider, you will get something similar to this::
|
||||
|
||||
Then, you can check if the extraction code is working::
|
||||
|
||||
>>> hxs.select('//h1')
|
||||
>>> sel.xpath('//h1')
|
||||
[]
|
||||
|
||||
Nope, it doesn't. So you can open the response in your web browser and see if
|
||||
|
@ -216,7 +216,7 @@ Let's see an example::
|
||||
|
||||
Another example returning multiples Requests and Items from a single callback::
|
||||
|
||||
from scrapy.selector import HtmlXPathSelector
|
||||
from scrapy.selector import Selector
|
||||
from scrapy.spider import BaseSpider
|
||||
from scrapy.http import Request
|
||||
from myproject.items import MyItem
|
||||
@ -231,11 +231,11 @@ Another example returning multiples Requests and Items from a single callback::
|
||||
]
|
||||
|
||||
def parse(self, response):
|
||||
hxs = HtmlXPathSelector(response)
|
||||
for h3 in hxs.select('//h3').extract():
|
||||
sel = Selector(response)
|
||||
for h3 in sel.xpath('//h3').extract():
|
||||
yield MyItem(title=h3)
|
||||
|
||||
for url in hxs.select('//a/@href').extract():
|
||||
for url in sel.xpath('//a/@href').extract():
|
||||
yield Request(url, callback=self.parse)
|
||||
|
||||
.. module:: scrapy.contrib.spiders
|
||||
@ -314,7 +314,7 @@ Let's now take a look at an example CrawlSpider with rules::
|
||||
|
||||
from scrapy.contrib.spiders import CrawlSpider, Rule
|
||||
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
|
||||
from scrapy.selector import HtmlXPathSelector
|
||||
from scrapy.selector import Selector
|
||||
from scrapy.item import Item
|
||||
|
||||
class MySpider(CrawlSpider):
|
||||
@ -334,11 +334,11 @@ Let's now take a look at an example CrawlSpider with rules::
|
||||
def parse_item(self, response):
|
||||
self.log('Hi, this is an item page! %s' % response.url)
|
||||
|
||||
hxs = HtmlXPathSelector(response)
|
||||
sel = Selector(response)
|
||||
item = Item()
|
||||
item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
|
||||
item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
|
||||
item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
|
||||
item['id'] = sel.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
|
||||
item['name'] = sel.xpath('//td[@id="item_name"]/text()').extract()
|
||||
item['description'] = sel.xpath('//td[@id="item_description"]/text()').extract()
|
||||
return item
|
||||
|
||||
|
||||
@ -366,15 +366,15 @@ XMLFeedSpider
|
||||
|
||||
A string which defines the iterator to use. It can be either:
|
||||
|
||||
- ``'iternodes'`` - a fast iterator based on regular expressions
|
||||
- ``'iternodes'`` - a fast iterator based on regular expressions
|
||||
|
||||
- ``'html'`` - an iterator which uses HtmlXPathSelector. Keep in mind
|
||||
this uses DOM parsing and must load all DOM in memory which could be a
|
||||
problem for big feeds
|
||||
- ``'html'`` - an iterator which uses :class:`~scrapy.selector.Selector`.
|
||||
Keep in mind this uses DOM parsing and must load all DOM in memory
|
||||
which could be a problem for big feeds
|
||||
|
||||
- ``'xml'`` - an iterator which uses XmlXPathSelector. Keep in mind
|
||||
this uses DOM parsing and must load all DOM in memory which could be a
|
||||
problem for big feeds
|
||||
- ``'xml'`` - an iterator which uses :class:`~scrapy.selector.Selector`.
|
||||
Keep in mind this uses DOM parsing and must load all DOM in memory
|
||||
which could be a problem for big feeds
|
||||
|
||||
It defaults to: ``'iternodes'``.
|
||||
|
||||
@ -390,7 +390,7 @@ XMLFeedSpider
|
||||
available in that document that will be processed with this spider. The
|
||||
``prefix`` and ``uri`` will be used to automatically register
|
||||
namespaces using the
|
||||
:meth:`~scrapy.selector.XPathSelector.register_namespace` method.
|
||||
:meth:`~scrapy.selector.Selector.register_namespace` method.
|
||||
|
||||
You can then specify nodes with namespaces in the :attr:`itertag`
|
||||
attribute.
|
||||
@ -416,9 +416,10 @@ XMLFeedSpider
|
||||
.. method:: parse_node(response, selector)
|
||||
|
||||
This method is called for the nodes matching the provided tag name
|
||||
(``itertag``). Receives the response and an XPathSelector for each node.
|
||||
Overriding this method is mandatory. Otherwise, you spider won't work.
|
||||
This method must return either a :class:`~scrapy.item.Item` object, a
|
||||
(``itertag``). Receives the response and an
|
||||
:class:`~scrapy.selector.Selector` for each node. Overriding this
|
||||
method is mandatory. Otherwise, your spider won't work. This method
|
||||
must return either a :class:`~scrapy.item.Item` object, a
|
||||
:class:`~scrapy.http.Request` object, or an iterable containing any of
|
||||
them.
|
||||
|
||||
@ -451,9 +452,9 @@ These spiders are pretty easy to use, let's have a look at one example::
|
||||
log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))
|
||||
|
||||
item = Item()
|
||||
item['id'] = node.select('@id').extract()
|
||||
item['name'] = node.select('name').extract()
|
||||
item['description'] = node.select('description').extract()
|
||||
item['id'] = node.xpath('@id').extract()
|
||||
item['name'] = node.xpath('name').extract()
|
||||
item['description'] = node.xpath('description').extract()
|
||||
return item
|
||||
|
||||
Basically what we did up there was to create a spider that downloads a feed from
|
||||
|
@ -30,13 +30,6 @@ except ImportError:
|
||||
else:
|
||||
optional_features.add('boto')
|
||||
|
||||
try:
|
||||
import libxml2
|
||||
except ImportError:
|
||||
pass
|
||||
else:
|
||||
optional_features.add('libxml2')
|
||||
|
||||
try:
|
||||
import django
|
||||
except ImportError:
|
||||
|
@ -6,6 +6,7 @@ import twisted
|
||||
import scrapy
|
||||
from scrapy.command import ScrapyCommand
|
||||
|
||||
|
||||
class Command(ScrapyCommand):
|
||||
|
||||
def syntax(self):
|
||||
@ -21,13 +22,9 @@ class Command(ScrapyCommand):
|
||||
|
||||
def run(self, args, opts):
|
||||
if opts.verbose:
|
||||
try:
|
||||
import lxml.etree
|
||||
except ImportError:
|
||||
lxml_version = libxml2_version = "(lxml not available)"
|
||||
else:
|
||||
lxml_version = ".".join(map(str, lxml.etree.LXML_VERSION))
|
||||
libxml2_version = ".".join(map(str, lxml.etree.LIBXML_VERSION))
|
||||
import lxml.etree
|
||||
lxml_version = ".".join(map(str, lxml.etree.LXML_VERSION))
|
||||
libxml2_version = ".".join(map(str, lxml.etree.LIBXML_VERSION))
|
||||
print "Scrapy : %s" % scrapy.__version__
|
||||
print "lxml : %s" % lxml_version
|
||||
print "libxml2 : %s" % libxml2_version
|
||||
|
@ -4,7 +4,7 @@ SGMLParser-based Link extractors
|
||||
import re
|
||||
from urlparse import urlparse, urljoin
|
||||
from w3lib.url import safe_url_string
|
||||
from scrapy.selector import HtmlXPathSelector
|
||||
from scrapy.selector import Selector
|
||||
from scrapy.link import Link
|
||||
from scrapy.linkextractor import IGNORED_EXTENSIONS
|
||||
from scrapy.utils.misc import arg_to_iter
|
||||
@ -116,11 +116,11 @@ class SgmlLinkExtractor(BaseSgmlLinkExtractor):
|
||||
def extract_links(self, response):
|
||||
base_url = None
|
||||
if self.restrict_xpaths:
|
||||
hxs = HtmlXPathSelector(response)
|
||||
sel = Selector(response)
|
||||
base_url = get_base_url(response)
|
||||
body = u''.join(f
|
||||
for x in self.restrict_xpaths
|
||||
for f in hxs.select(x).extract()
|
||||
for f in sel.xpath(x).extract()
|
||||
).encode(response.encoding)
|
||||
else:
|
||||
body = response.body
|
||||
|
@ -8,7 +8,7 @@ from collections import defaultdict
import re

from scrapy.item import Item
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.utils.misc import arg_to_iter, extract_regex
from scrapy.utils.python import flatten
from .common import wrap_loader_context
@ -116,7 +116,7 @@ class ItemLoader(object):

class XPathItemLoader(ItemLoader):

    default_selector_class = HtmlXPathSelector
    default_selector_class = Selector

    def __init__(self, item=None, selector=None, response=None, **context):
        if selector is None and response is None:
@ -142,5 +142,4 @@ class XPathItemLoader(ItemLoader):

    def _get_values(self, xpaths, **kw):
        xpaths = arg_to_iter(xpaths)
        return flatten([self.selector.select(xpath).extract() for xpath in xpaths])
        return flatten([self.selector.xpath(xpath).extract() for xpath in xpaths])
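A hedged usage sketch of ``XPathItemLoader`` now backed by the unified Selector (the ``Product`` item below is illustrative, not part of this patch)::

    from scrapy.item import Item, Field
    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.selector import Selector

    class Product(Item):
        name = Field()

    sel = Selector(text=u"<html><body><div>marta</div></body></html>")
    l = XPathItemLoader(item=Product(), selector=sel)
    l.add_xpath('name', '//div/text()')   # values are pulled via the selector's .xpath()
    l.load_item()                         # {'name': [u'marta']} with the default processors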
@ -10,14 +10,10 @@ from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.utils.trackref import live_refs


class MemoryDebugger(object):

    def __init__(self, stats):
        try:
            import libxml2
            self.libxml2 = libxml2
        except ImportError:
            self.libxml2 = None
        self.stats = stats

    @classmethod
@ -25,18 +21,10 @@ class MemoryDebugger(object):
        if not crawler.settings.getbool('MEMDEBUG_ENABLED'):
            raise NotConfigured
        o = cls(crawler.stats)
        crawler.signals.connect(o.engine_started, signals.engine_started)
        crawler.signals.connect(o.engine_stopped, signals.engine_stopped)
        return o

    def engine_started(self):
        if self.libxml2:
            self.libxml2.debugMemory(1)

    def engine_stopped(self):
        if self.libxml2:
            self.libxml2.cleanupParser()
            self.stats.set_value('memdebug/libxml2_leaked_bytes', self.libxml2.debugMemory(1))
        gc.collect()
        self.stats.set_value('memdebug/gc_garbage_count', len(gc.garbage))
        for cls, wdict in live_refs.iteritems():
@ -9,7 +9,7 @@ from scrapy.item import BaseItem
from scrapy.http import Request
from scrapy.utils.iterators import xmliter, csviter
from scrapy.utils.spider import iterate_spider_output
from scrapy.selector import XmlXPathSelector, HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.exceptions import NotConfigured, NotSupported


@ -52,7 +52,7 @@ class XMLFeedSpider(BaseSpider):

    def parse_nodes(self, response, nodes):
        """This method is called for the nodes matching the provided tag name
        (itertag). Receives the response and an XPathSelector for each node.
        (itertag). Receives the response and a Selector for each node.
        Overriding this method is mandatory. Otherwise, your spider won't work.
        This method must return either a BaseItem, a Request, or a list
        containing any of them.
@ -71,13 +71,13 @@ class XMLFeedSpider(BaseSpider):
        if self.iterator == 'iternodes':
            nodes = self._iternodes(response)
        elif self.iterator == 'xml':
            selector = XmlXPathSelector(response)
            selector = Selector(response, type='xml')
            self._register_namespaces(selector)
            nodes = selector.select('//%s' % self.itertag)
            nodes = selector.xpath('//%s' % self.itertag)
        elif self.iterator == 'html':
            selector = HtmlXPathSelector(response)
            selector = Selector(response, type='html')
            self._register_namespaces(selector)
            nodes = selector.select('//%s' % self.itertag)
            nodes = selector.xpath('//%s' % self.itertag)
        else:
            raise NotSupported('Unsupported node iterator')
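The selection done by the 'xml' and 'html' branches can be reproduced standalone; a small sketch (feed body and itertag value are illustrative)::

    from scrapy.selector import Selector

    sel = Selector(text='<rss><item>a</item><item>b</item></rss>', type='xml')
    nodes = sel.xpath('//item')        # same '//%s' % itertag expression the spider builds
    [n.extract() for n in nodes]       # [u'<item>a</item>', u'<item>b</item>']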
@ -1,5 +1,5 @@
from scrapy.http import Response
from scrapy.selector import XmlXPathSelector
from scrapy.selector import Selector


def xmliter_lxml(obj, nodename, namespace=None):
@ -11,10 +11,10 @@ def xmliter_lxml(obj, nodename, namespace=None):
    for _, node in iterable:
        nodetext = etree.tostring(node)
        node.clear()
        xs = XmlXPathSelector(text=nodetext)
        xs = Selector(text=nodetext, type='xml')
        if namespace:
            xs.register_namespace('x', namespace)
        yield xs.select(selxpath)[0]
        yield xs.xpath(selxpath)[0]


class _StreamReader(object):
@ -1,26 +1,5 @@
"""
XPath selectors

To select the backend explicitly use the SCRAPY_SELECTORS_BACKEND environment
variable.

Two backends are currently available: lxml (default) and libxml2.

Selectors
"""

import os

backend = os.environ.get('SCRAPY_SELECTORS_BACKEND')

if backend == 'libxml2':
    from scrapy.selector.libxml2sel import *
elif backend == 'lxml':
    from scrapy.selector.lxmlsel import *
else:
    try:
        import lxml
    except ImportError:
        import libxml2
        from scrapy.selector.libxml2sel import *
    else:
        from scrapy.selector.lxmlsel import *
from scrapy.selector.unified import *
from scrapy.selector.lxmlsel import *
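After this simplification the package no longer switches backends at import time; a quick sketch of what user code can still import from the new ``__init__``::

    from scrapy.selector import Selector           # unified, lxml-backed selector
    from scrapy.selector import HtmlXPathSelector  # still importable, but deprecated (see lxmlsel.py below)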
88
scrapy/selector/csstranslator.py
Normal file
@ -0,0 +1,88 @@
|
||||
from cssselect import GenericTranslator, HTMLTranslator
|
||||
from cssselect.xpath import _unicode_safe_getattr, XPathExpr, ExpressionError
|
||||
from cssselect.parser import FunctionalPseudoElement
|
||||
|
||||
|
||||
class ScrapyXPathExpr(XPathExpr):
|
||||
|
||||
textnode = False
|
||||
attribute = None
|
||||
|
||||
@classmethod
|
||||
def from_xpath(cls, xpath, textnode=False, attribute=None):
|
||||
x = cls(path=xpath.path, element=xpath.element, condition=xpath.condition)
|
||||
x.textnode = textnode
|
||||
x.attribute = attribute
|
||||
return x
|
||||
|
||||
def __str__(self):
|
||||
path = super(ScrapyXPathExpr, self).__str__()
|
||||
if self.textnode:
|
||||
if path == '*':
|
||||
path = 'text()'
|
||||
elif path.endswith('::*/*'):
|
||||
path = path[:-3] + 'text()'
|
||||
else:
|
||||
path += '/text()'
|
||||
|
||||
if self.attribute is not None:
|
||||
if path.endswith('::*/*'):
|
||||
path = path[:-2]
|
||||
path += '/@%s' % self.attribute
|
||||
|
||||
return path
|
||||
|
||||
def join(self, combiner, other):
|
||||
super(ScrapyXPathExpr, self).join(combiner, other)
|
||||
self.textnode = other.textnode
|
||||
self.attribute = other.attribute
|
||||
return self
|
||||
|
||||
|
||||
class TranslatorMixin(object):
|
||||
|
||||
def xpath_element(self, selector):
|
||||
xpath = super(TranslatorMixin, self).xpath_element(selector)
|
||||
return ScrapyXPathExpr.from_xpath(xpath)
|
||||
|
||||
def xpath_pseudo_element(self, xpath, pseudo_element):
|
||||
if isinstance(pseudo_element, FunctionalPseudoElement):
|
||||
method = 'xpath_%s_functional_pseudo_element' % (
|
||||
pseudo_element.name.replace('-', '_'))
|
||||
method = _unicode_safe_getattr(self, method, None)
|
||||
if not method:
|
||||
raise ExpressionError(
|
||||
"The functional pseudo-element ::%s() is unknown"
|
||||
% pseudo_element.name)
|
||||
xpath = method(xpath, pseudo_element)
|
||||
else:
|
||||
method = 'xpath_%s_simple_pseudo_element' % (
|
||||
pseudo_element.replace('-', '_'))
|
||||
method = _unicode_safe_getattr(self, method, None)
|
||||
if not method:
|
||||
raise ExpressionError(
|
||||
"The pseudo-element ::%s is unknown"
|
||||
% pseudo_element)
|
||||
xpath = method(xpath)
|
||||
return xpath
|
||||
|
||||
def xpath_attr_functional_pseudo_element(self, xpath, function):
|
||||
if function.argument_types() not in (['STRING'], ['IDENT']):
|
||||
raise ExpressionError(
|
||||
"Expected a single string or ident for ::attr(), got %r"
|
||||
% function.arguments)
|
||||
return ScrapyXPathExpr.from_xpath(xpath,
|
||||
attribute=function.arguments[0].value)
|
||||
|
||||
def xpath_text_simple_pseudo_element(self, xpath):
|
||||
"""Support selecting text nodes using ::text pseudo-element"""
|
||||
return ScrapyXPathExpr.from_xpath(xpath, textnode=True)
|
||||
|
||||
|
||||
class ScrapyGenericTranslator(TranslatorMixin, GenericTranslator):
|
||||
pass
|
||||
|
||||
|
||||
class ScrapyHTMLTranslator(TranslatorMixin, HTMLTranslator):
|
||||
pass
|
||||
|
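A short sketch of what the translator above produces for the two Scrapy-specific pseudo-elements (expected strings taken from the test cases later in this diff)::

    from scrapy.selector.csstranslator import ScrapyHTMLTranslator

    tr = ScrapyHTMLTranslator()
    tr.css_to_xpath('p::text')        # u'descendant-or-self::p/text()'
    tr.css_to_xpath('a::attr(href)')  # u'descendant-or-self::a/@href'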
@ -1,82 +0,0 @@
|
||||
"""
|
||||
This module contains a simple class (Libxml2Document) which provides cache and
|
||||
garbage collection to libxml2 documents (xmlDoc).
|
||||
"""
|
||||
|
||||
import weakref
|
||||
from scrapy.utils.trackref import object_ref
|
||||
from scrapy import optional_features
|
||||
|
||||
if 'libxml2' in optional_features:
|
||||
import libxml2
|
||||
xml_parser_options = libxml2.XML_PARSE_RECOVER + \
|
||||
libxml2.XML_PARSE_NOERROR + \
|
||||
libxml2.XML_PARSE_NOWARNING
|
||||
|
||||
html_parser_options = libxml2.HTML_PARSE_RECOVER + \
|
||||
libxml2.HTML_PARSE_NOERROR + \
|
||||
libxml2.HTML_PARSE_NOWARNING
|
||||
|
||||
|
||||
_UTF8_ENCODINGS = set(('utf-8', 'UTF-8', 'utf8', 'UTF8'))
|
||||
def _body_as_utf8(response):
|
||||
if response.encoding in _UTF8_ENCODINGS:
|
||||
return response.body
|
||||
else:
|
||||
return response.body_as_unicode().encode('utf-8')
|
||||
|
||||
|
||||
def xmlDoc_from_html(response):
|
||||
"""Return libxml2 doc for HTMLs"""
|
||||
utf8body = _body_as_utf8(response) or ' '
|
||||
try:
|
||||
lxdoc = libxml2.htmlReadDoc(utf8body, response.url, 'utf-8', \
|
||||
html_parser_options)
|
||||
except TypeError: # libxml2 doesn't parse text with null bytes
|
||||
lxdoc = libxml2.htmlReadDoc(utf8body.replace("\x00", ""), response.url, \
|
||||
'utf-8', html_parser_options)
|
||||
return lxdoc
|
||||
|
||||
|
||||
def xmlDoc_from_xml(response):
|
||||
"""Return libxml2 doc for XMLs"""
|
||||
utf8body = _body_as_utf8(response) or ' '
|
||||
try:
|
||||
lxdoc = libxml2.readDoc(utf8body, response.url, 'utf-8', \
|
||||
xml_parser_options)
|
||||
except TypeError: # libxml2 doesn't parse text with null bytes
|
||||
lxdoc = libxml2.readDoc(utf8body.replace("\x00", ""), response.url, \
|
||||
'utf-8', xml_parser_options)
|
||||
return lxdoc
|
||||
|
||||
|
||||
class Libxml2Document(object_ref):
|
||||
|
||||
cache = weakref.WeakKeyDictionary()
|
||||
__slots__ = ['xmlDoc', 'xpathContext', '__weakref__']
|
||||
|
||||
def __new__(cls, response, factory=xmlDoc_from_html):
|
||||
cache = cls.cache.setdefault(response, {})
|
||||
if factory not in cache:
|
||||
obj = object_ref.__new__(cls)
|
||||
obj.xmlDoc = factory(response)
|
||||
obj.xpathContext = obj.xmlDoc.xpathNewContext()
|
||||
cache[factory] = obj
|
||||
return cache[factory]
|
||||
|
||||
def __del__(self):
|
||||
# we must call both cleanup functions, so we try/except all exceptions
|
||||
# to make sure one doesn't prevent the other from being called
|
||||
# this call sometimes raises a "NoneType is not callable" TypeError
|
||||
# so the try/except block silences them
|
||||
try:
|
||||
self.xmlDoc.freeDoc()
|
||||
except:
|
||||
pass
|
||||
try:
|
||||
self.xpathContext.xpathFreeContext()
|
||||
except:
|
||||
pass
|
||||
|
||||
def __str__(self):
|
||||
return "<Libxml2Document %s>" % self.xmlDoc.name
|
@ -1,117 +0,0 @@
|
||||
"""
|
||||
XPath selectors based on libxml2
|
||||
"""
|
||||
|
||||
from scrapy import optional_features
|
||||
if 'libxml2' in optional_features:
|
||||
import libxml2
|
||||
|
||||
from scrapy.http import TextResponse
|
||||
from scrapy.utils.python import unicode_to_str
|
||||
from scrapy.utils.misc import extract_regex
|
||||
from scrapy.utils.trackref import object_ref
|
||||
from scrapy.utils.decorator import deprecated
|
||||
from .libxml2document import Libxml2Document, xmlDoc_from_html, xmlDoc_from_xml
|
||||
from .list import XPathSelectorList
|
||||
|
||||
__all__ = ['HtmlXPathSelector', 'XmlXPathSelector', 'XPathSelector', \
|
||||
'XPathSelectorList']
|
||||
|
||||
class XPathSelector(object_ref):
|
||||
|
||||
__slots__ = ['doc', 'xmlNode', 'expr', '__weakref__']
|
||||
|
||||
def __init__(self, response=None, text=None, node=None, parent=None, expr=None):
|
||||
if parent is not None:
|
||||
self.doc = parent.doc
|
||||
self.xmlNode = node
|
||||
elif response:
|
||||
self.doc = Libxml2Document(response, factory=self._get_libxml2_doc)
|
||||
self.xmlNode = self.doc.xmlDoc
|
||||
elif text:
|
||||
response = TextResponse(url='about:blank', \
|
||||
body=unicode_to_str(text, 'utf-8'), encoding='utf-8')
|
||||
self.doc = Libxml2Document(response, factory=self._get_libxml2_doc)
|
||||
self.xmlNode = self.doc.xmlDoc
|
||||
self.expr = expr
|
||||
|
||||
def select(self, xpath):
|
||||
if hasattr(self.xmlNode, 'xpathEval'):
|
||||
self.doc.xpathContext.setContextNode(self.xmlNode)
|
||||
xpath = unicode_to_str(xpath, 'utf-8')
|
||||
try:
|
||||
xpath_result = self.doc.xpathContext.xpathEval(xpath)
|
||||
except libxml2.xpathError:
|
||||
raise ValueError("Invalid XPath: %s" % xpath)
|
||||
if hasattr(xpath_result, '__iter__'):
|
||||
return XPathSelectorList([self.__class__(node=node, parent=self, \
|
||||
expr=xpath) for node in xpath_result])
|
||||
else:
|
||||
return XPathSelectorList([self.__class__(node=xpath_result, \
|
||||
parent=self, expr=xpath)])
|
||||
else:
|
||||
return XPathSelectorList([])
|
||||
|
||||
def re(self, regex):
|
||||
return extract_regex(regex, self.extract())
|
||||
|
||||
def extract(self):
|
||||
if isinstance(self.xmlNode, basestring):
|
||||
text = unicode(self.xmlNode, 'utf-8', errors='ignore')
|
||||
elif hasattr(self.xmlNode, 'serialize'):
|
||||
if isinstance(self.xmlNode, libxml2.xmlDoc):
|
||||
data = self.xmlNode.getRootElement().serialize('utf-8')
|
||||
text = unicode(data, 'utf-8', errors='ignore') if data else u''
|
||||
elif isinstance(self.xmlNode, libxml2.xmlAttr):
|
||||
# serialization doesn't work sometimes for xmlAttr types
|
||||
text = unicode(self.xmlNode.content, 'utf-8', errors='ignore')
|
||||
else:
|
||||
data = self.xmlNode.serialize('utf-8')
|
||||
text = unicode(data, 'utf-8', errors='ignore') if data else u''
|
||||
else:
|
||||
try:
|
||||
text = unicode(self.xmlNode, 'utf-8', errors='ignore')
|
||||
except TypeError: # catched when self.xmlNode is a float - see tests
|
||||
text = unicode(self.xmlNode)
|
||||
return text
|
||||
|
||||
def extract_unquoted(self):
|
||||
"""Get unescaped contents from the text node (no entities, no CDATA)"""
|
||||
# TODO: this function should be deprecated. but what would be use instead?
|
||||
if self.select('self::text()'):
|
||||
return unicode(self.xmlNode.getContent(), 'utf-8', errors='ignore')
|
||||
else:
|
||||
return u''
|
||||
|
||||
def register_namespace(self, prefix, uri):
|
||||
self.doc.xpathContext.xpathRegisterNs(prefix, uri)
|
||||
|
||||
def _get_libxml2_doc(self, response):
|
||||
return xmlDoc_from_html(response)
|
||||
|
||||
def __nonzero__(self):
|
||||
return bool(self.extract())
|
||||
|
||||
def __str__(self):
|
||||
data = repr(self.extract()[:40])
|
||||
return "<%s xpath=%r data=%s>" % (type(self).__name__, self.expr, data)
|
||||
|
||||
__repr__ = __str__
|
||||
|
||||
@deprecated(use_instead='XPathSelector.select')
|
||||
def __call__(self, xpath):
|
||||
return self.select(xpath)
|
||||
|
||||
@deprecated(use_instead='XPathSelector.select')
|
||||
def x(self, xpath):
|
||||
return self.select(xpath)
|
||||
|
||||
|
||||
class XmlXPathSelector(XPathSelector):
|
||||
__slots__ = ()
|
||||
_get_libxml2_doc = staticmethod(xmlDoc_from_xml)
|
||||
|
||||
|
||||
class HtmlXPathSelector(XPathSelector):
|
||||
__slots__ = ()
|
||||
_get_libxml2_doc = staticmethod(xmlDoc_from_html)
|
@ -1,23 +0,0 @@
|
||||
from scrapy.utils.python import flatten
|
||||
from scrapy.utils.decorator import deprecated
|
||||
|
||||
class XPathSelectorList(list):
|
||||
|
||||
def __getslice__(self, i, j):
|
||||
return self.__class__(list.__getslice__(self, i, j))
|
||||
|
||||
def select(self, xpath):
|
||||
return self.__class__(flatten([x.select(xpath) for x in self]))
|
||||
|
||||
def re(self, regex):
|
||||
return flatten([x.re(regex) for x in self])
|
||||
|
||||
def extract(self):
|
||||
return [x.extract() for x in self]
|
||||
|
||||
def extract_unquoted(self):
|
||||
return [x.extract_unquoted() for x in self]
|
||||
|
||||
@deprecated(use_instead='XPathSelectorList.select')
|
||||
def x(self, xpath):
|
||||
return self.select(xpath)
|
@ -1,109 +1,47 @@
|
||||
"""
|
||||
XPath selectors based on lxml
|
||||
"""
|
||||
|
||||
from lxml import etree
|
||||
|
||||
from scrapy.utils.misc import extract_regex
|
||||
from scrapy.utils.trackref import object_ref
|
||||
from scrapy.utils.python import unicode_to_str
|
||||
from scrapy.utils.decorator import deprecated
|
||||
from scrapy.http import TextResponse
|
||||
from .lxmldocument import LxmlDocument
|
||||
from .list import XPathSelectorList
|
||||
from .unified import Selector, SelectorList
|
||||
|
||||
|
||||
__all__ = ['HtmlXPathSelector', 'XmlXPathSelector', 'XPathSelector', \
|
||||
__all__ = ['HtmlXPathSelector', 'XmlXPathSelector', 'XPathSelector',
|
||||
'XPathSelectorList']
|
||||
|
||||
|
||||
class XPathSelector(object_ref):
|
||||
class XPathSelector(Selector):
|
||||
__slots__ = ()
|
||||
_default_type = 'html'
|
||||
|
||||
__slots__ = ['response', 'text', 'namespaces', '_expr', '_root', '__weakref__']
|
||||
_parser = etree.HTMLParser
|
||||
_tostring_method = 'html'
|
||||
def __init__(self, *a, **kw):
|
||||
import warnings
|
||||
from scrapy.exceptions import ScrapyDeprecationWarning
|
||||
warnings.warn('%s is deprecated, instantiate scrapy.selector.Selector '
|
||||
'instead' % type(self).__name__,
|
||||
category=ScrapyDeprecationWarning, stacklevel=1)
|
||||
super(XPathSelector, self).__init__(*a, **kw)
|
||||
|
||||
def __init__(self, response=None, text=None, namespaces=None, _root=None, _expr=None):
|
||||
if text is not None:
|
||||
response = TextResponse(url='about:blank', \
|
||||
body=unicode_to_str(text, 'utf-8'), encoding='utf-8')
|
||||
if response is not None:
|
||||
_root = LxmlDocument(response, self._parser)
|
||||
|
||||
self.namespaces = namespaces
|
||||
self.response = response
|
||||
self._root = _root
|
||||
self._expr = _expr
|
||||
|
||||
def select(self, xpath):
|
||||
try:
|
||||
xpathev = self._root.xpath
|
||||
except AttributeError:
|
||||
return XPathSelectorList([])
|
||||
|
||||
try:
|
||||
result = xpathev(xpath, namespaces=self.namespaces)
|
||||
except etree.XPathError:
|
||||
raise ValueError("Invalid XPath: %s" % xpath)
|
||||
|
||||
if type(result) is not list:
|
||||
result = [result]
|
||||
|
||||
result = [self.__class__(_root=x, _expr=xpath, namespaces=self.namespaces)
|
||||
for x in result]
|
||||
return XPathSelectorList(result)
|
||||
|
||||
def re(self, regex):
|
||||
return extract_regex(regex, self.extract())
|
||||
|
||||
def extract(self):
|
||||
try:
|
||||
return etree.tostring(self._root, method=self._tostring_method, \
|
||||
encoding=unicode, with_tail=False)
|
||||
except (AttributeError, TypeError):
|
||||
if self._root is True:
|
||||
return u'1'
|
||||
elif self._root is False:
|
||||
return u'0'
|
||||
else:
|
||||
return unicode(self._root)
|
||||
|
||||
def register_namespace(self, prefix, uri):
|
||||
if self.namespaces is None:
|
||||
self.namespaces = {}
|
||||
self.namespaces[prefix] = uri
|
||||
|
||||
def remove_namespaces(self):
|
||||
for el in self._root.iter('*'):
|
||||
if el.tag.startswith('{'):
|
||||
el.tag = el.tag.split('}', 1)[1]
|
||||
# loop on element attributes also
|
||||
for an in el.attrib.keys():
|
||||
if an.startswith('{'):
|
||||
el.attrib[an.split('}', 1)[1]] = el.attrib.pop(an)
|
||||
|
||||
def __nonzero__(self):
|
||||
return bool(self.extract())
|
||||
|
||||
def __str__(self):
|
||||
data = repr(self.extract()[:40])
|
||||
return "<%s xpath=%r data=%s>" % (type(self).__name__, self._expr, data)
|
||||
|
||||
__repr__ = __str__
|
||||
|
||||
|
||||
@deprecated(use_instead='XPathSelector.extract')
|
||||
def extract_unquoted(self):
|
||||
return self.extract()
|
||||
def css(self, *a, **kw):
|
||||
raise RuntimeError('.css() method not available for %s, '
|
||||
'instantiate scrapy.selector.Selector '
|
||||
'instead' % type(self).__name__)
|
||||
|
||||
|
||||
class XmlXPathSelector(XPathSelector):
|
||||
__slots__ = ()
|
||||
_parser = etree.XMLParser
|
||||
_tostring_method = 'xml'
|
||||
_default_type = 'xml'
|
||||
|
||||
|
||||
class HtmlXPathSelector(XPathSelector):
|
||||
__slots__ = ()
|
||||
_parser = etree.HTMLParser
|
||||
_tostring_method = 'html'
|
||||
_default_type = 'html'
|
||||
|
||||
|
||||
class XPathSelectorList(SelectorList):
|
||||
|
||||
def __init__(self, *a, **kw):
|
||||
import warnings
|
||||
from scrapy.exceptions import ScrapyDeprecationWarning
|
||||
warnings.warn('XPathSelectorList is deprecated, instantiate '
|
||||
'scrapy.selector.SelectorList instead',
|
||||
category=ScrapyDeprecationWarning, stacklevel=1)
|
||||
super(XPathSelectorList, self).__init__(*a, **kw)
|
||||
|
170
scrapy/selector/unified.py
Normal file
@ -0,0 +1,170 @@
|
||||
"""
|
||||
XPath selectors based on lxml
|
||||
"""
|
||||
|
||||
from lxml import etree
|
||||
|
||||
from scrapy.utils.misc import extract_regex
|
||||
from scrapy.utils.trackref import object_ref
|
||||
from scrapy.utils.python import unicode_to_str, flatten
|
||||
from scrapy.utils.decorator import deprecated
|
||||
from scrapy.http import HtmlResponse, XmlResponse
|
||||
from .lxmldocument import LxmlDocument
|
||||
from .csstranslator import ScrapyHTMLTranslator, ScrapyGenericTranslator
|
||||
|
||||
|
||||
__all__ = ['Selector', 'SelectorList']
|
||||
|
||||
_ctgroup = {
|
||||
'html': {'_parser': etree.HTMLParser,
|
||||
'_csstranslator': ScrapyHTMLTranslator(),
|
||||
'_tostring_method': 'html'},
|
||||
'xml': {'_parser': etree.XMLParser,
|
||||
'_csstranslator': ScrapyGenericTranslator(),
|
||||
'_tostring_method': 'xml'},
|
||||
}
|
||||
|
||||
|
||||
def _st(response, st):
|
||||
if st is None:
|
||||
return 'xml' if isinstance(response, XmlResponse) else 'html'
|
||||
elif st in ('xml', 'html'):
|
||||
return st
|
||||
else:
|
||||
raise ValueError('Invalid type: %s' % st)
|
||||
|
||||
|
||||
def _response_from_text(text, st):
|
||||
rt = XmlResponse if st == 'xml' else HtmlResponse
|
||||
return rt(url='about:blank', encoding='utf-8',
|
||||
body=unicode_to_str(text, 'utf-8'))
|
||||
|
||||
|
||||
class Selector(object_ref):
|
||||
|
||||
__slots__ = ['response', 'text', 'namespaces', 'type', '_expr', '_root',
|
||||
'__weakref__', '_parser', '_csstranslator', '_tostring_method']
|
||||
|
||||
_default_type = None
|
||||
|
||||
def __init__(self, response=None, text=None, type=None, namespaces=None,
|
||||
_root=None, _expr=None):
|
||||
self.type = st = _st(response, type or self._default_type)
|
||||
self._parser = _ctgroup[st]['_parser']
|
||||
self._csstranslator = _ctgroup[st]['_csstranslator']
|
||||
self._tostring_method = _ctgroup[st]['_tostring_method']
|
||||
|
||||
if text is not None:
|
||||
response = _response_from_text(text, st)
|
||||
|
||||
if response is not None:
|
||||
_root = LxmlDocument(response, self._parser)
|
||||
|
||||
self.response = response
|
||||
self.namespaces = namespaces
|
||||
self._root = _root
|
||||
self._expr = _expr
|
||||
|
||||
def xpath(self, query):
|
||||
try:
|
||||
xpathev = self._root.xpath
|
||||
except AttributeError:
|
||||
return SelectorList([])
|
||||
|
||||
try:
|
||||
result = xpathev(query, namespaces=self.namespaces)
|
||||
except etree.XPathError:
|
||||
raise ValueError("Invalid XPath: %s" % query)
|
||||
|
||||
if type(result) is not list:
|
||||
result = [result]
|
||||
|
||||
result = [self.__class__(_root=x, _expr=query,
|
||||
namespaces=self.namespaces,
|
||||
type=self.type)
|
||||
for x in result]
|
||||
return SelectorList(result)
|
||||
|
||||
def css(self, query):
|
||||
return self.xpath(self._css2xpath(query))
|
||||
|
||||
def _css2xpath(self, query):
|
||||
return self._csstranslator.css_to_xpath(query)
|
||||
|
||||
def re(self, regex):
|
||||
return extract_regex(regex, self.extract())
|
||||
|
||||
def extract(self):
|
||||
try:
|
||||
return etree.tostring(self._root,
|
||||
method=self._tostring_method,
|
||||
encoding=unicode,
|
||||
with_tail=False)
|
||||
except (AttributeError, TypeError):
|
||||
if self._root is True:
|
||||
return u'1'
|
||||
elif self._root is False:
|
||||
return u'0'
|
||||
else:
|
||||
return unicode(self._root)
|
||||
|
||||
def register_namespace(self, prefix, uri):
|
||||
if self.namespaces is None:
|
||||
self.namespaces = {}
|
||||
self.namespaces[prefix] = uri
|
||||
|
||||
def remove_namespaces(self):
|
||||
for el in self._root.iter('*'):
|
||||
if el.tag.startswith('{'):
|
||||
el.tag = el.tag.split('}', 1)[1]
|
||||
# loop on element attributes also
|
||||
for an in el.attrib.keys():
|
||||
if an.startswith('{'):
|
||||
el.attrib[an.split('}', 1)[1]] = el.attrib.pop(an)
|
||||
|
||||
def __nonzero__(self):
|
||||
return bool(self.extract())
|
||||
|
||||
def __str__(self):
|
||||
data = repr(self.extract()[:40])
|
||||
return "<%s xpath=%r data=%s>" % (type(self).__name__, self._expr, data)
|
||||
__repr__ = __str__
|
||||
|
||||
# Deprecated api
|
||||
@deprecated(use_instead='.xpath()')
|
||||
def select(self, xpath):
|
||||
return self.xpath(xpath)
|
||||
|
||||
@deprecated(use_instead='.extract()')
|
||||
def extract_unquoted(self):
|
||||
return self.extract()
|
||||
|
||||
|
||||
class SelectorList(list):
|
||||
|
||||
def __getslice__(self, i, j):
|
||||
return self.__class__(list.__getslice__(self, i, j))
|
||||
|
||||
def xpath(self, xpath):
|
||||
return self.__class__(flatten([x.xpath(xpath) for x in self]))
|
||||
|
||||
def css(self, xpath):
|
||||
return self.__class__(flatten([x.css(xpath) for x in self]))
|
||||
|
||||
def re(self, regex):
|
||||
return flatten([x.re(regex) for x in self])
|
||||
|
||||
def extract(self):
|
||||
return [x.extract() for x in self]
|
||||
|
||||
@deprecated(use_instead='.extract()')
|
||||
def extract_unquoted(self):
|
||||
return [x.extract_unquoted() for x in self]
|
||||
|
||||
@deprecated(use_instead='.xpath()')
|
||||
def x(self, xpath):
|
||||
return self.select(xpath)
|
||||
|
||||
@deprecated(use_instead='.xpath()')
|
||||
def select(self, xpath):
|
||||
return self.xpath(xpath)
|
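As a hedged usage sketch of the unified API defined above (the HTML body is illustrative)::

    from scrapy.selector import Selector

    sel = Selector(text=u'<html><body><p class="one">Works</p></body></html>')
    sel.xpath('//p/text()').extract()          # [u'Works']
    sel.css('p.one::text').extract()           # [u'Works']
    sel.xpath('//p').css('::text').re(r'\w+')  # mixing .xpath()/.css()/.re(): [u'Works']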
@ -11,20 +11,20 @@ from w3lib.url import any_to_uri

from scrapy.item import BaseItem
from scrapy.spider import BaseSpider
from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.utils.spider import create_spider_for_request
from scrapy.utils.misc import load_object
from scrapy.utils.response import open_in_browser
from scrapy.utils.console import start_python_console
from scrapy.settings import Settings
from scrapy.http import Request, Response, HtmlResponse, XmlResponse
from scrapy.http import Request, Response
from scrapy.exceptions import IgnoreRequest


class Shell(object):

    relevant_classes = (BaseSpider, Request, Response, BaseItem,
                        XPathSelector, Settings)
                        Selector, Settings)

    def __init__(self, crawler, update_vars=None, code=None):
        self.crawler = crawler
@ -95,10 +95,7 @@ class Shell(object):
        self.vars['spider'] = spider
        self.vars['request'] = request
        self.vars['response'] = response
        self.vars['xxs'] = XmlXPathSelector(response) \
            if isinstance(response, XmlResponse) else None
        self.vars['hxs'] = HtmlXPathSelector(response) \
            if isinstance(response, HtmlResponse) else None
        self.vars['sel'] = Selector(response)
        if self.inthread:
            self.vars['fetch'] = self.fetch
            self.vars['view'] = open_in_browser
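With this change an interactive session exposes a single ``sel`` variable instead of the old ``hxs``/``xxs`` pair; a hedged example session (URL illustrative)::

    $ scrapy shell http://www.example.com
    >>> sel.xpath('//title/text()').extract()
    >>> sel.css('title::text').extract()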
@ -31,7 +31,7 @@ class ShellTest(ProcessTest, SiteTest, unittest.TestCase):

    @defer.inlineCallbacks
    def test_response_selector_html(self):
        xpath = 'hxs.select("//p[@class=\'one\']/text()").extract()[0]'
        xpath = 'sel.xpath("//p[@class=\'one\']/text()").extract()[0]'
        _, out, _ = yield self.execute([self.url('/html'), '-c', xpath])
        self.assertEqual(out.strip(), 'Works')
@ -4,7 +4,7 @@ from scrapy.contrib.loader import ItemLoader, XPathItemLoader
from scrapy.contrib.loader.processor import Join, Identity, TakeFirst, \
    Compose, MapCompose
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.http import HtmlResponse


@ -379,7 +379,7 @@ class XPathItemLoaderTest(unittest.TestCase):
        self.assertRaises(RuntimeError, XPathItemLoader)

    def test_constructor_with_selector(self):
        sel = HtmlXPathSelector(text=u"<html><body><div>marta</div></body></html>")
        sel = Selector(text=u"<html><body><div>marta</div></body></html>")
        l = TestXPathItemLoader(selector=sel)
        self.assert_(l.selector is sel)
        l.add_xpath('name', '//div/text()')
@ -1,20 +0,0 @@
|
||||
from twisted.trial import unittest
|
||||
|
||||
from scrapy.utils.test import libxml2debug
|
||||
from scrapy import optional_features
|
||||
|
||||
|
||||
class Libxml2Test(unittest.TestCase):
|
||||
|
||||
skip = 'libxml2' not in optional_features
|
||||
|
||||
@libxml2debug
|
||||
def test_libxml2_bug_2_6_27(self):
|
||||
# this test will fail in version 2.6.27 but passes on 2.6.29+
|
||||
import libxml2
|
||||
html = "<td>1<b>2</b>3</td>"
|
||||
node = libxml2.htmlParseDoc(html, 'utf-8')
|
||||
result = [str(r) for r in node.xpathEval('//text()')]
|
||||
self.assertEquals(result, ['1', '2', '3'])
|
||||
node.freeDoc()
|
||||
|
@ -1,87 +1,85 @@
|
||||
"""
|
||||
Selectors tests, common for all backends
|
||||
"""
|
||||
|
||||
import re
|
||||
import warnings
|
||||
import weakref
|
||||
|
||||
from twisted.trial import unittest
|
||||
|
||||
from scrapy.exceptions import ScrapyDeprecationWarning
|
||||
from scrapy.http import TextResponse, HtmlResponse, XmlResponse
|
||||
from scrapy.selector import XmlXPathSelector, HtmlXPathSelector, \
|
||||
XPathSelector
|
||||
from scrapy.utils.test import libxml2debug
|
||||
from scrapy.selector import Selector
|
||||
from scrapy.selector.lxmlsel import XmlXPathSelector, HtmlXPathSelector, XPathSelector
|
||||
|
||||
class XPathSelectorTestCase(unittest.TestCase):
|
||||
|
||||
xs_cls = XPathSelector
|
||||
hxs_cls = HtmlXPathSelector
|
||||
xxs_cls = XmlXPathSelector
|
||||
class SelectorTestCase(unittest.TestCase):
|
||||
|
||||
@libxml2debug
|
||||
def test_selector_simple(self):
|
||||
sscls = Selector
|
||||
|
||||
def test_simple_selection(self):
|
||||
"""Simple selector tests"""
|
||||
body = "<p><input name='a'value='1'/><input name='b'value='2'/></p>"
|
||||
response = TextResponse(url="http://example.com", body=body)
|
||||
xpath = self.hxs_cls(response)
|
||||
sel = self.sscls(response)
|
||||
|
||||
xl = xpath.select('//input')
|
||||
xl = sel.xpath('//input')
|
||||
self.assertEqual(2, len(xl))
|
||||
for x in xl:
|
||||
assert isinstance(x, self.hxs_cls)
|
||||
assert isinstance(x, self.sscls)
|
||||
|
||||
self.assertEqual(xpath.select('//input').extract(),
|
||||
[x.extract() for x in xpath.select('//input')])
|
||||
self.assertEqual(sel.xpath('//input').extract(),
|
||||
[x.extract() for x in sel.xpath('//input')])
|
||||
|
||||
self.assertEqual([x.extract() for x in xpath.select("//input[@name='a']/@name")],
|
||||
self.assertEqual([x.extract() for x in sel.xpath("//input[@name='a']/@name")],
|
||||
[u'a'])
|
||||
self.assertEqual([x.extract() for x in xpath.select("number(concat(//input[@name='a']/@value, //input[@name='b']/@value))")],
|
||||
self.assertEqual([x.extract() for x in sel.xpath("number(concat(//input[@name='a']/@value, //input[@name='b']/@value))")],
|
||||
[u'12.0'])
|
||||
|
||||
self.assertEqual(xpath.select("concat('xpath', 'rules')").extract(),
|
||||
self.assertEqual(sel.xpath("concat('xpath', 'rules')").extract(),
|
||||
[u'xpathrules'])
|
||||
self.assertEqual([x.extract() for x in xpath.select("concat(//input[@name='a']/@value, //input[@name='b']/@value)")],
|
||||
self.assertEqual([x.extract() for x in sel.xpath("concat(//input[@name='a']/@value, //input[@name='b']/@value)")],
|
||||
[u'12'])
|
||||
|
||||
@libxml2debug
|
||||
def test_selector_unicode_query(self):
|
||||
def test_select_unicode_query(self):
|
||||
body = u"<p><input name='\xa9' value='1'/></p>"
|
||||
response = TextResponse(url="http://example.com", body=body, encoding='utf8')
|
||||
xpath = self.hxs_cls(response)
|
||||
self.assertEqual(xpath.select(u'//input[@name="\xa9"]/@value').extract(), [u'1'])
|
||||
sel = self.sscls(response)
|
||||
self.assertEqual(sel.xpath(u'//input[@name="\xa9"]/@value').extract(), [u'1'])
|
||||
|
||||
@libxml2debug
|
||||
def test_selector_same_type(self):
|
||||
"""Test XPathSelector returning the same type in x() method"""
|
||||
def test_list_elements_type(self):
|
||||
"""Test Selector returning the same type in selection methods"""
|
||||
text = '<p>test<p>'
|
||||
assert isinstance(self.xxs_cls(text=text).select("//p")[0],
|
||||
self.xxs_cls)
|
||||
assert isinstance(self.hxs_cls(text=text).select("//p")[0],
|
||||
self.hxs_cls)
|
||||
assert isinstance(self.sscls(text=text).xpath("//p")[0], self.sscls)
|
||||
assert isinstance(self.sscls(text=text).css("p")[0], self.sscls)
|
||||
|
||||
@libxml2debug
|
||||
def test_selector_boolean_result(self):
|
||||
def test_boolean_result(self):
|
||||
body = "<p><input name='a'value='1'/><input name='b'value='2'/></p>"
|
||||
response = TextResponse(url="http://example.com", body=body)
|
||||
xs = self.hxs_cls(response)
|
||||
self.assertEquals(xs.select("//input[@name='a']/@name='a'").extract(), [u'1'])
|
||||
self.assertEquals(xs.select("//input[@name='a']/@name='n'").extract(), [u'0'])
|
||||
|
||||
@libxml2debug
|
||||
def test_selector_xml_html(self):
|
||||
"""Test that XML and HTML XPathSelector's behave differently"""
|
||||
xs = self.sscls(response)
|
||||
self.assertEquals(xs.xpath("//input[@name='a']/@name='a'").extract(), [u'1'])
|
||||
self.assertEquals(xs.xpath("//input[@name='a']/@name='n'").extract(), [u'0'])
|
||||
|
||||
def test_differences_parsing_xml_vs_html(self):
|
||||
"""Test that XML and HTML Selector's behave differently"""
|
||||
# some text which is parsed differently by XML and HTML flavors
|
||||
text = '<div><img src="a.jpg"><p>Hello</div>'
|
||||
|
||||
self.assertEqual(self.xxs_cls(text=text).select("//div").extract(),
|
||||
[u'<div><img src="a.jpg"><p>Hello</p></img></div>'])
|
||||
|
||||
self.assertEqual(self.hxs_cls(text=text).select("//div").extract(),
|
||||
hs = self.sscls(text=text, type='html')
|
||||
self.assertEqual(hs.xpath("//div").extract(),
|
||||
[u'<div><img src="a.jpg"><p>Hello</p></div>'])
|
||||
|
||||
@libxml2debug
|
||||
def test_selector_nested(self):
|
||||
xs = self.sscls(text=text, type='xml')
|
||||
self.assertEqual(xs.xpath("//div").extract(),
|
||||
[u'<div><img src="a.jpg"><p>Hello</p></img></div>'])
|
||||
|
||||
def test_flavor_detection(self):
|
||||
text = '<div><img src="a.jpg"><p>Hello</div>'
|
||||
sel = self.sscls(XmlResponse('http://example.com', body=text))
|
||||
self.assertEqual(sel.type, 'xml')
|
||||
self.assertEqual(sel.xpath("//div").extract(),
|
||||
[u'<div><img src="a.jpg"><p>Hello</p></img></div>'])
|
||||
|
||||
sel = self.sscls(HtmlResponse('http://example.com', body=text))
|
||||
self.assertEqual(sel.type, 'html')
|
||||
self.assertEqual(sel.xpath("//div").extract(),
|
||||
[u'<div><img src="a.jpg"><p>Hello</p></div>'])
|
||||
|
||||
def test_nested_selectors(self):
|
||||
"""Nested selector tests"""
|
||||
body = """<body>
|
||||
<div class='one'>
|
||||
@ -97,26 +95,30 @@ class XPathSelectorTestCase(unittest.TestCase):
|
||||
</body>"""
|
||||
|
||||
response = HtmlResponse(url="http://example.com", body=body)
|
||||
x = self.hxs_cls(response)
|
||||
|
||||
divtwo = x.select('//div[@class="two"]')
|
||||
self.assertEqual(map(unicode.strip, divtwo.select("//li").extract()),
|
||||
x = self.sscls(response)
|
||||
divtwo = x.xpath('//div[@class="two"]')
|
||||
self.assertEqual(divtwo.xpath("//li").extract(),
|
||||
["<li>one</li>", "<li>two</li>", "<li>four</li>", "<li>five</li>", "<li>six</li>"])
|
||||
self.assertEqual(map(unicode.strip, divtwo.select("./ul/li").extract()),
|
||||
self.assertEqual(divtwo.xpath("./ul/li").extract(),
|
||||
["<li>four</li>", "<li>five</li>", "<li>six</li>"])
|
||||
self.assertEqual(map(unicode.strip, divtwo.select(".//li").extract()),
|
||||
self.assertEqual(divtwo.xpath(".//li").extract(),
|
||||
["<li>four</li>", "<li>five</li>", "<li>six</li>"])
|
||||
self.assertEqual(divtwo.select("./li").extract(),
|
||||
[])
|
||||
self.assertEqual(divtwo.xpath("./li").extract(), [])
|
||||
|
||||
def test_mixed_nested_selectors(self):
|
||||
body = '''<body>
|
||||
<div id=1>not<span>me</span></div>
|
||||
<div class="dos"><p>text</p><a href='#'>foo</a></div>
|
||||
</body>'''
|
||||
sel = self.sscls(text=body)
|
||||
self.assertEqual(sel.xpath('//div[@id="1"]').css('span::text').extract(), [u'me'])
|
||||
self.assertEqual(sel.css('#1').xpath('./span/text()').extract(), [u'me'])
|
||||
|
||||
@libxml2debug
|
||||
def test_dont_strip(self):
|
||||
hxs = self.hxs_cls(text='<div>fff: <a href="#">zzz</a></div>')
|
||||
self.assertEqual(hxs.select("//text()").extract(),
|
||||
[u'fff: ', u'zzz'])
|
||||
sel = self.sscls(text='<div>fff: <a href="#">zzz</a></div>')
|
||||
self.assertEqual(sel.xpath("//text()").extract(), [u'fff: ', u'zzz'])
|
||||
|
||||
@libxml2debug
|
||||
def test_selector_namespaces_simple(self):
|
||||
def test_namespaces_simple(self):
|
||||
body = """
|
||||
<test xmlns:somens="http://scrapy.org">
|
||||
<somens:a id="foo">take this</a>
|
||||
@ -125,14 +127,13 @@ class XPathSelectorTestCase(unittest.TestCase):
|
||||
"""
|
||||
|
||||
response = XmlResponse(url="http://example.com", body=body)
|
||||
x = self.xxs_cls(response)
|
||||
x = self.sscls(response)
|
||||
|
||||
x.register_namespace("somens", "http://scrapy.org")
|
||||
self.assertEqual(x.select("//somens:a/text()").extract(),
|
||||
self.assertEqual(x.xpath("//somens:a/text()").extract(),
|
||||
[u'take this'])
|
||||
|
||||
@libxml2debug
|
||||
def test_selector_namespaces_multiple(self):
|
||||
def test_namespaces_multiple(self):
|
||||
body = """<?xml version="1.0" encoding="UTF-8"?>
|
||||
<BrowseNode xmlns="http://webservices.amazon.com/AWSECommerceService/2005-10-05"
|
||||
xmlns:b="http://somens.com"
|
||||
@ -143,20 +144,18 @@ class XPathSelectorTestCase(unittest.TestCase):
|
||||
</BrowseNode>
|
||||
"""
|
||||
response = XmlResponse(url="http://example.com", body=body)
|
||||
x = self.xxs_cls(response)
|
||||
|
||||
x = self.sscls(response)
|
||||
x.register_namespace("xmlns", "http://webservices.amazon.com/AWSECommerceService/2005-10-05")
|
||||
x.register_namespace("p", "http://www.scrapy.org/product")
|
||||
x.register_namespace("b", "http://somens.com")
|
||||
self.assertEqual(len(x.select("//xmlns:TestTag")), 1)
|
||||
self.assertEqual(x.select("//b:Operation/text()").extract()[0], 'hello')
|
||||
self.assertEqual(x.select("//xmlns:TestTag/@b:att").extract()[0], 'value')
|
||||
self.assertEqual(x.select("//p:SecondTestTag/xmlns:price/text()").extract()[0], '90')
|
||||
self.assertEqual(x.select("//p:SecondTestTag").select("./xmlns:price/text()")[0].extract(), '90')
|
||||
self.assertEqual(x.select("//p:SecondTestTag/xmlns:material/text()").extract()[0], 'iron')
|
||||
self.assertEqual(len(x.xpath("//xmlns:TestTag")), 1)
|
||||
self.assertEqual(x.xpath("//b:Operation/text()").extract()[0], 'hello')
|
||||
self.assertEqual(x.xpath("//xmlns:TestTag/@b:att").extract()[0], 'value')
|
||||
self.assertEqual(x.xpath("//p:SecondTestTag/xmlns:price/text()").extract()[0], '90')
|
||||
self.assertEqual(x.xpath("//p:SecondTestTag").xpath("./xmlns:price/text()")[0].extract(), '90')
|
||||
self.assertEqual(x.xpath("//p:SecondTestTag/xmlns:material/text()").extract()[0], 'iron')
|
||||
|
||||
@libxml2debug
|
||||
def test_selector_re(self):
|
||||
def test_re(self):
|
||||
body = """<div>Name: Mary
|
||||
<ul>
|
||||
<li>Name: John</li>
|
||||
@ -165,47 +164,35 @@ class XPathSelectorTestCase(unittest.TestCase):
|
||||
<li>Age: 20</li>
|
||||
</ul>
|
||||
Age: 20
|
||||
</div>
|
||||
|
||||
"""
|
||||
</div>"""
|
||||
response = HtmlResponse(url="http://example.com", body=body)
|
||||
x = self.hxs_cls(response)
|
||||
x = self.sscls(response)
|
||||
|
||||
name_re = re.compile("Name: (\w+)")
|
||||
self.assertEqual(x.select("//ul/li").re(name_re),
|
||||
self.assertEqual(x.xpath("//ul/li").re(name_re),
|
||||
["John", "Paul"])
|
||||
self.assertEqual(x.select("//ul/li").re("Age: (\d+)"),
|
||||
self.assertEqual(x.xpath("//ul/li").re("Age: (\d+)"),
|
||||
["10", "20"])
|
||||
|
||||
@libxml2debug
|
||||
def test_selector_re_intl(self):
|
||||
def test_re_intl(self):
|
||||
body = """<div>Evento: cumplea\xc3\xb1os</div>"""
|
||||
response = HtmlResponse(url="http://example.com", body=body, encoding='utf-8')
|
||||
x = self.hxs_cls(response)
|
||||
self.assertEqual(x.select("//div").re("Evento: (\w+)"), [u'cumplea\xf1os'])
|
||||
x = self.sscls(response)
|
||||
self.assertEqual(x.xpath("//div").re("Evento: (\w+)"), [u'cumplea\xf1os'])
|
||||
|
||||
@libxml2debug
|
||||
def test_selector_over_text(self):
|
||||
hxs = self.hxs_cls(text='<root>lala</root>')
|
||||
self.assertEqual(hxs.extract(),
|
||||
u'<html><body><root>lala</root></body></html>')
|
||||
hs = self.sscls(text='<root>lala</root>')
|
||||
self.assertEqual(hs.extract(), u'<html><body><root>lala</root></body></html>')
|
||||
xs = self.sscls(text='<root>lala</root>', type='xml')
|
||||
self.assertEqual(xs.extract(), u'<root>lala</root>')
|
||||
self.assertEqual(xs.xpath('.').extract(), [u'<root>lala</root>'])
|
||||
|
||||
xxs = self.xxs_cls(text='<root>lala</root>')
|
||||
self.assertEqual(xxs.extract(),
|
||||
u'<root>lala</root>')
|
||||
|
||||
xxs = self.xxs_cls(text='<root>lala</root>')
|
||||
self.assertEqual(xxs.select('.').extract(),
|
||||
[u'<root>lala</root>'])
|
||||
|
||||
|
||||
@libxml2debug
|
||||
def test_selector_invalid_xpath(self):
|
||||
def test_invalid_xpath(self):
|
||||
response = XmlResponse(url="http://example.com", body="<html></html>")
|
||||
x = self.hxs_cls(response)
|
||||
x = self.sscls(response)
|
||||
xpath = "//test[@foo='bar]"
|
||||
try:
|
||||
x.select(xpath)
|
||||
x.xpath(xpath)
|
||||
except ValueError, e:
|
||||
assert xpath in str(e), "Exception message does not contain invalid xpath"
|
||||
except Exception:
|
||||
@ -213,7 +200,6 @@ class XPathSelectorTestCase(unittest.TestCase):
|
||||
else:
|
||||
raise AssertionError("An invalid XPath does not raise an exception")
|
||||
|
||||
@libxml2debug
|
||||
def test_http_header_encoding_precedence(self):
|
||||
# u'\xa3' = pound symbol in unicode
|
||||
# u'\xc2\xa3' = pound symbol in utf-8
|
||||
@ -229,71 +215,121 @@ class XPathSelectorTestCase(unittest.TestCase):
|
||||
|
||||
headers = {'Content-Type': ['text/html; charset=utf-8']}
|
||||
response = HtmlResponse(url="http://example.com", headers=headers, body=html_utf8)
|
||||
x = self.hxs_cls(response)
|
||||
self.assertEquals(x.select("//span[@id='blank']/text()").extract(),
|
||||
x = self.sscls(response)
|
||||
self.assertEquals(x.xpath("//span[@id='blank']/text()").extract(),
|
||||
[u'\xa3'])
|
||||
|
||||
@libxml2debug
|
||||
def test_empty_bodies(self):
|
||||
# shouldn't raise errors
|
||||
r1 = TextResponse('http://www.example.com', body='')
|
||||
self.hxs_cls(r1).select('//text()').extract()
|
||||
self.xxs_cls(r1).select('//text()').extract()
|
||||
self.sscls(r1).xpath('//text()').extract()
|
||||
|
||||
@libxml2debug
|
||||
def test_null_bytes(self):
|
||||
# shouldn't raise errors
|
||||
r1 = TextResponse('http://www.example.com', \
|
||||
body='<root>pre\x00post</root>', \
|
||||
encoding='utf-8')
|
||||
self.hxs_cls(r1).select('//text()').extract()
|
||||
self.xxs_cls(r1).select('//text()').extract()
|
||||
self.sscls(r1).xpath('//text()').extract()
|
||||
|
||||
@libxml2debug
|
||||
def test_badly_encoded_body(self):
|
||||
# \xe9 alone isn't valid utf8 sequence
|
||||
r1 = TextResponse('http://www.example.com', \
|
||||
body='<html><p>an Jos\xe9 de</p><html>', \
|
||||
encoding='utf-8')
|
||||
self.hxs_cls(r1).select('//text()').extract()
|
||||
self.xxs_cls(r1).select('//text()').extract()
|
||||
self.sscls(r1).xpath('//text()').extract()
|
||||
|
||||
@libxml2debug
|
||||
def test_select_on_unevaluable_nodes(self):
|
||||
r = self.hxs_cls(text=u'<span class="big">some text</span>')
|
||||
r = self.sscls(text=u'<span class="big">some text</span>')
|
||||
# Text node
|
||||
x1 = r.select('//text()')
|
||||
x1 = r.xpath('//text()')
|
||||
self.assertEquals(x1.extract(), [u'some text'])
|
||||
self.assertEquals(x1.select('.//b').extract(), [])
|
||||
self.assertEquals(x1.xpath('.//b').extract(), [])
|
||||
# Tag attribute
|
||||
x1 = r.select('//span/@class')
|
||||
x1 = r.xpath('//span/@class')
|
||||
self.assertEquals(x1.extract(), [u'big'])
|
||||
self.assertEquals(x1.select('.//text()').extract(), [])
|
||||
self.assertEquals(x1.xpath('.//text()').extract(), [])
|
||||
|
||||
@libxml2debug
|
||||
def test_select_on_text_nodes(self):
|
||||
r = self.hxs_cls(text=u'<div><b>Options:</b>opt1</div><div><b>Other</b>opt2</div>')
|
||||
x1 = r.select("//div/descendant::text()[preceding-sibling::b[contains(text(), 'Options')]]")
|
||||
r = self.sscls(text=u'<div><b>Options:</b>opt1</div><div><b>Other</b>opt2</div>')
|
||||
x1 = r.xpath("//div/descendant::text()[preceding-sibling::b[contains(text(), 'Options')]]")
|
||||
self.assertEquals(x1.extract(), [u'opt1'])
|
||||
|
||||
x1 = r.select("//div/descendant::text()/preceding-sibling::b[contains(text(), 'Options')]")
|
||||
x1 = r.xpath("//div/descendant::text()/preceding-sibling::b[contains(text(), 'Options')]")
|
||||
self.assertEquals(x1.extract(), [u'<b>Options:</b>'])
|
||||
|
||||
@libxml2debug
|
||||
def test_nested_select_on_text_nodes(self):
|
||||
# FIXME: does not work with lxml backend [upstream]
|
||||
r = self.hxs_cls(text=u'<div><b>Options:</b>opt1</div><div><b>Other</b>opt2</div>')
|
||||
x1 = r.select("//div/descendant::text()")
|
||||
x2 = x1.select("./preceding-sibling::b[contains(text(), 'Options')]")
|
||||
|
||||
r = self.sscls(text=u'<div><b>Options:</b>opt1</div><div><b>Other</b>opt2</div>')
|
||||
x1 = r.xpath("//div/descendant::text()")
|
||||
x2 = x1.xpath("./preceding-sibling::b[contains(text(), 'Options')]")
|
||||
self.assertEquals(x2.extract(), [u'<b>Options:</b>'])
|
||||
test_nested_select_on_text_nodes.skip = True
|
||||
test_nested_select_on_text_nodes.skip = "Text nodes lost parent node reference in lxml"
|
||||
|
||||
@libxml2debug
|
||||
def test_weakref_slots(self):
|
||||
"""Check that classes are using slots and are weak-referenceable"""
|
||||
for cls in [self.xs_cls, self.hxs_cls, self.xxs_cls]:
|
||||
x = cls()
|
||||
weakref.ref(x)
|
||||
assert not hasattr(x, '__dict__'), "%s does not use __slots__" % \
|
||||
x.__class__.__name__
|
||||
x = self.sscls()
|
||||
weakref.ref(x)
|
||||
assert not hasattr(x, '__dict__'), "%s does not use __slots__" % \
|
||||
x.__class__.__name__
|
||||
|
||||
def test_remove_namespaces(self):
|
||||
xml = """<?xml version="1.0" encoding="UTF-8"?>
|
||||
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-US" xmlns:media="http://search.yahoo.com/mrss/">
|
||||
<link type="text/html">
|
||||
<link type="application/atom+xml">
|
||||
</feed>
|
||||
"""
|
||||
sel = self.sscls(XmlResponse("http://example.com/feed.atom", body=xml))
|
||||
self.assertEqual(len(sel.xpath("//link")), 0)
|
||||
sel.remove_namespaces()
|
||||
self.assertEqual(len(sel.xpath("//link")), 2)
|
||||
|
||||
def test_remove_attributes_namespaces(self):
|
||||
xml = """<?xml version="1.0" encoding="UTF-8"?>
|
||||
<feed xmlns:atom="http://www.w3.org/2005/Atom" xml:lang="en-US" xmlns:media="http://search.yahoo.com/mrss/">
|
||||
<link atom:type="text/html">
|
||||
<link atom:type="application/atom+xml">
|
||||
</feed>
|
||||
"""
|
||||
sel = self.sscls(XmlResponse("http://example.com/feed.atom", body=xml))
|
||||
self.assertEqual(len(sel.xpath("//link/@type")), 0)
|
||||
sel.remove_namespaces()
|
||||
self.assertEqual(len(sel.xpath("//link/@type")), 2)
|
||||
|
||||
|
||||
class DeprecatedXpathSelectorTest(unittest.TestCase):
|
||||
|
||||
text = '<div><img src="a.jpg"><p>Hello</div>'
|
||||
|
||||
def test_warnings(self):
|
||||
for cls in XPathSelector, HtmlXPathSelector, XPathSelector:
|
||||
with warnings.catch_warnings(record=True) as w:
|
||||
warnings.simplefilter('always')
|
||||
hs = cls(text=self.text)
|
||||
assert len(w) == 1, w
|
||||
assert issubclass(w[0].category, ScrapyDeprecationWarning)
|
||||
assert 'deprecated' in str(w[-1].message)
|
||||
hs.select("//div").extract()
|
||||
assert issubclass(w[1].category, ScrapyDeprecationWarning)
|
||||
assert 'deprecated' in str(w[-1].message)
|
||||
|
||||
def test_xpathselector(self):
|
||||
with warnings.catch_warnings(record=True):
|
||||
hs = XPathSelector(text=self.text)
|
||||
self.assertEqual(hs.select("//div").extract(),
|
||||
[u'<div><img src="a.jpg"><p>Hello</p></div>'])
|
||||
self.assertRaises(RuntimeError, hs.css, 'div')
|
||||
|
||||
def test_htmlxpathselector(self):
|
||||
with warnings.catch_warnings(record=True):
|
||||
hs = HtmlXPathSelector(text=self.text)
|
||||
self.assertEqual(hs.select("//div").extract(),
|
||||
[u'<div><img src="a.jpg"><p>Hello</p></div>'])
|
||||
self.assertRaises(RuntimeError, hs.css, 'div')
|
||||
|
||||
def test_xmlxpathselector(self):
|
||||
with warnings.catch_warnings(record=True):
|
||||
xs = XmlXPathSelector(text=self.text)
|
||||
self.assertEqual(xs.select("//div").extract(),
|
||||
[u'<div><img src="a.jpg"><p>Hello</p></img></div>'])
|
||||
self.assertRaises(RuntimeError, xs.css, 'div')
|
||||
|
153
scrapy/tests/test_selector_csstranslator.py
Normal file
@ -0,0 +1,153 @@
|
||||
"""
|
||||
Selector tests for cssselect backend
|
||||
"""
|
||||
from twisted.trial import unittest
|
||||
from scrapy.http import HtmlResponse
|
||||
from scrapy.selector.csstranslator import ScrapyHTMLTranslator
|
||||
from scrapy.selector import Selector
|
||||
from cssselect.parser import SelectorSyntaxError
|
||||
from cssselect.xpath import ExpressionError
|
||||
|
||||
|
||||
HTMLBODY = '''
|
||||
<html>
|
||||
<body>
|
||||
<div>
|
||||
<a id="name-anchor" name="foo"></a>
|
||||
<a id="tag-anchor" rel="tag" href="http://localhost/foo">link</a>
|
||||
<a id="nofollow-anchor" rel="nofollow" href="https://example.org"> link</a>
|
||||
<p id="paragraph">
|
||||
lorem ipsum text
|
||||
<b id="p-b">hi</b> <em id="p-em">there</em>
|
||||
<b id="p-b2">guy</b>
|
||||
<input type="checkbox" id="checkbox-unchecked" />
|
||||
<input type="checkbox" id="checkbox-disabled" disabled="" />
|
||||
<input type="text" id="text-checked" checked="checked" />
|
||||
<input type="hidden" />
|
||||
<input type="hidden" disabled="disabled" />
|
||||
<input type="checkbox" id="checkbox-checked" checked="checked" />
|
||||
<input type="checkbox" id="checkbox-disabled-checked"
|
||||
disabled="disabled" checked="checked" />
|
||||
<fieldset id="fieldset" disabled="disabled">
|
||||
<input type="checkbox" id="checkbox-fieldset-disabled" />
|
||||
<input type="hidden" />
|
||||
</fieldset>
|
||||
</p>
|
||||
<map name="dummymap">
|
||||
<area shape="circle" coords="200,250,25" href="foo.html" id="area-href" />
|
||||
<area shape="default" id="area-nohref" />
|
||||
</map>
|
||||
</div>
|
||||
<div class="cool-footer" id="foobar-div" foobar="ab bc cde">
|
||||
<span id="foobar-span">foo ter</span>
|
||||
</div>
|
||||
</body></html>
|
||||
'''
|
||||
|
||||
|
||||
class TranslatorMixinTest(unittest.TestCase):
|
||||
|
||||
tr_cls = ScrapyHTMLTranslator
|
||||
|
||||
def setUp(self):
|
||||
self.tr = self.tr_cls()
|
||||
self.c2x = self.tr.css_to_xpath
|
||||
|
||||
def test_attr_function(self):
|
||||
cases = [
|
||||
('::attr(name)', u'descendant-or-self::*/@name'),
|
||||
('a::attr(href)', u'descendant-or-self::a/@href'),
|
||||
('a ::attr(img)', u'descendant-or-self::a/descendant-or-self::*/@img'),
|
||||
('a > ::attr(class)', u'descendant-or-self::a/*/@class'),
|
||||
]
|
||||
for css, xpath in cases:
|
||||
self.assertEqual(self.c2x(css), xpath, css)
|
||||
|
||||
def test_attr_function_exception(self):
|
||||
cases = [
|
||||
('::attr(12)', ExpressionError),
|
||||
('::attr(34test)', ExpressionError),
|
||||
('::attr(@href)', SelectorSyntaxError),
|
||||
]
|
||||
for css, exc in cases:
|
||||
self.assertRaises(exc, self.c2x, css)
|
||||
|
||||
def test_text_pseudo_element(self):
|
||||
cases = [
|
||||
('::text', u'descendant-or-self::text()'),
|
||||
('p::text', u'descendant-or-self::p/text()'),
|
||||
('p ::text', u'descendant-or-self::p/descendant-or-self::text()'),
|
||||
('#id::text', u"descendant-or-self::*[@id = 'id']/text()"),
|
||||
('p#id::text', u"descendant-or-self::p[@id = 'id']/text()"),
|
||||
('p#id ::text', u"descendant-or-self::p[@id = 'id']/descendant-or-self::text()"),
|
||||
('p#id > ::text', u"descendant-or-self::p[@id = 'id']/*/text()"),
|
||||
('p#id ~ ::text', u"descendant-or-self::p[@id = 'id']/following-sibling::*/text()"),
|
||||
('a[href]::text', u'descendant-or-self::a[@href]/text()'),
|
||||
('a[href] ::text', u'descendant-or-self::a[@href]/descendant-or-self::text()'),
|
||||
('p::text, a::text', u"descendant-or-self::p/text() | descendant-or-self::a/text()"),
|
||||
]
|
||||
for css, xpath in cases:
|
||||
self.assertEqual(self.c2x(css), xpath, css)
|
||||
|
||||
def test_pseudo_function_exception(self):
|
||||
cases = [
|
||||
('::attribute(12)', ExpressionError),
|
||||
('::text()', ExpressionError),
|
||||
('::attr(@href)', SelectorSyntaxError),
|
||||
]
|
||||
for css, exc in cases:
|
||||
self.assertRaises(exc, self.c2x, css)
|
||||
|
||||
def test_unknown_pseudo_element(self):
|
||||
cases = [
|
||||
('::text-node', ExpressionError),
|
||||
]
|
||||
for css, exc in cases:
|
||||
self.assertRaises(exc, self.c2x, css)
|
||||
|
||||
def test_unknown_pseudo_class(self):
|
||||
cases = [
|
||||
(':text', ExpressionError),
|
||||
(':attribute(name)', ExpressionError),
|
||||
]
|
||||
for css, exc in cases:
|
||||
self.assertRaises(exc, self.c2x, css)
|
||||
|
||||
|
||||
class CSSSelectorTest(unittest.TestCase):
|
||||
|
||||
sscls = Selector
|
||||
|
||||
def setUp(self):
|
||||
self.htmlresponse = HtmlResponse('http://example.com', body=HTMLBODY)
|
||||
self.sel = self.sscls(self.htmlresponse)
|
||||
|
||||
def x(self, *a, **kw):
|
||||
return [v.strip() for v in self.sel.css(*a, **kw).extract() if v.strip()]
|
||||
|
||||
def test_selector_simple(self):
|
||||
for x in self.sel.css('input'):
|
||||
self.assertTrue(isinstance(x, self.sel.__class__), x)
|
||||
self.assertEqual(self.sel.css('input').extract(),
|
||||
[x.extract() for x in self.sel.css('input')])
|
||||
|
||||
def test_text_pseudo_element(self):
|
||||
self.assertEqual(self.x('#p-b2'), [u'<b id="p-b2">guy</b>'])
|
||||
self.assertEqual(self.x('#p-b2::text'), [u'guy'])
|
||||
self.assertEqual(self.x('#p-b2 ::text'), [u'guy'])
|
||||
self.assertEqual(self.x('#paragraph::text'), [u'lorem ipsum text'])
|
||||
self.assertEqual(self.x('#paragraph ::text'), [u'lorem ipsum text', u'hi', u'there', u'guy'])
|
||||
self.assertEqual(self.x('p::text'), [u'lorem ipsum text'])
|
||||
self.assertEqual(self.x('p ::text'), [u'lorem ipsum text', u'hi', u'there', u'guy'])
|
||||
|
||||
def test_attribute_function(self):
|
||||
self.assertEqual(self.x('#p-b2::attr(id)'), [u'p-b2'])
|
||||
self.assertEqual(self.x('.cool-footer::attr(class)'), [u'cool-footer'])
|
||||
self.assertEqual(self.x('.cool-footer ::attr(id)'), [u'foobar-div', u'foobar-span'])
|
||||
self.assertEqual(self.x('map[name="dummymap"] ::attr(shape)'), [u'circle', u'default'])
|
||||
|
||||
def test_nested_selector(self):
|
||||
self.assertEqual(self.sel.css('p').css('b::text').extract(),
|
||||
[u'hi', u'guy'])
|
||||
self.assertEqual(self.sel.css('div').css('area:last-child').extract(),
|
||||
[u'<area shape="default" id="area-nohref">'])
|
@ -1,98 +0,0 @@
"""
Selectors tests, specific for libxml2 backend
"""

from twisted.trial import unittest
from scrapy import optional_features


from scrapy.http import TextResponse, HtmlResponse, XmlResponse
from scrapy.selector.libxml2sel import XmlXPathSelector, HtmlXPathSelector, \
    XPathSelector
from scrapy.selector.libxml2document import Libxml2Document
from scrapy.utils.test import libxml2debug
from scrapy.tests import test_selector


class Libxml2XPathSelectorTestCase(test_selector.XPathSelectorTestCase):

    xs_cls = XPathSelector
    hxs_cls = HtmlXPathSelector
    xxs_cls = XmlXPathSelector

    skip = 'libxml2' not in optional_features

    @libxml2debug
    def test_null_bytes(self):
        hxs = HtmlXPathSelector(text='<root>la\x00la</root>')
        self.assertEqual(hxs.extract(),
                         u'<html><body><root>lala</root></body></html>')

        xxs = XmlXPathSelector(text='<root>la\x00la</root>')
        self.assertEqual(xxs.extract(),
                         u'<root>lala</root>')

    @libxml2debug
    def test_unquote(self):
        xmldoc = '\n'.join((
            '<root>',
            ' lala',
            ' <node>',
            ' blabla&more<!--comment-->a<b>test</b>oh',
            ' <![CDATA[lalalal&ppppp<b>PPPP</b>ppp&la]]>',
            ' </node>',
            ' pff',
            '</root>'))
        xxs = XmlXPathSelector(text=xmldoc)

        self.assertEqual(xxs.extract_unquoted(), u'')

        self.assertEqual(xxs.select('/root').extract_unquoted(), [u''])
        self.assertEqual(xxs.select('/root/text()').extract_unquoted(), [
            u'\n lala\n ',
            u'\n pff\n'])

        self.assertEqual(xxs.select('//*').extract_unquoted(), [u'', u'', u''])
        self.assertEqual(xxs.select('//text()').extract_unquoted(), [
            u'\n lala\n ',
            u'\n blabla&more',
            u'a',
            u'test',
            u'oh\n ',
            u'lalalal&ppppp<b>PPPP</b>ppp&la',
            u'\n ',
            u'\n pff\n'])


class Libxml2DocumentTest(unittest.TestCase):

    skip = 'libxml2' not in optional_features

    @libxml2debug
    def test_response_libxml2_caching(self):
        r1 = HtmlResponse('http://www.example.com', body='<html><head></head><body></body></html>')
        r2 = r1.copy()

        doc1 = Libxml2Document(r1)
        doc2 = Libxml2Document(r1)
        doc3 = Libxml2Document(r2)

        # make sure it's cached
        assert doc1 is doc2
        assert doc1.xmlDoc is doc2.xmlDoc
        assert doc1 is not doc3
        assert doc1.xmlDoc is not doc3.xmlDoc

        # don't leave libxml2 documents in memory to avoid wrong libxml2 leaks reports
        del doc1, doc2, doc3

    @libxml2debug
    def test_null_char(self):
        # make sure bodies with null char ('\x00') don't raise a TypeError exception
        self.body_content = 'test problematic \x00 body'
        response = TextResponse('http://example.com/catalog/product/blabla-123',
            headers={'Content-Type': 'text/plain; charset=utf-8'}, body=self.body_content)
        Libxml2Document(response)

if __name__ == "__main__":
    unittest.main()
@ -1,64 +0,0 @@
"""
Selectors tests, specific for lxml backend
"""

import unittest
from scrapy.tests import test_selector
from scrapy.http import TextResponse, HtmlResponse, XmlResponse
from scrapy.selector.lxmldocument import LxmlDocument
from scrapy.selector.lxmlsel import XmlXPathSelector, HtmlXPathSelector, XPathSelector


class LxmlXPathSelectorTestCase(test_selector.XPathSelectorTestCase):

    xs_cls = XPathSelector
    hxs_cls = HtmlXPathSelector
    xxs_cls = XmlXPathSelector

    def test_remove_namespaces(self):
        xml = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-US" xmlns:media="http://search.yahoo.com/mrss/">
<link type="text/html">
<link type="application/atom+xml">
</feed>
"""
        xxs = XmlXPathSelector(XmlResponse("http://example.com/feed.atom", body=xml))
        self.assertEqual(len(xxs.select("//link")), 0)
        xxs.remove_namespaces()
        self.assertEqual(len(xxs.select("//link")), 2)

    def test_remove_attributes_namespaces(self):
        xml = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns:atom="http://www.w3.org/2005/Atom" xml:lang="en-US" xmlns:media="http://search.yahoo.com/mrss/">
<link atom:type="text/html">
<link atom:type="application/atom+xml">
</feed>
"""
        xxs = XmlXPathSelector(XmlResponse("http://example.com/feed.atom", body=xml))
        self.assertEqual(len(xxs.select("//link/@type")), 0)
        xxs.remove_namespaces()
        self.assertEqual(len(xxs.select("//link/@type")), 2)


class Libxml2DocumentTest(unittest.TestCase):

    def test_caching(self):
        r1 = HtmlResponse('http://www.example.com', body='<html><head></head><body></body></html>')
        r2 = r1.copy()

        doc1 = LxmlDocument(r1)
        doc2 = LxmlDocument(r1)
        doc3 = LxmlDocument(r2)

        # make sure it's cached
        assert doc1 is doc2
        assert doc1 is not doc3

        # don't leave documents in memory to avoid wrong libxml2 leaks reports
        del doc1, doc2, doc3

    def test_null_char(self):
        # make sure bodies with null char ('\x00') don't raise a TypeError exception
        self.body_content = 'test problematic \x00 body'
        response = TextResponse('http://example.com/catalog/product/blabla-123',
            headers={'Content-Type': 'text/plain; charset=utf-8'}, body=self.body_content)
        LxmlDocument(response)
26 scrapy/tests/test_selector_lxmldocument.py Normal file
@ -0,0 +1,26 @@
import unittest
from scrapy.selector.lxmldocument import LxmlDocument
from scrapy.http import TextResponse, HtmlResponse


class LxmlDocumentTest(unittest.TestCase):

    def test_caching(self):
        r1 = HtmlResponse('http://www.example.com', body='<html><head></head><body></body></html>')
        r2 = r1.copy()

        doc1 = LxmlDocument(r1)
        doc2 = LxmlDocument(r1)
        doc3 = LxmlDocument(r2)

        # make sure it's cached
        assert doc1 is doc2
        assert doc1 is not doc3

    def test_null_char(self):
        # make sure bodies with null char ('\x00') don't raise a TypeError exception
        body = 'test problematic \x00 body'
        response = TextResponse('http://example.com/catalog/product/blabla-123',
                                headers={'Content-Type': 'text/plain; charset=utf-8'},
                                body=body)
        LxmlDocument(response)
@ -70,10 +70,10 @@ class XMLFeedSpiderTest(BaseSpiderTest):

    def parse_node(self, response, selector):
        yield {
            'loc': selector.select('a:loc/text()').extract(),
            'updated': selector.select('b:updated/text()').extract(),
            'other': selector.select('other/@value').extract(),
            'custom': selector.select('other/@b:custom').extract(),
            'loc': selector.xpath('a:loc/text()').extract(),
            'updated': selector.xpath('b:updated/text()').extract(),
            'other': selector.xpath('other/@value').extract(),
            'custom': selector.xpath('other/@b:custom').extract(),
        }

    for iterator in ('iternodes', 'xml'):
@ -28,7 +28,7 @@ class XmliterTestCase(unittest.TestCase):
        response = XmlResponse(url="http://example.com", body=body)
        attrs = []
        for x in self.xmliter(response, 'product'):
            attrs.append((x.select("@id").extract(), x.select("name/text()").extract(), x.select("./type/text()").extract()))
            attrs.append((x.xpath("@id").extract(), x.xpath("name/text()").extract(), x.xpath("./type/text()").extract()))

        self.assertEqual(attrs,
                         [(['001'], ['Name 1'], ['Type 1']), (['002'], ['Name 2'], ['Type 2'])])
@ -36,7 +36,7 @@ class XmliterTestCase(unittest.TestCase):
    def test_xmliter_text(self):
        body = u"""<?xml version="1.0" encoding="UTF-8"?><products><product>one</product><product>two</product></products>"""

        self.assertEqual([x.select("text()").extract() for x in self.xmliter(body, 'product')],
        self.assertEqual([x.xpath("text()").extract() for x in self.xmliter(body, 'product')],
                         [[u'one'], [u'two']])

    def test_xmliter_namespaces(self):
@ -63,15 +63,15 @@ class XmliterTestCase(unittest.TestCase):

        node = my_iter.next()
        node.register_namespace('g', 'http://base.google.com/ns/1.0')
        self.assertEqual(node.select('title/text()').extract(), ['Item 1'])
        self.assertEqual(node.select('description/text()').extract(), ['This is item 1'])
        self.assertEqual(node.select('link/text()').extract(), ['http://www.mydummycompany.com/items/1'])
        self.assertEqual(node.select('g:image_link/text()').extract(), ['http://www.mydummycompany.com/images/item1.jpg'])
        self.assertEqual(node.select('g:id/text()').extract(), ['ITEM_1'])
        self.assertEqual(node.select('g:price/text()').extract(), ['400'])
        self.assertEqual(node.select('image_link/text()').extract(), [])
        self.assertEqual(node.select('id/text()').extract(), [])
        self.assertEqual(node.select('price/text()').extract(), [])
        self.assertEqual(node.xpath('title/text()').extract(), ['Item 1'])
        self.assertEqual(node.xpath('description/text()').extract(), ['This is item 1'])
        self.assertEqual(node.xpath('link/text()').extract(), ['http://www.mydummycompany.com/items/1'])
        self.assertEqual(node.xpath('g:image_link/text()').extract(), ['http://www.mydummycompany.com/images/item1.jpg'])
        self.assertEqual(node.xpath('g:id/text()').extract(), ['ITEM_1'])
        self.assertEqual(node.xpath('g:price/text()').extract(), ['400'])
        self.assertEqual(node.xpath('image_link/text()').extract(), [])
        self.assertEqual(node.xpath('id/text()').extract(), [])
        self.assertEqual(node.xpath('price/text()').extract(), [])

    def test_xmliter_exception(self):
        body = u"""<?xml version="1.0" encoding="UTF-8"?><products><product>one</product><product>two</product></products>"""
@ -123,9 +123,9 @@ class LxmlXmliterTestCase(XmliterTestCase):

        namespace_iter = self.xmliter(response, 'image_link', 'http://base.google.com/ns/1.0')
        node = namespace_iter.next()
        self.assertEqual(node.select('text()').extract(), ['http://www.mydummycompany.com/images/item1.jpg'])
        self.assertEqual(node.xpath('text()').extract(), ['http://www.mydummycompany.com/images/item1.jpg'])
        node = namespace_iter.next()
        self.assertEqual(node.select('text()').extract(), ['http://www.mydummycompany.com/images/item2.jpg'])
        self.assertEqual(node.xpath('text()').extract(), ['http://www.mydummycompany.com/images/item2.jpg'])


class UtilsCsvTestCase(unittest.TestCase):
@ -2,14 +2,14 @@ import re, csv
from cStringIO import StringIO

from scrapy.http import TextResponse
from scrapy.selector import XmlXPathSelector
from scrapy.selector import Selector
from scrapy import log
from scrapy.utils.python import re_rsearch, str_to_unicode
from scrapy.utils.response import body_or_str


def xmliter(obj, nodename):
    """Return a iterator of XPathSelector's over all nodes of a XML document,
    """Return a iterator of Selector's over all nodes of a XML document,
       given tha name of the node to iterate. Useful for parsing XML feeds.

    obj can be:
@ -29,7 +29,7 @@ def xmliter(obj, nodename):
    r = re.compile(r"<%s[\s>].*?</%s>" % (nodename, nodename), re.DOTALL)
    for match in r.finditer(text):
        nodetext = header_start + match.group() + header_end
        yield XmlXPathSelector(text=nodetext).select('//' + nodename)[0]
        yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]


def csviter(obj, delimiter=None, headers=None, encoding=None):
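Since xmliter now yields Selector objects, callers query each node with xpath() rather than select(). A minimal usage sketch — the feed body and node name below are invented for illustration, not part of this changeset::

    from scrapy.http import XmlResponse
    from scrapy.utils.iterators import xmliter

    # Hypothetical feed; xmliter also accepts a plain string body.
    body = ('<?xml version="1.0" encoding="UTF-8"?>'
            '<products>'
            '<product id="001">Name 1</product>'
            '<product id="002">Name 2</product>'
            '</products>')
    response = XmlResponse(url='http://example.com/feed.xml', body=body)

    # Each yielded node is a Selector scoped to one <product> element.
    attrs = [(node.xpath('@id').extract(), node.xpath('text()').extract())
             for node in xmliter(response, 'product')]
    # attrs == [([u'001'], [u'Name 1']), ([u'002'], [u'Name 2'])]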
@ -6,30 +6,6 @@ import os

from twisted.trial.unittest import SkipTest

def libxml2debug(testfunction):
    """Decorator for debugging libxml2 memory leaks inside a function.

    We've found libxml2 memory leaks are something very weird, and can happen
    sometimes depending on the order where tests are run. So this decorator
    enables libxml2 memory leaks debugging only when the environment variable
    LIBXML2_DEBUGLEAKS is set.

    """
    try:
        import libxml2
    except ImportError:
        return testfunction
    def newfunc(*args, **kwargs):
        libxml2.debugMemory(1)
        testfunction(*args, **kwargs)
        libxml2.cleanupParser()
        leaked_bytes = libxml2.debugMemory(0)
        assert leaked_bytes == 0, "libxml2 memory leak detected: %d bytes" % leaked_bytes

    if 'LIBXML2_DEBUGLEAKS' in os.environ:
        return newfunc
    else:
        return testfunction

def assert_aws_environ():
    """Asserts the current environment is suitable for running AWS testsi.
@ -91,9 +91,9 @@ This is a working code sample that covers just the basics.
        """ Pull the text label out of selected markup

        :param entity: Found markup
        :type entity: HtmlXPathSelector
        :type entity: Selector
        """
        label = ' '.join(entity.select('.//text()').extract())
        label = ' '.join(entity.xpath('.//text()').extract())
        label = label.encode('ascii', 'xmlcharrefreplace') if label else ''
        label = label.strip(' ') if ' ' in label else label
        label = label.strip(':') if ':' in label else label
@ -108,7 +108,7 @@ This is a working code sample that covers just the basics.
        :return: The list of selectors
        :rtype: list
        """
        return self.selector.select(self.base_xpath + xpath)
        return self.selector.xpath(self.base_xpath + xpath)

    def parse_dl(self, xpath=u'//dl'):
        """ Look for the specified definition list pattern and store all found
@ -120,7 +120,7 @@ This is a working code sample that covers just the basics.
        for term in self._get_entities(xpath + '/dt'):
            label = self._get_label(term)
            if label and label not in self.ignore:
                value = term.select('following-sibling::dd[1]//text()')
                value = term.xpath('following-sibling::dd[1]//text()')
                if value:
                    self.add_value(label, value.extract(),
                                   MapCompose(lambda v: v.strip()))
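The pattern used in this sample — keeping a base selector and running relative XPath expressions against each matched node — can be sketched standalone; the markup and names below are illustrative only and do not come from the sample itself::

    from scrapy.selector import Selector

    # Hypothetical definition list, similar in spirit to what parse_dl() consumes.
    sel = Selector(text='<dl><dt>Price:</dt><dd>10 EUR</dd>'
                        '<dt>Size:</dt><dd>XL</dd></dl>')

    fields = {}
    for term in sel.xpath('//dl/dt'):
        label = ' '.join(term.xpath('.//text()').extract()).strip(': ')
        # Relative expressions are evaluated against the current <dt> node,
        # not against the whole document.
        fields[label] = term.xpath('following-sibling::dd[1]//text()').extract()
    # fields == {u'Price': [u'10 EUR'], u'Size': [u'XL']}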
2 setup.py
@ -122,6 +122,6 @@ try:
except ImportError:
    from distutils.core import setup
else:
    setup_args['install_requires'] = ['Twisted>=10.0.0', 'w3lib>=1.2', 'queuelib', 'lxml', 'pyOpenSSL']
    setup_args['install_requires'] = ['Twisted>=10.0.0', 'w3lib>=1.2', 'queuelib', 'lxml', 'pyOpenSSL', 'cssselect>0.8']

setup(**setup_args)