mirror of https://github.com/scrapy/scrapy.git synced 2025-02-23 21:04:20 +00:00

improve documentation about removing namespaces

This commit is contained in:
Pablo Hoffman 2013-01-18 12:35:30 -02:00
parent 1ba04b1fc3
commit 6ab8afb992
3 changed files with 34 additions and 6 deletions


@@ -268,6 +268,11 @@ section of the site (which varies each time). In that case, the credentials to
log in would be settings, while the url of the section to scrape would be a
spider argument.
I'm scraping an XML document and my XPath selector doesn't return any items
---------------------------------------------------------------------------
You may need to remove namespaces. See :ref:`removing-namespaces`.
.. _user agents: http://en.wikipedia.org/wiki/User_agent
.. _LIFO: http://en.wikipedia.org/wiki/LIFO
.. _DFO order: http://en.wikipedia.org/wiki/Depth-first_search


@@ -6,7 +6,7 @@ Release notes
0.18 (unreleased)
-----------------
- added :meth:`XPathSelector.remove_namespaces`, which allows removing all namespaces from XML documents for convenience (to work with namespace-less XPaths). Documented in :ref:`topics-selectors`.
- several improvements to spider contracts
- New default middleware named MetaRefreshMiddleware that handles meta-refresh html tag redirections;
  MetaRefreshMiddleware and RedirectMiddleware have different priorities to address #62


@@ -382,24 +382,47 @@ instantiated with a :class:`~scrapy.http.Response` object like this::
x.register_namespace("g", "http://base.google.com/ns/1.0")
x.select("//g:price").extract()
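For reference, the same idea — mapping a prefix to a namespace URI so it can be used in a path expression — can be sketched with the standard library's ``xml.etree.ElementTree`` (this is a stdlib analogue for illustration, not Scrapy's selector API; the sample document below is made up)::

```python
import xml.etree.ElementTree as ET

# A made-up fragment using the Google Base namespace from the example above.
doc = """<rss xmlns:g="http://base.google.com/ns/1.0">
  <item><g:price>10.00</g:price></item>
</rss>"""

root = ET.fromstring(doc)

# Map the "g" prefix to the namespace URI, then use it in the search path,
# much like register_namespace() makes "g:price" resolvable in the XPath.
namespaces = {"g": "http://base.google.com/ns/1.0"}
prices = [el.text for el in root.findall(".//g:price", namespaces)]
# prices == ["10.00"]
```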
.. _removing-namespaces:
Removing namespaces
~~~~~~~~~~~~~~~~~~~
When dealing with scraping projects, it is often quite convenient to get rid of
namespaces altogether and just work with element names, to write simpler and
more convenient XPaths. You can use the
:meth:`XPathSelector.remove_namespaces` method for that.
Let's show an example that illustrates this with the Github blog atom feed.

First, we open the shell with the url we want to scrape::
$ scrapy shell https://github.com/blog.atom
# ...
Once in the shell we can try selecting all ``<link>`` objects and see that it
doesn't work (because the Atom XML namespace is obfuscating those nodes)::
>>> xxs.select("//link")
[]
But once we call the :meth:`XPathSelector.remove_namespaces` method, all
nodes can be accessed directly by their names::
>>> xxs.remove_namespaces()
>>> xxs.select("//link")
[<XmlXPathSelector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
<XmlXPathSelector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
...
]
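Conceptually, a namespace-removal pass rewrites every namespaced tag to its bare local name. Here's a minimal sketch of that idea using the standard library's ``ElementTree`` — an illustration only, not Scrapy's actual ``remove_namespaces`` implementation::

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Rewrite every tag like '{http://...}link' to the plain 'link'."""
    for el in root.iter():  # visits every node in the document
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]
    return root

# A tiny Atom-like fragment; the default namespace hides <link> from
# plain-name searches, just like in the shell session above.
feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="https://github.com/blog"/>
</feed>"""

root = ET.fromstring(feed)
before = root.findall(".//link")   # [] -- the tag is '{...Atom}link'
strip_namespaces(root)
links = root.findall(".//link")    # now the plain name matches
```

Note that the helper has to iterate over every node in the document, which hints at why this isn't done automatically (see below).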
You may wonder why the namespace removal procedure isn't always called
automatically, instead of having to be invoked manually. This is because of
two reasons which, in order of relevance, are:
1. removing namespaces requires iterating over and modifying all nodes in the
   document, which is a reasonably expensive operation to perform for all
   documents crawled by Scrapy
2. there could be some cases where using namespaces is actually required, when
   some element names clash between namespaces. These cases are very rare
   though.
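The second point can be illustrated with a small hypothetical document (stdlib ``ElementTree`` again, with made-up namespace URIs): two distinct ``<id>`` elements from different namespaces become indistinguishable by name alone once namespaces are stripped::

```python
import xml.etree.ElementTree as ET

# Hypothetical document where two namespaces both define an <id> element.
doc = """<root xmlns:a="http://example.com/a" xmlns:b="http://example.com/b">
  <a:id>article-1</a:id>
  <b:id>author-7</b:id>
</root>"""

root = ET.fromstring(doc)

# With namespaces intact, each <id> can still be told apart:
a_ids = root.findall("{http://example.com/a}id")  # text: "article-1"
b_ids = root.findall("{http://example.com/b}id")  # text: "author-7"

# After stripping namespaces, both collapse to the same plain name:
for el in root.iter():
    if el.tag.startswith("{"):
        el.tag = el.tag.split("}", 1)[1]
ids = root.findall("id")  # two elements, no way to tell which was which
```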
.. _Google Base XML feed: http://base.google.com/support/bin/answer.py?hl=en&answer=59461