mirror of https://github.com/scrapy/scrapy.git synced 2025-02-23 21:04:20 +00:00

improve documentation about removing namespaces

This commit is contained in:
Pablo Hoffman 2013-01-18 12:35:30 -02:00
parent 1ba04b1fc3
commit 6ab8afb992
3 changed files with 34 additions and 6 deletions


@@ -268,6 +268,11 @@ section of the site (which varies each time). In that case, the credentials to
log in would be settings, while the url of the section to scrape would be a
spider argument.
I'm scraping an XML document and my XPath selector doesn't return any items
---------------------------------------------------------------------------
You may need to remove namespaces. See :ref:`removing-namespaces`.
.. _user agents: http://en.wikipedia.org/wiki/User_agent
.. _LIFO: http://en.wikipedia.org/wiki/LIFO
.. _DFO order: http://en.wikipedia.org/wiki/Depth-first_search


@@ -6,7 +6,7 @@ Release notes
0.18 (unreleased)
-----------------
- added :meth:`XPathSelector.remove_namespaces`, which allows removing all namespaces from XML documents for convenience (to work with namespace-less XPaths). Documented in :ref:`topics-selectors`.
- several improvements to spider contracts
- New default middleware named MetaRefreshMiddleware that handles meta-refresh html tag redirections;
  MetaRefreshMiddleware and RedirectMiddleware have different priorities to address #62


@@ -382,24 +382,47 @@ instantiated with a :class:`~scrapy.http.Response` object like this::
x.register_namespace("g", "http://base.google.com/ns/1.0")
x.select("//g:price").extract()
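For reference, the same idea — mapping a prefix to a namespace URI so it can be used in a path expression — can be sketched with the standard library's ``xml.etree.ElementTree`` (this is a stdlib analogue for illustration, not Scrapy's selector API; the sample document below is made up)::

```python
import xml.etree.ElementTree as ET

# A made-up fragment using the Google Base namespace from the example above.
doc = """<rss xmlns:g="http://base.google.com/ns/1.0">
  <item><g:price>10.00</g:price></item>
</rss>"""

root = ET.fromstring(doc)

# Map the "g" prefix to the namespace URI, then use it in the search path,
# much like register_namespace() makes "g:price" resolvable in the XPath.
namespaces = {"g": "http://base.google.com/ns/1.0"}
prices = [el.text for el in root.findall(".//g:price", namespaces)]
# prices == ["10.00"]
```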
.. _removing-namespaces:
Removing namespaces
~~~~~~~~~~~~~~~~~~~
When dealing with scraping projects, it is often quite convenient to get rid of
namespaces altogether and just work with element names, to write simpler and
more convenient XPaths. You can use the
:meth:`XPathSelector.remove_namespaces` method for that.
Let's show an example that illustrates this with the Github blog atom feed.

First, we open the shell with the url we want to scrape::
$ scrapy shell https://github.com/blog.atom
# ...
Once in the shell we can try selecting all ``<link>`` objects and see that it
doesn't work (because the Atom XML namespace is obfuscating those nodes)::
>>> xxs.select("//link")
[]
But once we call the :meth:`XPathSelector.remove_namespaces` method, all
nodes can be accessed directly by their names::
>>> xxs.remove_namespaces()
>>> xxs.select("//link")
[<XmlXPathSelector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
<XmlXPathSelector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
...
]
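Conceptually, a namespace-removal pass rewrites every namespaced tag to its bare local name. Here's a minimal sketch of that idea using the standard library's ``ElementTree`` — an illustration only, not Scrapy's actual ``remove_namespaces`` implementation::

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Rewrite every tag like '{http://...}link' to the plain 'link'."""
    for el in root.iter():  # visits every node in the document
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]
    return root

# A tiny Atom-like fragment; the default namespace hides <link> from
# plain-name searches, just like in the shell session above.
feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="https://github.com/blog"/>
</feed>"""

root = ET.fromstring(feed)
before = root.findall(".//link")   # [] -- the tag is '{...Atom}link'
strip_namespaces(root)
links = root.findall(".//link")    # now the plain name matches
```

Note that the helper has to iterate over every node in the document, which hints at why this isn't done automatically (see below).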
You may wonder why the namespace removal procedure isn't always called
automatically, instead of having to be invoked manually. This is because of
two reasons which, in order of relevance, are:
1. removing namespaces requires iterating over and modifying all nodes in the
   document, which is a reasonably expensive operation to perform for all
   documents crawled by Scrapy
2. there could be some cases where using namespaces is actually required, when
   some element names clash between namespaces. These cases are very rare
   though.
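The second point can be illustrated with a small hypothetical document (stdlib ``ElementTree`` again, with made-up namespace URIs): two distinct ``<id>`` elements from different namespaces become indistinguishable by name alone once namespaces are stripped::

```python
import xml.etree.ElementTree as ET

# Hypothetical document where two namespaces both define an <id> element.
doc = """<root xmlns:a="http://example.com/a" xmlns:b="http://example.com/b">
  <a:id>article-1</a:id>
  <b:id>author-7</b:id>
</root>"""

root = ET.fromstring(doc)

# With namespaces intact, each <id> can still be told apart:
a_ids = root.findall("{http://example.com/a}id")  # text: "article-1"
b_ids = root.findall("{http://example.com/b}id")  # text: "author-7"

# After stripping namespaces, both collapse to the same plain name:
for el in root.iter():
    if el.tag.startswith("{"):
        el.tag = el.tag.split("}", 1)[1]
ids = root.findall("id")  # two elements, no way to tell which was which
```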
.. _Google Base XML feed: http://base.google.com/support/bin/answer.py?hl=en&answer=59461