mirror of https://github.com/scrapy/scrapy.git synced 2025-02-24 18:24:00 +00:00

some improvements to selectors doc structure: added literalincludes for sample1.html (to avoid duplicating the content), renamed that file and moved it to _static (so it appears in the built docs), moved comments out of source code snippets and into the documentation text and split them, converted to the required '>>>' console format, and added proper highlighting hints

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40646
Pablo Hoffman 2009-01-05 02:49:23 +00:00
parent 7ca989a781
commit 5c9c82d055
2 changed files with 44 additions and 47 deletions


@@ -4,16 +4,20 @@
 Selectors
 =========
-Selectors are *the* way you have to extract information from documents. They retrieve information from the response's body, given an XPath or a Regular Expression that you provide.
+Selectors are the recommended way to extract information from documents. They retrieve information from the response's body, given an XPath or a Regular Expression that you provide.
+.. highlight:: python
 Currently there are two kinds of selectors, HtmlXPathSelectors and XmlXPathSelectors. Both work in the same way; they are first instantiated with a response, for example::
     hxs = HtmlXPathSelector(response) # an HTML selector
     xxs = XmlXPathSelector(response) # an XML selector
+.. highlight:: sh
 Now, before going on with selectors, I'd suggest you open a Scrapy shell, which you can use by calling your project manager with the 'shell' argument; something like::
-[user@host ~/myproject]$ ./scrapy-ctl.py shell <url>
+$ ./scrapy-ctl.py shell <url>
 Notice that you'll have to install IPython in order to use this feature, but believe me, it's worth it; the shell is **very** useful.
@@ -21,30 +25,22 @@ With the shell you can simulate parsing a webpage, either by calling "scrapy-ctl
 to retrieve the given url, and fills in the 'response' variable with the result.
 Ok, so now let's use the shell to show you a bit of how selectors work.
-We'll use an example page located in Scrapy's site (http://www.scrapy.org/docs/topics/sample1.htm), whose markup is::
-    <html>
-     <head>
-      <base href='http://example.com/' />
-      <title>Example website</title>
-     </head>
-     <body>
-      <div id='images'>
-       <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
-       <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
-       <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
-       <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
-       <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
-      </div>
-     </body>
-    </html>
+We'll use an example page located in Scrapy's site (here's a `direct link <../_static/selectors-sample1.html>`_ if you want to download it), whose markup is:
+.. literalinclude:: ../_static/selectors-sample1.html
+   :language: html
+.. highlight:: sh
 First, we open the shell::
-[user@host ~/myproject]$ ./scrapy-ctl.py shell 'http://www.scrapy.org/docs/topics/sample1.htm'
+$ ./scrapy-ctl.py shell 'http://www.scrapy.org/docs/topics/sample1.htm'
 Then, after the shell loads, you'll have some already-made objects for you to play with. Two of them, hxs and xxs, are selectors.
+.. highlight:: python
 You could instantiate your own by doing::
     from scrapy.xpath.selector import HtmlXPathSelector, XmlXPathSelector
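As an aside, the sample page happens to be well-formed XML, so if you want to poke at the markup without Scrapy or the shell, Python's standard library can parse it. A minimal sketch (the inline ``SAMPLE`` string is a trimmed, hand-copied version of the page, and ``xml.etree`` is a stand-in here, not Scrapy's own parsing machinery):

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-copied version of the sample page (assumed markup).
SAMPLE = """<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
  </div>
 </body>
</html>"""

root = ET.fromstring(SAMPLE)

# Rough equivalents of the XPath expressions //title/text() and //base/@href
title = root.find('head/title').text
base_url = root.find('head/base').get('href')
print(title)     # Example website
print(base_url)  # http://example.com/
```

Note that ElementTree only supports a small subset of XPath, so this only works for simple path queries like the ones shown here.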
@@ -55,41 +51,40 @@ Where 'response' is the object that Scrapy already created for you containing th
 But anyway, we'll stick to the selectors that Scrapy already made for us, and more specifically, the HtmlXPathSelector (since we're working with an HTML document right now).
-Let's try some expressions::
+Let's try extracting the title::
-    # The title
-    In [1]: hxs.x('//title/text()')
-    Out[1]: [<HtmlXPathSelector (text) xpath=//title/text()>]
-    # As you can see, the x method returns an XPathSelectorList, which is actually a list of selectors.
-    # To extract their data you must use the extract() method, as follows:
-    In [2]: hxs.x('//title/text()').extract()
-    Out[2]: [u'Example website']
+    >>> hxs.x('//title/text()')
+    [<HtmlXPathSelector (text) xpath=//title/text()>]
-    # The base url
-    In [3]: hxs.x('//base/@href').extract()
-    Out[3]: [u'http://example.com/']
+As you can see, the x method returns an XPathSelectorList, which is actually a list of selectors.
+To extract their data you must use the extract() method, as follows::
-    # Image links
-    In [4]: hxs.x('//a[contains(@href, "image")]/@href').extract()
-    Out[4]:
+    >>> hxs.x('//title/text()').extract()
+    [u'Example website']
+Now let's extract the base URL and some image links::
+    >>> hxs.x('//base/@href').extract()
+    [u'http://example.com/']
+    >>> hxs.x('//a[contains(@href, "image")]/@href').extract()
     [u'image1.html',
      u'image2.html',
      u'image3.html',
      u'image4.html',
      u'image5.html']
-    # Image thumbnails
-    In [5]: hxs.x('//a[contains(@href, "image")]/img/@src').extract()
-    Out[5]:
+    >>> hxs.x('//a[contains(@href, "image")]/img/@src').extract()
     [u'image1_thumb.jpg',
      u'image2_thumb.jpg',
      u'image3_thumb.jpg',
      u'image4_thumb.jpg',
      u'image5_thumb.jpg']
-    # Image names
-    In [6]: hxs.x('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
-    Out[6]:
+And here's an example which shows the `re()` method of XPath selectors, which
+allows you to use regular expressions to select parts::
+    >>> hxs.x('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
     [u'My image 1',
      u'My image 2',
      u'My image 3',
@@ -97,22 +92,24 @@ Let's try some expressions::
      u'My image 5']
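What the `re()` call above does can be emulated with plain Python's standard re module, which may help clarify its behavior. A sketch (the `texts` list is a hypothetical stand-in for the text nodes the selector would return):

```python
import re

# Hand-copied text nodes, standing in for the //a[...]/text() selection.
texts = ['Name: My image 1', 'Name: My image 2', 'something else']

# The pattern is applied to each selected text and the captured group is
# kept; texts that do not match are silently skipped.
pattern = re.compile(r'Name:\s*(.*)')
names = [m.group(1) for m in (pattern.search(t) for t in texts) if m]
print(names)  # ['My image 1', 'My image 2']
```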
-Ok, let's explain a bit.
-Selector's x() method, is intended to select a node or an attribute from the document, given an XPath expression, as you could see upwards.
-You can apply an x() call to any node you have, which means that you can join different calls, for example:::
+Now let's explain a bit what we just did.
-    In [10]: links = hxs.x('//a[contains(@href, "image")]')
+The selector's x() method is intended to select a node or an attribute from the
+document, given an XPath expression, as you saw above.
-    In [11]: links.extract()
-    Out[11]:
+You can apply an x() call to any node you have, which means that you can join
+different calls, for example::
+    >>> links = hxs.x('//a[contains(@href, "image")]')
+    >>> links.extract()
     [u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
      u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
      u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
      u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
      u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
-    In [12]: for index, link in enumerate(links):
-        print 'Link number %d points to url %s and image %s' % (index, link.x('@href').extract(), link.x('img/@src').extract())
+    >>> for index, link in enumerate(links):
+    ...     print 'Link number %d points to url %s and image %s' % (index, link.x('@href').extract(), link.x('img/@src').extract())
     Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
     Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
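To make that last loop concrete outside of the shell, here's a standalone emulation with the standard library, pairing each link's href with its thumbnail src (the inline markup is an assumed copy of two links from the sample page, and ElementTree stands in for the selector API):

```python
import xml.etree.ElementTree as ET

# Assumed inline copy of two links from the sample page.
SAMPLE = ("<div id='images'>"
          "<a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>"
          "<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>"
          "</div>")
root = ET.fromstring(SAMPLE)

# Like the shell loop above: each link node is itself queried for its
# href attribute and its child img's src attribute.
lines = []
for index, link in enumerate(root.iter('a')):
    lines.append('Link number %d points to url %s and image %s'
                 % (index, link.get('href'), link.find('img').get('src')))
for line in lines:
    print(line)
# Link number 0 points to url image1.html and image image1_thumb.jpg
# Link number 1 points to url image2.html and image image2_thumb.jpg
```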