``<title>`` element.
* ``//td``: selects all the ``<td>`` elements
* ``//div[@class="mine"]``: selects all ``div`` elements which contain an
attribute ``class="mine"``
These are just a couple of simple examples of what you can do with XPath, but
XPath expressions are indeed much more powerful. To learn more about XPath we
recommend `this XPath tutorial `_.
For working with XPaths, Scrapy provides the :class:`~scrapy.selector.Selector`
class and convenient shortcuts, so you don't have to instantiate selectors yourself
every time you need to select something from a response.
You can see selectors as objects that represent nodes in the document
structure. So, the first instantiated selectors are associated with the root
node, or the entire document.
Selectors have four basic methods (click on the method to see the complete API
documentation):
* :meth:`~scrapy.selector.Selector.xpath`: returns a list of selectors, each of
them representing the nodes selected by the XPath expression given as
argument.
* :meth:`~scrapy.selector.Selector.css`: returns a list of selectors, each of
them representing the nodes selected by the CSS expression given as argument.
* :meth:`~scrapy.selector.Selector.extract`: returns a unicode string with the
selected data.
* :meth:`~scrapy.selector.Selector.re`: returns a list of unicode strings
extracted by applying the regular expression given as argument.
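As a quick illustration of these methods, here is a minimal sketch you can run in a
plain Python session (the tiny HTML snippet is made up just for this example)::

    from scrapy.selector import Selector

    body = '<html><body><span>good</span></body></html>'
    sel = Selector(text=body)

    sel.xpath('//span')                       # list of selectors for the <span> nodes
    sel.css('span')                           # the same nodes, selected with CSS
    sel.xpath('//span/text()').extract()      # [u'good']
    sel.xpath('//span/text()').re(r'(g\w+)')  # [u'good']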
Trying Selectors in the Shell
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To illustrate the use of Selectors we're going to use the built-in :ref:`Scrapy
shell <topics-shell>`, which also requires IPython (an extended Python console)
installed on your system.
To start a shell, you must go to the project's top level directory and run::
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
.. note::
    Remember to always enclose URLs in quotes when running the Scrapy shell from
    the command line; otherwise URLs containing arguments (i.e. the ``&`` character)
    will not work.
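For instance, quoting a hypothetical URL with query arguments (the URL itself is
made up) would look like this::

    scrapy shell "http://www.example.com/some/page?arg1=value1&arg2=value2"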
This is what the shell looks like::
    [ ... Scrapy log here ... ]

    2014-01-23 17:11:42-0400 [default] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x...>
    [s]   item       {}
    [s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    [s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    [s]   settings   <CrawlerSettings module=<module 'tutorial.settings' ...>>
    [s]   spider     <Spider 'default' at 0x...>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser

    In [1]:
After the shell loads, you will have the response fetched in a local
``response`` variable, so if you type ``response.body`` you will see the body
of the response, or you can type ``response.headers`` to see its headers.
More importantly, if you type ``response.selector`` you will access a selector
object you can use to query the response, along with convenient shortcuts like
``response.xpath()`` and ``response.css()`` which map to
``response.selector.xpath()`` and ``response.selector.css()``.
So let's try it::
    In [1]: response.xpath('//title')
    Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

    In [2]: response.xpath('//title').extract()
    Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

    In [3]: response.xpath('//title/text()')
    Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

    In [4]: response.xpath('//title/text()').extract()
    Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

    In [5]: response.xpath('//title/text()').re('(\w+):')
    Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
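The ``.css()`` shortcut works the same way, and ``.css()`` and ``.xpath()`` calls
can be chained. As a rough sketch (output omitted), the CSS equivalents of the
queries above would be::

    response.css('title')                                 # selectors for the <title> element
    response.css('title').xpath('text()').extract()       # the title text, chaining CSS and XPath
    response.css('title').xpath('text()').re(r'(\w+):')   # the same regular expression extraction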
Extracting the data
^^^^^^^^^^^^^^^^^^^
Now, let's try to extract some real information from those pages.
You could type ``response.body`` in the console and inspect the source code to
figure out the XPaths you need to use. However, inspecting the raw HTML there
quickly becomes tedious. To make this easier, you can use Firefox extensions
like Firebug. For more information see :ref:`topics-firebug` and
:ref:`topics-firefox`.
After inspecting the page source, you'll find that the web sites' information
is inside a ``<ul>`` element, in fact the *second* ``<ul>`` element.
So we can select each ``<li>`` element belonging to the sites list with this
code::
    response.xpath('//ul/li')

And from them, the sites' descriptions::

    response.xpath('//ul/li/text()').extract()

The sites' titles::

    response.xpath('//ul/li/a/text()').extract()

And the sites' links::

    response.xpath('//ul/li/a/@href').extract()
As we've said before, each ``.xpath()`` call returns a list of selectors, so we can
chain further ``.xpath()`` calls to dig deeper into a node. We are going to use that
property here::
    for sel in response.xpath('//ul/li'):
        title = sel.xpath('a/text()').extract()
        link = sel.xpath('a/@href').extract()
        desc = sel.xpath('text()').extract()
        print title, link, desc
.. note::
    For a more detailed description of using nested selectors, see
    :ref:`topics-selectors-nesting-selectors` and
    :ref:`topics-selectors-relative-xpaths` in the :ref:`topics-selectors`
    documentation.
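As an aside, one pitfall those sections describe is worth keeping in mind here:
XPaths starting with ``//`` are absolute and search the whole document, even when
called on a nested selector. A quick sketch of the difference::

    for sel in response.xpath('//ul/li'):
        sel.xpath('a/text()')      # relative: only <a> children of this <li>
        sel.xpath('.//a/text()')   # also relative: any <a> inside this <li>
        sel.xpath('//a/text()')    # absolute: every <a> in the whole document again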
Let's add the extraction code shown above to our spider::
    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]

        def parse(self, response):
            for sel in response.xpath('//ul/li'):
                title = sel.xpath('a/text()').extract()
                link = sel.xpath('a/@href').extract()
                desc = sel.xpath('text()').extract()
                print title, link, desc
Now try crawling dmoz.org again and you'll see the sites being printed
in your output. Run::
scrapy crawl dmoz
Using our item
--------------
:class:`~scrapy.item.Item` objects are custom Python dicts; you can access the
values of their fields (attributes of the class we defined earlier) using the
standard dict syntax, like::
>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'
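A few other dict-like operations work as well; a brief sketch, using the same
``DmozItem`` we defined earlier::

    >>> item = DmozItem(title='Example title')  # fields can also be set at construction time
    >>> item.get('link', 'not set')             # dict-style access with a default value
    'not set'
    >>> dict(item)                              # copy the item's fields into a plain dict
    {'title': 'Example title'}
    >>> item['other'] = 'test'                  # only declared fields can be set
    Traceback (most recent call last):
        ...
    KeyError: 'DmozItem does not support field: other'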
Spiders are expected to return their scraped data inside
:class:`~scrapy.item.Item` objects. So, in order to return the data we've
scraped so far, the final code for our Spider would be like this::
    import scrapy

    from tutorial.items import DmozItem

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]

        def parse(self, response):
            for sel in response.xpath('//ul/li'):
                item = DmozItem()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['desc'] = sel.xpath('text()').extract()
                yield item
.. note:: You can find a fully-functional variant of this spider in the dirbot_
project available at https://github.com/scrapy/dirbot
Now doing a crawl on the dmoz.org domain yields ``DmozItem`` objects::
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
'link': [u'http://gnosis.cx/TPiP/'],
'title': [u'Text Processing in Python']}
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
'title': [u'XML Processing with Python']}
Storing the scraped data
========================
The simplest way to store the scraped data is by using the :ref:`Feed exports
<topics-feed-exports>`, with the following command::
scrapy crawl dmoz -o items.json -t json
That will generate an ``items.json`` file containing all scraped items,
serialized in `JSON`_.
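For the items scraped above, the generated file would look roughly like this (just
a sketch: the exact formatting depends on the exporter, and the long descriptions
are abbreviated here with ``...``)::

    [{"desc": [" - By David Mertz; Addison Wesley. Book in progress, full text, ..."],
      "link": ["http://gnosis.cx/TPiP/"],
      "title": ["Text Processing in Python"]},
     {"desc": [" - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, ..."],
      "link": ["http://www.informit.com/store/product.aspx?isbn=0130211192"],
      "title": ["XML Processing with Python"]}]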
In small projects (like the one in this tutorial), that should be enough.
However, if you want to perform more complex things with the scraped items, you
can write an :ref:`Item Pipeline <topics-item-pipeline>`. As with Items, a
placeholder file for Item Pipelines was set up for you when the project was
created, in ``tutorial/pipelines.py``, though you don't need to implement any item
pipelines if you just want to store the scraped items.
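If you do end up writing one, here is a minimal sketch of what an item pipeline
looks like (a hypothetical pipeline that drops items without a title; the class
name is made up, and it would also have to be enabled through the
``ITEM_PIPELINES`` setting)::

    from scrapy.exceptions import DropItem

    class RequireTitlePipeline(object):

        def process_item(self, item, spider):
            # Called for every item scraped by every spider; return the item
            # to keep it, or raise DropItem to discard it.
            if not item.get('title'):
                raise DropItem("Missing title in %s" % item)
            return item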
Next steps
==========
This tutorial covers only the basics of Scrapy, but there are a lot of other
features not mentioned here. Check the :ref:`topics-whatelse` section in the
:ref:`intro-overview` chapter for a quick overview of the most important ones.

Then, we recommend you continue by playing with an example project (see
:ref:`intro-examples`) and then move on to the :ref:`section-basics` section.
.. _JSON: http://en.wikipedia.org/wiki/JSON
.. _dirbot: https://github.com/scrapy/dirbot