`` element.
* ``//td``: selects all the ``<td>`` elements
* ``//div[@class="mine"]``: selects all ``div`` elements which contain an
attribute ``class="mine"``
These are just a couple of simple examples of what you can do with XPath, but
XPath expressions are indeed much more powerful. To learn more about XPath we
recommend `this XPath tutorial `_.
For working with XPaths, Scrapy provides a :class:`~scrapy.selector.XPathSelector`
class, which comes in two flavours, :class:`~scrapy.selector.HtmlXPathSelector`
(for HTML data) and :class:`~scrapy.selector.XmlXPathSelector` (for XML data). In
order to use them you must instantiate the desired class with a
:class:`~scrapy.http.Response` object.
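For example, a minimal sketch (assuming ``response`` is a
:class:`~scrapy.http.Response` object that Scrapy has already fetched for you)::

    from scrapy.selector import HtmlXPathSelector, XmlXPathSelector

    hxs = HtmlXPathSelector(response)  # selector for HTML responses
    xxs = XmlXPathSelector(response)   # selector for XML responses
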
You can see selectors as objects that represent nodes in the document
structure. So, the first instantiated selectors are associated with the root
node, or the entire document.
Selectors have three methods (click on the method to see the complete API
documentation).
* :meth:`~scrapy.selector.XPathSelector.select`: returns a list of selectors, each of
  them representing the nodes selected by the XPath expression given as
  argument.
* :meth:`~scrapy.selector.XPathSelector.extract`: returns a unicode string with
the data selected by the XPath selector.
* :meth:`~scrapy.selector.XPathSelector.re`: returns a list of unicode strings
extracted by applying the regular expression given as argument.
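For instance, with an ``hxs`` selector like the one sketched above, the three
methods could be combined like this (a hypothetical snippet, not tied to any
particular page)::

    links = hxs.select('//a')                       # list of selectors, one per <a> node
    titles = hxs.select('//a/text()').extract()     # list of unicode strings
    years = hxs.select('//a/text()').re('(\d{4})')  # strings captured by the regex
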
Trying Selectors in the Shell
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To illustrate the use of Selectors we're going to use the built-in :ref:`Scrapy
shell `, which also requires IPython (an extended Python console) to be
installed on your system.
To start a shell, you must go to the project's top-level directory and run::
python scrapy-ctl.py shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
This is what the shell looks like::
[-] Log opened.
Welcome to Scrapy shell!
Fetching ...
------------------------------------------------------------------------------
Available Scrapy variables:
xxs:
url: http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
spider:
hxs:
item:
response:
Available commands:
get [url]: Fetch a new URL or re-fetch current Request
shelp: Prints this help.
------------------------------------------------------------------------------
Python 2.6.1 (r261:67515, Dec 7 2008, 08:27:41)
Type "copyright", "credits" or "license" for more information.
IPython 0.9.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object'. ?object also works, ?? prints more.
In [1]:
After the shell loads, you will have the response fetched in a local
``response`` variable, so if you type ``response.body`` you will see the body
of the response, or type ``response.headers`` to see its headers.
The shell also instantiates two selectors with this response: one for HTML (in
the ``hxs`` variable) and one for XML (in the ``xxs`` variable). So let's try
them::
In [1]: hxs.select('/html/head/title')
Out[1]: [<HtmlXPathSelector (title) xpath=/html/head/title>]
In [2]: hxs.select('/html/head/title').extract()
Out[2]: [u'Open Directory - Computers: Programming: Languages: Python: Books']
In [3]: hxs.select('/html/head/title/text()')
Out[3]: [<HtmlXPathSelector (text) xpath=/html/head/title/text()>]
In [4]: hxs.select('/html/head/title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']
In [5]: hxs.select('/html/head/title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
Extracting the data
^^^^^^^^^^^^^^^^^^^
Now, let's try to extract some real information from those pages.
You could type ``response.body`` in the console, and inspect the source code to
figure out the XPaths you need to use. However, inspecting the raw HTML code
there could become a very tedious task. To make this easier, you can use some
Firefox extensions like Firebug. For more information see
:ref:`topics-firebug` and :ref:`topics-firefox`.
After inspecting the page source, you'll find that the web sites' information
is inside a ``<ul>`` element, in fact the *second* ``<ul>`` element.
So we can select each ``<li>`` element belonging to the sites list with this
code::
hxs.select('//ul[2]/li')
And from them, the sites' descriptions::
hxs.select('//ul[2]/li/text()').extract()
The sites' titles::
hxs.select('//ul[2]/li/a/text()').extract()
And the sites' links::
hxs.select('//ul[2]/li/a/@href').extract()
As we said before, each ``select()`` call returns a list of selectors, so we
can chain further ``select()`` calls to dig deeper into a node. We are going to
use that property here::
sites = hxs.select('//ul[2]/li')
for site in sites:
title = site.select('a/text()').extract()
link = site.select('a/@href').extract()
desc = site.select('text()').extract()
print title, link, desc
.. note::
For a more detailed description of using nested selectors see
:ref:`topics-selectors-nesting-selectors` and
:ref:`topics-selectors-relative-xpaths` in the :ref:`topics-selectors`
documentation.
Let's add this code to our spider::
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class DmozSpider(BaseSpider):
name = "dmoz.org"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//ul[2]/li')
for site in sites:
title = site.select('a/text()').extract()
link = site.select('a/@href').extract()
desc = site.select('text()').extract()
print title, link, desc
SPIDER = DmozSpider()
Now try crawling the dmoz.org domain again, and you'll see the sites being
printed in your output. Run::
python scrapy-ctl.py crawl dmoz.org
Using our item
--------------
:class:`~scrapy.item.Item` objects are custom Python dicts; you can access the
values of their fields (attributes of the class we defined earlier) using the
standard dict syntax, like::
>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'
Spiders are expected to return their scraped data inside
:class:`~scrapy.item.Item` objects, so to actually return the data we've
scraped so far, the code for our Spider should look like this::
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dmoz.items import DmozItem
class DmozSpider(BaseSpider):
name = "dmoz.org"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//ul[2]/li')
items = []
for site in sites:
item = DmozItem()
item['title'] = site.select('a/text()').extract()
item['link'] = site.select('a/@href').extract()
item['desc'] = site.select('text()').extract()
items.append(item)
return items
SPIDER = DmozSpider()
Now doing a crawl on the dmoz.org domain yields ``DmozItem`` objects::
[dmoz.org] DEBUG: Scraped DmozItem(desc=[u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'], link=[u'http://gnosis.cx/TPiP/'], title=[u'Text Processing in Python']) in <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[dmoz.org] DEBUG: Scraped DmozItem(desc=[u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'], link=[u'http://www.informit.com/store/product.aspx?isbn=0130211192'], title=[u'XML Processing with Python']) in <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
Storing the data (using an Item Pipeline)
=========================================
After an item has been scraped by a Spider, it is sent to the :ref:`Item
Pipeline `.
The Item Pipeline is a group of user-written Python classes that implement a
simple method. They receive an Item and perform an action over it (for example:
validation, checking for duplicates, or storing it in a database), and then
decide if the Item continues through the Pipeline or it's dropped and no longer
processed.
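For example, a minimal validation pipeline that drops incomplete items could
look like this (just a sketch; the ``DropItem`` exception is assumed to be
importable from ``scrapy.core.exceptions``, which may vary between Scrapy
versions)::

    from scrapy.core.exceptions import DropItem  # assumed import path

    class RequiredFieldsPipeline(object):
        """Drop items without a title; let everything else continue."""

        def process_item(self, spider, item):
            # extract() returns a list; an empty list means nothing was matched
            if not item['title']:
                raise DropItem("Missing title in %s" % item)
            return item
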
In small projects (like the one in this tutorial) we will use only one Item
Pipeline that just stores our Items.
As with Items, a pipeline placeholder has been set up for you during the
project creation step; it's in ``dmoz/pipelines.py`` and looks like this::
# Define your item pipelines here
class DmozPipeline(object):
def process_item(self, spider, item):
return item
We have to override the ``process_item`` method in order to store our Items
somewhere.
Here's a simple pipeline for storing the scraped items into a CSV (comma
separated values) file using the standard library `csv module`_. Note that the
spider stored each field as a list (that's what ``extract()`` returns), so the
pipeline writes the first element of each list::
import csv
class CsvWriterPipeline(object):
def __init__(self):
self.csvwriter = csv.writer(open('items.csv', 'wb'))
def process_item(self, spider, item):
self.csvwriter.writerow([item['title'][0], item['link'][0], item['desc'][0]])
return item
.. _csv module: http://docs.python.org/library/csv.html
Don't forget to enable the pipeline by adding it to the
:setting:`ITEM_PIPELINES` setting in your ``settings.py``, like this::
ITEM_PIPELINES = ['dmoz.pipelines.CsvWriterPipeline']
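If you later add more pipelines (for instance, the hypothetical
``RequiredFieldsPipeline`` sketched above), list them together and they will be
applied in the order given::

    ITEM_PIPELINES = [
        'dmoz.pipelines.RequiredFieldsPipeline',  # hypothetical validation pipeline
        'dmoz.pipelines.CsvWriterPipeline',
    ]
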
Finale
======
This tutorial covers only the basics of Scrapy, but there are a lot of other
features not mentioned here. We recommend you continue reading the section
:ref:`topics-index`.