.. _intro-tutorial:

===============
Scrapy Tutorial
===============

In this tutorial, we'll assume that Scrapy is already installed on your system.
If that's not the case, see :ref:`intro-install`.

We are going to use `Open directory project (dmoz) <http://www.dmoz.org/>`_ as
our example domain to scrape.

This tutorial will walk you through these tasks:

1. Creating a new Scrapy project
2. Defining the Items you will extract
3. Writing a :ref:`spider <topics-spiders>` to crawl a site and extract
   :ref:`Items <topics-items>`
4. Writing an :ref:`Item Pipeline <topics-item-pipeline>` to store the
   extracted Items

Scrapy is written in Python_. If you're new to the language you might want to
start by getting an idea of what the language is like, to get the most out of
Scrapy. If you're already familiar with other languages, and want to learn
Python quickly, we recommend `Learn Python The Hard Way`_. If you're new to
programming and want to start with Python, take a look at `this list of Python
resources for non-programmers`_.

.. _Python: https://www.python.org/
.. _this list of Python resources for non-programmers: https://wiki.python.org/moin/BeginnersGuide/NonProgrammers
.. _Learn Python The Hard Way: http://learnpythonthehardway.org/book/

Creating a project
==================

Before you start scraping, you will have to set up a new Scrapy project. Enter a
directory where you'd like to store your code and run::

    scrapy startproject tutorial

This will create a ``tutorial`` directory with the following contents::

    tutorial/
        scrapy.cfg            # deploy configuration file

        tutorial/             # project's Python module, you'll import your code from here
            __init__.py

            items.py          # project items file

            pipelines.py      # project pipelines file

            settings.py       # project settings file

            spiders/          # a directory where you'll later put your spiders
                __init__.py
                ...

Defining our Item
=================

`Items` are containers that will be loaded with the scraped data; they work
like simple Python dicts. While you can use plain Python dicts with Scrapy,
`Items` provide additional protection against populating undeclared fields,
preventing typos. They can also be used with :ref:`Item Loaders
<topics-loaders>`, a mechanism with helpers to conveniently populate `Items`.

They are declared by creating a :class:`scrapy.Item <scrapy.item.Item>` class and defining
its attributes as :class:`scrapy.Field <scrapy.item.Field>` objects, much like in an ORM
(don't worry if you're not familiar with ORMs; you will see that this is an
easy task).

We begin by modeling the item that we will use to hold the site's data obtained
from dmoz.org. As we want to capture the name, URL and description of the
sites, we define fields for each of these three attributes. To do that, we edit
``items.py``, found in the ``tutorial`` directory. Our Item class looks like this::

    import scrapy

    class DmozItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()

This may seem complicated at first, but defining the item allows you to use other handy
components of Scrapy that need to know what your item looks like.

Our first Spider
================

Spiders are classes that you define and that Scrapy uses to scrape information
from a domain (or group of domains).

They define an initial list of URLs to download, how to follow links, and how
to parse the contents of pages to extract :ref:`items <topics-items>`.

To create a Spider, you must subclass :class:`scrapy.Spider <scrapy.spider.Spider>` and
define some attributes:

* :attr:`~scrapy.spider.Spider.name`: identifies the Spider. It must be
  unique, that is, you can't set the same name for different Spiders.

* :attr:`~scrapy.spider.Spider.start_urls`: a list of URLs where the
  Spider will begin to crawl from. The first pages downloaded will be those
  listed here. The subsequent URLs will be generated successively from data
  contained in the start URLs.

* :meth:`~scrapy.spider.Spider.parse`: a method of the spider, which will
  be called with the downloaded :class:`~scrapy.http.Response` object of each
  start URL. The response is passed to the method as the first and only
  argument.

  This method is responsible for parsing the response and returning scraped
  data (as :class:`~scrapy.item.Item` objects) and more URLs to follow (as
  :class:`~scrapy.http.Request` objects).

This is the code for our first Spider; save it in a file named
``dmoz_spider.py`` under the ``tutorial/spiders`` directory::

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]

        def parse(self, response):
            filename = response.url.split("/")[-2] + '.html'
            with open(filename, 'wb') as f:
                f.write(response.body)

Crawling
--------

To put our spider to work, go to the project's top level directory and run::

    scrapy crawl dmoz

This command runs the spider with name ``dmoz`` that we've just added, which
will send some requests to the ``dmoz.org`` domain. You will get an output
similar to this::

    2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
    2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
    2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
    2014-01-23 18:13:07-0400 [scrapy] INFO: Spider opened
    2014-01-23 18:13:08-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
    2014-01-23 18:13:09-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
    2014-01-23 18:13:09-0400 [scrapy] INFO: Closing spider (finished)

.. note::

   At the end you can see a log line for each URL defined in ``start_urls``.
   Because these URLs are the starting ones, they have no referrers, which is
   shown at the end of the log line, where it says ``(referer: None)``.

Now, check the files in the current directory. You should notice two new files
have been created: *Books.html* and *Resources.html*, with the content for the
respective URLs, as our ``parse`` method instructs.

What just happened under the hood?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Scrapy creates :class:`scrapy.Request <scrapy.http.Request>` objects
for each URL in the ``start_urls`` attribute of the Spider, and assigns
them the ``parse`` method of the spider as their callback function.

These Requests are scheduled, then executed, and :class:`scrapy.http.Response`
objects are returned and then fed back to the spider, through the
:meth:`~scrapy.spider.Spider.parse` method.
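
If it helps to picture this, the default behaviour is roughly equivalent to
defining ``start_requests()`` yourself. This is only a minimal sketch, not
something you need to add to the tutorial spider; you would only override
``start_requests()`` when you need custom initial requests::

    import scrapy

    class DmozSpider(scrapy.Spider):
        # ... name, allowed_domains and start_urls as before ...

        def start_requests(self):
            # Roughly what Scrapy does by default: one Request per start URL,
            # with the spider's parse() method as the callback.
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse)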

Extracting Items
----------------

Introduction to Selectors
^^^^^^^^^^^^^^^^^^^^^^^^^

There are several ways to extract data from web pages. Scrapy uses a mechanism
based on `XPath`_ or `CSS`_ expressions called :ref:`Scrapy Selectors
<topics-selectors>`. For more information about selectors and other extraction
mechanisms see the :ref:`Selectors documentation <topics-selectors>`.

.. _XPath: http://www.w3.org/TR/xpath
.. _CSS: http://www.w3.org/TR/selectors

Here are some examples of XPath expressions and their meanings:

* ``/html/head/title``: selects the ``<title>`` element, inside the ``<head>``
  element of an HTML document

* ``/html/head/title/text()``: selects the text inside the aforementioned
  ``<title>`` element.

* ``//td``: selects all the ``<td>`` elements

* ``//div[@class="mine"]``: selects all ``div`` elements which contain an
  attribute ``class="mine"``
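
If you want to try expressions like these right away, you can feed a small,
hand-written HTML snippet (made up for this example) to a Scrapy selector.
A quick sketch::

    >>> from scrapy.selector import Selector
    >>> html = u'<html><head><title>Example page</title></head><body><div class="mine">hello</div></body></html>'
    >>> sel = Selector(text=html)
    >>> sel.xpath('/html/head/title/text()').extract()
    [u'Example page']
    >>> sel.xpath('//div[@class="mine"]').extract()
    [u'<div class="mine">hello</div>']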

These are just a couple of simple examples of what you can do with XPath, but
XPath expressions are indeed much more powerful. To learn more about XPath, we
recommend `this tutorial to learn XPath through examples
<http://zvon.org/comp/r/tut-XPath_1.html>`_, and `this tutorial to learn "how
to think in XPath" <http://plasmasturm.org/log/xpath101/>`_.

.. note:: **CSS vs XPath:** you can go a long way extracting data from web pages
   using only CSS selectors. However, XPath offers more power because besides
   navigating the structure, it can also look at the content: you're
   able to select things like: *the link that contains the text 'Next Page'*.
   Because of this, we encourage you to learn about XPath even if you
   already know how to construct CSS selectors.

For working with CSS and XPath expressions, Scrapy provides the
:class:`~scrapy.selector.Selector` class and convenient shortcuts to avoid
instantiating selectors yourself every time you need to select something from a
response.

You can see selectors as objects that represent nodes in the document
structure. So, the first instantiated selectors are associated with the root
node, or the entire document.

Selectors have four basic methods (click on the method to see the complete API
documentation):

* :meth:`~scrapy.selector.Selector.xpath`: returns a list of selectors, each of
  which represents the nodes selected by the XPath expression given as
  argument.

* :meth:`~scrapy.selector.Selector.css`: returns a list of selectors, each of
  which represents the nodes selected by the CSS expression given as argument.

* :meth:`~scrapy.selector.Selector.extract`: returns a unicode string with the
  selected data.

* :meth:`~scrapy.selector.Selector.re`: returns a list of unicode strings
  extracted by applying the regular expression given as argument.
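
As a quick taste of these methods before using them on a real page, here is a
minimal sketch using the same kind of hand-written HTML snippet as above::

    >>> from scrapy.selector import Selector
    >>> sel = Selector(text=u'<html><body><span>good</span></body></html>')
    >>> sel.xpath('//span/text()').extract()
    [u'good']
    >>> sel.css('span::text').extract()
    [u'good']
    >>> sel.xpath('//span/text()').re('(g\w+)')
    [u'good']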

Trying Selectors in the Shell
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To illustrate the use of Selectors we're going to use the built-in :ref:`Scrapy
shell <topics-shell>`, which also requires `IPython <http://ipython.org/>`_ (an
extended Python console) installed on your system.

To start a shell, you must go to the project's top level directory and run::

    scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

.. note::

   Remember to always enclose URLs in quotes when running the Scrapy shell from
   the command line; otherwise URLs containing arguments (i.e. the ``&``
   character) will not work.

This is what the shell looks like::

    [ ... Scrapy log here ... ]
    2014-01-23 17:11:42-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
    [s]   item       {}
    [s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    [s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
    [s]   settings   <scrapy.settings.Settings object at 0x3fadc50>
    [s]   spider     <Spider 'default' at 0x3cebf50>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser

    In [1]:

After the shell loads, you will have the response fetched in a local
``response`` variable, so if you type ``response.body`` you will see the body
of the response, or you can type ``response.headers`` to see its headers.

More importantly, ``response`` has a ``selector`` attribute which is an instance of the
:class:`~scrapy.selector.Selector` class, instantiated with this particular ``response``.
You can run queries on ``response`` by calling ``response.selector.xpath()`` or
``response.selector.css()``. There are also some convenience shortcuts like ``response.xpath()``
or ``response.css()`` which map directly to ``response.selector.xpath()`` and
``response.selector.css()``.

So let's try it::

    In [1]: response.xpath('//title')
    Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

    In [2]: response.xpath('//title').extract()
    Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

    In [3]: response.xpath('//title/text()')
    Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

    In [4]: response.xpath('//title/text()').extract()
    Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

    In [5]: response.xpath('//title/text()').re('(\w+):')
    Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']

Extracting the data
^^^^^^^^^^^^^^^^^^^

Now, let's try to extract some real information from those pages.

You could type ``response.body`` in the console, and inspect the source code to
figure out the XPaths you need to use. However, inspecting the raw HTML code
there could become a very tedious task. To make it easier, you can
use Firefox Developer Tools or some Firefox extensions like Firebug. For more
information see :ref:`topics-firebug` and :ref:`topics-firefox`.

After inspecting the page source, you'll find that the websites' information
is inside a ``<ul>`` element, in fact the *second* ``<ul>`` element.

So we can select each ``<li>`` element belonging to the sites list with this
code::

    response.xpath('//ul/li')

And from them, the sites' descriptions::

    response.xpath('//ul/li/text()').extract()

The sites' titles::

    response.xpath('//ul/li/a/text()').extract()

And the sites' links::

    response.xpath('//ul/li/a/@href').extract()

As we've said before, each ``.xpath()`` call returns a list of selectors, so we can
concatenate further ``.xpath()`` calls to dig deeper into a node. We are going to use
that property here, so::

    for sel in response.xpath('//ul/li'):
        title = sel.xpath('a/text()').extract()
        link = sel.xpath('a/@href').extract()
        desc = sel.xpath('text()').extract()
        print title, link, desc

.. note::

   For a more detailed description of using nested selectors, see
   :ref:`topics-selectors-nesting-selectors` and
   :ref:`topics-selectors-relative-xpaths` in the :ref:`topics-selectors`
   documentation.

Let's add this code to our spider::

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]

        def parse(self, response):
            for sel in response.xpath('//ul/li'):
                title = sel.xpath('a/text()').extract()
                link = sel.xpath('a/@href').extract()
                desc = sel.xpath('text()').extract()
                print title, link, desc

Now try crawling dmoz.org again and you'll see sites being printed
in your output. Run::

    scrapy crawl dmoz

Using our item
--------------

:class:`~scrapy.item.Item` objects are custom Python dicts; you can access the
values of their fields (attributes of the class we defined earlier) using the
standard dict syntax like::

    >>> item = DmozItem()
    >>> item['title'] = 'Example title'
    >>> item['title']
    'Example title'
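
Trying to populate a field that wasn't declared raises a ``KeyError``; this is
the typo protection mentioned earlier. A quick sketch, using a deliberately
misspelled field name (the exact error message may vary between Scrapy
versions)::

    >>> item['titel'] = 'Example title'   # note the typo: no such field
    Traceback (most recent call last):
        ...
    KeyError: 'DmozItem does not support field: titel'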

So, in order to return the data we've scraped so far, the final code for our
Spider would be like this::

    import scrapy

    from tutorial.items import DmozItem

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]

        def parse(self, response):
            for sel in response.xpath('//ul/li'):
                item = DmozItem()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['desc'] = sel.xpath('text()').extract()
                yield item

.. note:: You can find a fully-functional variant of this spider in the dirbot_
   project available at https://github.com/scrapy/dirbot

Now crawling dmoz.org yields ``DmozItem`` objects::

    [scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
         {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
          'link': [u'http://gnosis.cx/TPiP/'],
          'title': [u'Text Processing in Python']}
    [scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
         {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
          'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
          'title': [u'XML Processing with Python']}

Following links
===============

Let's say, instead of just scraping the stuff from the *Books* and *Resources*
pages, you want everything that is under the `Python directory
<http://www.dmoz.org/Computers/Programming/Languages/Python/>`_.

Now that you know how to extract data from a page, why not extract the links
to the pages you are interested in, follow them, and then extract the data you
want from all of them?

Here is a modification to our spider that does just that::

    import scrapy

    from tutorial.items import DmozItem

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/",
        ]

        def parse(self, response):
            for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
                url = response.urljoin(href.extract())
                yield scrapy.Request(url, callback=self.parse_dir_contents)

        def parse_dir_contents(self, response):
            for sel in response.xpath('//ul/li'):
                item = DmozItem()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['desc'] = sel.xpath('text()').extract()
                yield item

Now the ``parse()`` method only extracts the interesting links from the page,
builds a full absolute URL using the ``response.urljoin`` method (since the links can
be relative) and yields new requests to be sent later, registering as callback
the method ``parse_dir_contents()`` that will ultimately scrape the data we want.

What you see here is Scrapy's mechanism of following links: when you yield
a Request in a callback method, Scrapy will schedule that request to be sent
and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules
you define, and extract different kinds of data depending on the page being
visited.

A common pattern is a callback method that extracts some items, looks for a link
to follow to the next page and then yields a ``Request`` with the same callback
for it::

    def parse_articles_follow_next_page(self, response):
        for article in response.xpath("//article"):
            item = ArticleItem()

            # ... extract article data here ...

            yield item

        next_page = response.css("ul.navigation > li.next-page > a::attr('href')")
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_articles_follow_next_page)

This creates a sort of loop, following all the links to the next page until it
doesn't find one -- handy for crawling blogs, forums and other sites with
pagination.

Another common pattern is to build an item with data from more than one page,
using a :ref:`trick to pass additional data to the callbacks
<topics-request-response-ref-request-callback-arguments>`.
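
A minimal sketch of that trick (the URL and the two callback names below are
just placeholders): the first callback stores the partially populated item in
``request.meta``, and the second callback picks it up and finishes it::

    def parse_page1(self, response):
        item = DmozItem()
        item['title'] = response.xpath('//title/text()').extract()
        request = scrapy.Request("http://www.example.com/some_page.html",
                                 callback=self.parse_page2)
        # Stash the partially populated item so the next callback can finish it.
        request.meta['item'] = item
        yield request

    def parse_page2(self, response):
        item = response.meta['item']
        item['link'] = response.url
        yield item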

.. note::

   As an example spider that leverages this mechanism, check out the
   :class:`~scrapy.spiders.CrawlSpider` class for a generic spider
   that implements a small rules engine that you can use to write your
   crawlers on top of it.
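
For a rough idea of what that looks like, here is a minimal, untested sketch of
a :class:`~scrapy.spiders.CrawlSpider` version of this tutorial's spider (the
spider name and the link-extraction pattern are assumptions, not part of the
tutorial project)::

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    from tutorial.items import DmozItem

    class DmozCrawlSpider(CrawlSpider):
        name = "dmoz_crawl"
        allowed_domains = ["dmoz.org"]
        start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/"]

        # One declarative rule instead of a hand-written parse() callback:
        # follow links under the Python category and parse each page we reach.
        rules = (
            Rule(LinkExtractor(allow=r'/Computers/Programming/Languages/Python/'),
                 callback='parse_dir_contents', follow=True),
        )

        def parse_dir_contents(self, response):
            for sel in response.xpath('//ul/li'):
                item = DmozItem()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['desc'] = sel.xpath('text()').extract()
                yield item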

Storing the scraped data
========================

The simplest way to store the scraped data is by using :ref:`Feed exports
<topics-feed-exports>`, with the following command::

    scrapy crawl dmoz -o items.json

That will generate an ``items.json`` file containing all scraped items,
serialized in `JSON`_.
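
If you want to double-check the output, the file is plain JSON and can be read
back with the standard library. A small sketch, assuming the crawl above has
already produced ``items.json`` in the current directory::

    import json

    with open('items.json') as f:
        items = json.load(f)

    print len(items), 'items scraped'
    print items[0]['title']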

In small projects (like the one in this tutorial), that should be enough.
However, if you want to perform more complex things with the scraped items, you
can write an :ref:`Item Pipeline <topics-item-pipeline>`. As with Items, a
placeholder file for Item Pipelines has been set up for you when the project was
created, in ``tutorial/pipelines.py``, though you don't need to implement any item
pipelines if you just want to store the scraped items.
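
As a taste of what an item pipeline looks like, here is a minimal sketch that
drops items scraped without a title (the class name is made up for this
example; it is not part of the generated project)::

    # tutorial/pipelines.py
    from scrapy.exceptions import DropItem

    class RequireTitlePipeline(object):
        """Drop any item that was scraped without a title."""

        def process_item(self, item, spider):
            if not item.get('title'):
                raise DropItem("Missing title in %s" % item)
            return item

To activate it, you would also list the class in the ``ITEM_PIPELINES`` setting
in ``tutorial/settings.py``, for example
``ITEM_PIPELINES = {'tutorial.pipelines.RequireTitlePipeline': 300}``.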

Next steps
==========

This tutorial covered only the basics of Scrapy, but there are a lot of other
features not mentioned here. Check the :ref:`topics-whatelse` section in the
:ref:`intro-overview` chapter for a quick overview of the most important ones.

Then, we recommend you continue by playing with an example project (see
:ref:`intro-examples`), and then continue with the section
:ref:`section-basics`.

.. _JSON: http://en.wikipedia.org/wiki/JSON
.. _dirbot: https://github.com/scrapy/dirbot