.. _intro-overview:

==================
Scrapy at a glance
==================

Scrapy is an application framework for crawling web sites and extracting
structured data which can be used for a wide range of useful applications, like
data mining, information processing or historical archival.

Even though Scrapy was originally designed for `screen scraping`_ (more
precisely, `web scraping`_), it can also be used to extract data using APIs
(such as `Amazon Associates Web Services`_) or as a general purpose web
crawler.

.. _screen scraping: http://en.wikipedia.org/wiki/Screen_scraping
.. _web scraping: http://en.wikipedia.org/wiki/Web_scraping
.. _Amazon Associates Web Services: http://aws.amazon.com/associates/

The purpose of this document is to introduce you to the concepts behind Scrapy
so you can get an idea of how it works and decide if Scrapy is what you need.

When you're ready to start a project, you can :ref:`start with the tutorial
<intro-tutorial>`.

Pick a website
==============

So you need to extract some information from a website, but the website doesn't
provide any API or mechanism to access that info from a computer program.
Scrapy can help you extract that information. Let's say we want to extract
information about all torrent files added today on the `mininova`_ torrent
site.

.. _mininova: http://www.mininova.org

The list of all torrents added today can be found on this page:
http://www.mininova.org/today

Write a Spider to extract the Items
===================================

Now we'll write a Spider which defines the start URL
(http://www.mininova.org/today), the rules for following links and the rules
for extracting the data from pages.

If we take a look at that page content we'll see that all torrent URLs are like
http://www.mininova.org/tor/NUMBER where ``NUMBER`` is an integer. We'll use
that to construct the regular expression for the links to follow: ``/tor/\d+``.
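If you want to sanity-check that pattern before wiring it into a spider, here is
a minimal sketch using Python's ``re`` module (the example URLs are just for
illustration)::

    import re

    # the same pattern we'll pass to the link extractor below
    pattern = re.compile(r"/tor/\d+")

    # a torrent detail URL matches, the listing page doesn't
    assert pattern.search("http://www.mininova.org/tor/2657665")
    assert pattern.search("http://www.mininova.org/today") is None
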
For extracting data we'll use `XPath`_ to select the part of the document where
the data is to be extracted. Let's take one of those torrent pages:
http://www.mininova.org/tor/2657665

.. _XPath: http://www.w3.org/TR/xpath

And look at the page HTML source to construct the XPath expressions to select
the data we want to extract: torrent name, description and size.

.. highlight:: html

By looking at the page HTML source we can see that the file name is contained
inside a ``<h1>`` tag::

    <h1>Home[2009][Eng]XviD-ovd</h1>

.. highlight:: none

An XPath expression to extract the name could be::

    //h1/text()

.. highlight:: html

And the description is contained inside a ``<div>`` tag with ``id="description"``::

    <h2>Description:</h2>

    <div id="description">
    "HOME" - a documentary film by Yann Arthus-Bertrand
    <br/>
    <br/>
    ***
    <br/>
    <br/>
    "We are living in exceptional times. Scientists tell us that we have 10 years to change the way we live, avert the depletion of natural resources and the catastrophic evolution of the Earth's climate.

    ...

.. highlight:: none

An XPath expression to select the description could be::

    //div[@id='description']

.. highlight:: html

Finally, the file size is contained in the second ``<p>`` tag inside the ``<div>``
tag with ``id="specifications"``::

    <div id="specifications">

    <p>
    <strong>Category:</strong>
    <a href="/cat/4">Movies</a> > <a href="/sub/35">Documentary</a>
    </p>

    <p>
    <strong>Total size:</strong>
    699.79 megabyte</p>

.. highlight:: none

An XPath expression to select the file size could be::

    //div[@id='specifications']/p[2]/text()[2]

.. highlight:: python

For more information about XPath see the `XPath reference`_.

.. _XPath reference: http://www.w3.org/TR/xpath

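You can also try these expressions interactively before putting them in a
spider. Here is a minimal sketch using Scrapy's ``HtmlXPathSelector`` against a
made-up, trimmed-down stand-in for the real page::

    from scrapy.http import HtmlResponse
    from scrapy.selector import HtmlXPathSelector

    # hypothetical, trimmed-down stand-in for the real page body
    body = "<html><body><h1>Home[2009][Eng]XviD-ovd</h1></body></html>"
    response = HtmlResponse(url="http://www.mininova.org/tor/2657665", body=body)

    hxs = HtmlXPathSelector(response)
    hxs.select("//h1/text()").extract()  # -> [u'Home[2009][Eng]XviD-ovd']
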
Finally, here's the spider code::

    class MininovaSpider(CrawlSpider):

        name = 'mininova.org'
        allowed_domains = ['mininova.org']
        start_urls = ['http://www.mininova.org/today']
        rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

        def parse_torrent(self, response):
            x = HtmlXPathSelector(response)

            torrent = TorrentItem()
            torrent['url'] = response.url
            torrent['name'] = x.select("//h1/text()").extract()
            torrent['description'] = x.select("//div[@id='description']").extract()
            torrent['size'] = x.select("//div[@id='specifications']/p[2]/text()[2]").extract()
            return torrent

For brevity's sake, we intentionally left out the import statements and the
``TorrentItem`` class definition (a sketch of both is shown below).

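As a rough sketch (the exact import paths depend on your Scrapy version; these
are the ``scrapy.contrib`` paths used by the Scrapy releases this example
targets), the missing pieces would look something like this::

    from scrapy.item import Item, Field
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector

    class TorrentItem(Item):
        # one Field per attribute the spider fills in
        url = Field()
        name = Field()
        description = Field()
        size = Field()
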
Write a pipeline to store the items extracted
=============================================

Now let's write an :ref:`topics-item-pipeline` that serializes and stores the
extracted item into a file using `pickle`_::

    import pickle

    class StoreItemPipeline(object):

        def process_item(self, item, spider):
            torrent_id = item['url'].split('/')[-1]
            f = open("torrent-%s.pickle" % torrent_id, "wb")
            pickle.dump(item, f)
            f.close()
            return item

.. _pickle: http://docs.python.org/library/pickle.html

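To have Scrapy actually run the pipeline, it also needs to be enabled in your
project settings. A minimal sketch, assuming a hypothetical project package
named ``myproject``::

    # settings.py of your Scrapy project ('myproject' is a made-up name)
    ITEM_PIPELINES = ['myproject.pipelines.StoreItemPipeline']
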
What else?
==========

You've seen how to extract and store items from a website using Scrapy, but
this is just the surface. Scrapy provides a lot of powerful features for making
scraping easy and efficient, such as:

* Built-in support for :ref:`selecting and extracting <topics-selectors>` data
  from HTML and XML sources

* Built-in support for :ref:`generating feed exports <topics-feed-exports>` in
  multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP,
  S3, filesystem)

* A media pipeline for :ref:`automatically downloading images <topics-images>`
  (or any other media) associated with the scraped items

* Support for :ref:`extending Scrapy <extending-scrapy>` by plugging in your
  own functionality using middlewares, extensions, and pipelines

* A wide range of built-in middlewares and extensions for handling compression,
  caching, cookies, authentication, user-agent spoofing, robots.txt,
  statistics, crawl depth restriction, etc.

* An :ref:`interactive scraping shell console <topics-shell>`, very useful for
  writing and debugging your spiders

* A built-in :ref:`Web service <topics-webservice>` for monitoring and
  controlling your bot

* A :ref:`Telnet console <topics-telnetconsole>` for full unrestricted access
  to a Python console inside your Scrapy process, to introspect and debug your
  crawler

* Built-in facilities for :ref:`logging <topics-logging>`, :ref:`collecting
  stats <topics-stats>`, and :ref:`sending email notifications <topics-email>`

What's next?
============
The next obvious steps are for you to `download Scrapy`_, read :ref:`the
tutorial <intro-tutorial>` and join `the community`_. Thanks for your
interest!

.. _download Scrapy: http://scrapy.org/download/
.. _the community: http://scrapy.org/community/