.. _intro-overview:

==================
Scrapy at a glance
==================

Scrapy is an application framework for crawling web sites and extracting
structured data, which can be used for a wide range of useful applications,
like data mining, information processing or historical archival.

Even though Scrapy was originally designed for `screen scraping`_ (more
precisely, `web scraping`_), it can also be used to extract data using APIs
(such as `Amazon Associates Web Services`_) or as a general purpose web
crawler.

Walk-through of an example spider
=================================

In order to show you what Scrapy brings to the table, we'll walk you through
an example Scrapy Spider, using the simplest way to run a spider.

Once you're ready to dive in more, you can :ref:`follow the tutorial
and build a full-blown Scrapy project <intro-tutorial>`.

So, here's the code for a spider that follows the links to the top
voted questions on StackOverflow and scrapes some data from each page::

    import scrapy


    class StackOverflowSpider(scrapy.Spider):
        name = 'stackoverflow'
        start_urls = ['http://stackoverflow.com/questions?sort=votes']

        def parse(self, response):
            for href in response.css('.question-summary h3 a::attr(href)'):
                full_url = response.urljoin(href.extract())
                yield scrapy.Request(full_url, callback=self.parse_question)

        def parse_question(self, response):
            title = response.css('h1 a::text').extract_first()
            votes = response.css('.question .vote-count-post::text').extract_first()
            tags = response.css('.question .post-tag::text').extract()
            body = response.css('.question .post-text').extract_first()
            yield {
                'title': title,
                'votes': votes,
                'body': body,
                'tags': tags,
                'link': response.url,
            }

Put this in a file, name it something like ``stackoverflow_spider.py``
and run the spider using the :command:`runspider` command::

    scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json

When this finishes, the ``top-stackoverflow-questions.json`` file will contain
a list of the most upvoted questions on StackOverflow in JSON format, including
the title, link, number of upvotes, a list of the tags and the question content
in HTML.
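
Each record in that file mirrors the dict yielded by ``parse_question``;
schematically (values elided here, since the actual contents depend on the
live page), the file holds an array of entries like::

    [
        {
            "title": "...",
            "votes": "...",
            "body": "...",
            "tags": ["...", "..."],
            "link": "..."
        }
    ]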

What just happened?
-------------------

When you ran the command ``scrapy runspider somefile.py``, Scrapy looked
for a Spider definition inside it and ran it through its crawler engine.

The crawl started by making requests to the URLs defined in the ``start_urls``
attribute (in this case, only the URL for the StackOverflow top questions
page), and then called the default callback method ``parse``, passing the
response object as an argument.
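
Incidentally, ``start_urls`` is just a convenient shorthand: Scrapy builds
those first requests for you. A simplified sketch of the equivalent
``start_requests`` method (the real default also flags these requests to
bypass the duplicate filter) would be::

    def start_requests(self):
        # one request per start URL; responses go to parse() by default
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)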

In the ``parse`` callback, we scrape the links to the questions and
yield a few more requests to be processed, registering the method
``parse_question`` as the callback to be invoked as each of those
requests completes.

Finally, the ``parse_question`` callback scrapes the question data for
each page, yielding a dict, which Scrapy then collects and writes to a
JSON file as requested in the command line.

.. note::

    This uses :ref:`feed exports <topics-feed-exports>` to generate the JSON
    file. You can easily change the export format (XML or CSV, for example)
    or the storage backend (FTP or `Amazon S3`_, for example). You can also
    write an :ref:`item pipeline <topics-item-pipeline>` to store the items
    in a database.
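
As a rough sketch of that last option, an item pipeline is a class with a
``process_item`` method (plus optional ``open_spider``/``close_spider``
hooks), enabled through the ``ITEM_PIPELINES`` setting. The pipeline below
is illustrative, not part of Scrapy itself::

    import sqlite3

    class SQLitePipeline(object):
        """Illustrative pipeline: store scraped questions in SQLite."""

        def open_spider(self, spider):
            self.db = sqlite3.connect('questions.db')
            self.db.execute('CREATE TABLE IF NOT EXISTS questions '
                            '(title TEXT, votes TEXT, link TEXT)')

        def close_spider(self, spider):
            self.db.commit()
            self.db.close()

        def process_item(self, item, spider):
            # called once for every item the spider yields
            self.db.execute('INSERT INTO questions VALUES (?, ?, ?)',
                            (item['title'], item['votes'], item['link']))
            return item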

.. _topics-whatelse:

What else?
==========

You've seen how to extract and store items from a website using Scrapy, but
this is just the surface. Scrapy provides a lot of powerful features for
making scraping easy and efficient, such as:

* An :ref:`interactive shell console <topics-shell>` (IPython aware) for trying
  out the CSS and XPath expressions to scrape data, very useful when writing or
  debugging your spiders (there's a short example after this list).

* Built-in support for :ref:`generating feed exports <topics-feed-exports>` in
  multiple formats (JSON, CSV, XML) and storing them in multiple backends
  (FTP, S3, local filesystem)

* Robust encoding support and auto-detection, for dealing with foreign,
  non-standard and broken encoding declarations.

* Strong :ref:`extensibility support <extending-scrapy>` and lots of built-in
  extensions and middlewares to handle things like cookies, crawl throttling,
  HTTP caching, HTTP compression, user-agent spoofing, robots.txt,
  stats collection and many more.

* A :ref:`Telnet console <topics-telnetconsole>` for hooking into a Python
  console running inside your Scrapy process, to introspect and debug your
  crawler

* A caching DNS resolver

* Support for crawling based on URLs discovered through `Sitemaps`_

* A media pipeline for :ref:`automatically downloading images <topics-images>`
  (or any other media) associated with the scraped items
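
For instance, the shell mentioned in the first bullet lets you try selectors
against a live page before putting them in a spider: start it with
``scrapy shell 'http://stackoverflow.com/questions?sort=votes'`` and then
experiment at its prompt::

    >>> response.css('.question-summary h3 a::text').extract_first()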

What's next?
============

The next obvious steps for you are to `download Scrapy`_, read :ref:`the
tutorial <intro-tutorial>` and join `the community`_. Thanks for your
interest!

.. _download Scrapy: http://scrapy.org/download/
.. _the community: http://scrapy.org/community/
.. _screen scraping: http://en.wikipedia.org/wiki/Screen_scraping
.. _web scraping: http://en.wikipedia.org/wiki/Web_scraping
.. _Amazon Associates Web Services: http://aws.amazon.com/associates/
.. _Amazon S3: http://aws.amazon.com/s3/
.. _Sitemaps: http://www.sitemaps.org