.. _intro-overview:

==================
Scrapy at a glance
==================

Scrapy is an application framework for crawling web sites and extracting
structured data, which can be used for a wide range of useful applications,
like data mining, information processing or historical archival.

Even though Scrapy was originally designed for `screen scraping`_ (more
precisely, `web scraping`_), it can also be used to extract data using APIs
(such as `Amazon Associates Web Services`_) or as a general purpose web
crawler.

Walk-through of an example spider
=================================

In order to show you what Scrapy brings to the table, we'll walk you through
an example Scrapy Spider, using the simplest way to run a spider.

Once you're ready to dive in more, you can :ref:`follow the tutorial
and build a full-blown Scrapy project <intro-tutorial>`.

So, here's the code for a spider that follows the links to the top
voted questions on StackOverflow and scrapes some data from each page::

    import scrapy


    class StackOverflowSpider(scrapy.Spider):
        name = 'stackoverflow'
        start_urls = ['http://stackoverflow.com/questions?sort=votes']

        def parse(self, response):
            for href in response.css('.question-summary h3 a::attr(href)'):
                full_url = response.urljoin(href.extract())
                yield scrapy.Request(full_url, callback=self.parse_question)

        def parse_question(self, response):
            title = response.css('h1 a::text').extract_first()
            votes = response.css('.question .vote-count-post::text').extract_first()
            tags = response.css('.question .post-tag::text').extract()
            body = response.css('.question .post-text').extract_first()
            yield {
                'title': title,
                'votes': votes,
                'body': body,
                'tags': tags,
                'link': response.url,
            }

Put this in a file, name it something like ``stackoverflow_spider.py``
and run the spider using the :command:`runspider` command::

    scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json

When this finishes, the ``top-stackoverflow-questions.json`` file will contain
a list of the most upvoted questions on StackOverflow in JSON format, including
the title, link, number of upvotes, a list of the tags and the question content
in HTML.
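
Each record in that file mirrors the dict yielded by ``parse_question``;
schematically (values elided here, since the actual contents depend on the
live page), the file holds an array of entries like::

    [
        {
            "title": "...",
            "votes": "...",
            "body": "...",
            "tags": ["...", "..."],
            "link": "..."
        }
    ]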

What just happened?
-------------------

When you ran the command ``scrapy runspider somefile.py``, Scrapy looked
for a Spider definition inside it and ran it through its crawler engine.

The crawl started by making requests to the URLs defined in the ``start_urls``
attribute (in this case, only the URL for the StackOverflow top questions
page), and then called the default callback method ``parse``, passing the
response object as an argument.
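
Incidentally, ``start_urls`` is just a convenient shorthand: Scrapy builds
those first requests for you. A simplified sketch of the equivalent
``start_requests`` method (the real default also flags these requests to
bypass the duplicate filter) would be::

    def start_requests(self):
        # one request per start URL; responses go to parse() by default
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)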

In the ``parse`` callback, we scrape the links to the questions and
yield a few more requests to be processed, registering the method
``parse_question`` as the callback to be invoked as each of those
requests completes.

Finally, the ``parse_question`` callback scrapes the question data for
each page, yielding a dict, which Scrapy then collects and writes to a
JSON file as requested in the command line.

.. note::

    This uses :ref:`feed exports <topics-feed-exports>` to generate the JSON
    file. You can easily change the export format (XML or CSV, for example)
    or the storage backend (FTP or `Amazon S3`_, for example). You can also
    write an :ref:`item pipeline <topics-item-pipeline>` to store the items
    in a database.
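
As a rough sketch of that last option, an item pipeline is a class with a
``process_item`` method (plus optional ``open_spider``/``close_spider``
hooks), enabled through the ``ITEM_PIPELINES`` setting. The pipeline below
is illustrative, not part of Scrapy itself::

    import sqlite3

    class SQLitePipeline(object):
        """Illustrative pipeline: store scraped questions in SQLite."""

        def open_spider(self, spider):
            self.db = sqlite3.connect('questions.db')
            self.db.execute('CREATE TABLE IF NOT EXISTS questions '
                            '(title TEXT, votes TEXT, link TEXT)')

        def close_spider(self, spider):
            self.db.commit()
            self.db.close()

        def process_item(self, item, spider):
            # called once for every item the spider yields
            self.db.execute('INSERT INTO questions VALUES (?, ?, ?)',
                            (item['title'], item['votes'], item['link']))
            return item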

.. _topics-whatelse:

What else?
==========

You've seen how to extract and store items from a website using Scrapy, but
this is just the surface. Scrapy provides a lot of powerful features for
making scraping easy and efficient, such as:

* An :ref:`interactive shell console <topics-shell>` (IPython aware) for trying
  out the CSS and XPath expressions to scrape data, very useful when writing or
  debugging your spiders (there's a short example after this list).

* Built-in support for :ref:`generating feed exports <topics-feed-exports>` in
  multiple formats (JSON, CSV, XML) and storing them in multiple backends
  (FTP, S3, local filesystem)

* Robust encoding support and auto-detection, for dealing with foreign,
  non-standard and broken encoding declarations.

* Strong :ref:`extensibility support <extending-scrapy>` and lots of built-in
  extensions and middlewares to handle things like cookies, crawl throttling,
  HTTP caching, HTTP compression, user-agent spoofing, robots.txt,
  stats collection and many more.

* A :ref:`Telnet console <topics-telnetconsole>` for hooking into a Python
  console running inside your Scrapy process, to introspect and debug your
  crawler

* A caching DNS resolver

* Support for crawling based on URLs discovered through `Sitemaps`_

* A media pipeline for :ref:`automatically downloading images <topics-images>`
  (or any other media) associated with the scraped items
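
For instance, the shell mentioned in the first bullet lets you try selectors
against a live page before putting them in a spider: start it with
``scrapy shell 'http://stackoverflow.com/questions?sort=votes'`` and then
experiment at its prompt::

    >>> response.css('.question-summary h3 a::text').extract_first()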

What's next?
============

The next obvious steps for you are to `download Scrapy`_, read :ref:`the
tutorial <intro-tutorial>` and join `the community`_. Thanks for your
interest!

.. _download Scrapy: http://scrapy.org/download/
.. _the community: http://scrapy.org/community/
.. _screen scraping: http://en.wikipedia.org/wiki/Screen_scraping
.. _web scraping: http://en.wikipedia.org/wiki/Web_scraping
.. _Amazon Associates Web Services: http://aws.amazon.com/associates/
.. _Amazon S3: http://aws.amazon.com/s3/
.. _Sitemaps: http://www.sitemaps.org