.. _intro-overview:

==================
Scrapy at a glance
==================

Scrapy is an application framework for crawling web sites and extracting
structured data which can be used for a wide range of useful applications, like
data mining, information processing or historical archival.

Even though Scrapy was originally designed for `web scraping`_, it can also be
used to extract data using APIs (such as `Amazon Associates Web Services`_) or
as a general purpose web crawler.


Walk-through of an example spider
=================================

In order to show you what Scrapy brings to the table, we'll walk you through an
example of a Scrapy Spider using the simplest way to run a spider.

Here's the code for a spider that scrapes famous quotes from the website
http://quotes.toscrape.com, following the pagination::

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/tag/humor/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.xpath('span/small/text()').extract_first(),
                }

            next_page = response.css('li.next a::attr("href")').extract_first()
            if next_page is not None:
                yield response.follow(next_page, self.parse)


Put this in a text file, name it something like ``quotes_spider.py``
and run the spider using the :command:`runspider` command::

    scrapy runspider quotes_spider.py -o quotes.json

When this finishes you will have a list of the quotes in JSON format in the
``quotes.json`` file, containing the text and author, looking like this
(reformatted here for better readability)::

    [{
        "author": "Jane Austen",
        "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
    },
    {
        "author": "Groucho Marx",
        "text": "\u201cOutside of a dog, a book is man's best friend. Inside of a dog it's too dark to read.\u201d"
    },
    {
        "author": "Steve Martin",
        "text": "\u201cA day without sunshine is like, you know, night.\u201d"
    },
    ...]


What just happened?
-------------------

When you ran the command ``scrapy runspider quotes_spider.py``, Scrapy looked for a
Spider definition inside it and ran it through its crawler engine.
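
If you'd rather drive the crawl from your own Python script than from the
command line, a minimal sketch along these lines uses Scrapy's
``CrawlerProcess`` (the ``quotes_spider`` import path and the feed settings
shown are illustrative)::

    from scrapy.crawler import CrawlerProcess

    from quotes_spider import QuotesSpider  # assumes the file above is importable

    process = CrawlerProcess(settings={
        'FEED_FORMAT': 'json',      # write scraped items as JSON...
        'FEED_URI': 'quotes.json',  # ...to this local file
    })
    process.crawl(QuotesSpider)
    process.start()  # blocks until the crawl is finished
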
The crawl started by making requests to the URLs defined in the ``start_urls``
attribute (in this case, only the URL for quotes in the *humor* category)
and called the default callback method ``parse``, passing the response object as
an argument. In the ``parse`` callback, we loop through the quote elements
using a CSS selector, yield a Python dict with the extracted quote text and author,
look for a link to the next page and schedule another request using the same
``parse`` method as callback.
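
For reference, ``response.follow`` is a convenience shortcut: the final
``yield`` in the spider above could equivalently build the request by hand,
joining the relative URL itself (a sketch of the same step, not a change to
the example)::

    next_page = response.css('li.next a::attr("href")').extract_first()
    if next_page is not None:
        # response.urljoin() resolves the relative href against the current page,
        # which is what response.follow() does for you
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
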
Here you notice one of the main advantages of Scrapy: requests are
:ref:`scheduled and processed asynchronously <topics-architecture>`. This
means that Scrapy doesn't need to wait for a request to be finished and
processed; it can send another request or do other things in the meantime. This
also means that other requests can keep going even if some request fails or an
error happens while handling it.

While this enables you to do very fast crawls (sending multiple concurrent
requests at the same time, in a fault-tolerant way), Scrapy also gives you
control over the politeness of the crawl through :ref:`a few settings
<topics-settings-ref>`. You can do things like setting a download delay between
each request, limiting the number of concurrent requests per domain or per IP,
and even :ref:`using an auto-throttling extension <topics-autothrottle>` that
tries to figure these out automatically.
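
For instance, the spider above could tune these knobs through its
``custom_settings`` attribute (a fragment; the values are purely
illustrative)::

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        custom_settings = {
            'DOWNLOAD_DELAY': 1.5,                # pause between requests to the same site
            'CONCURRENT_REQUESTS_PER_DOMAIN': 4,  # cap parallel requests per domain
            'AUTOTHROTTLE_ENABLED': True,         # let AutoThrottle adjust delays dynamically
        }
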
.. note::

    This is using :ref:`feed exports <topics-feed-exports>` to generate the
    JSON file. You can easily change the export format (XML or CSV, for
    example) or the storage backend (FTP or `Amazon S3`_, for example). You
    can also write an :ref:`item pipeline <topics-item-pipeline>` to store the
    items in a database.
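
    For example, switching the export to CSV or XML is just a matter of
    changing the output file extension in the command above, since the feed
    format is picked from the extension (illustrative invocations)::

        scrapy runspider quotes_spider.py -o quotes.csv
        scrapy runspider quotes_spider.py -o quotes.xml
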
.. _topics-whatelse:

What else?
==========

You've seen how to extract and store items from a website using Scrapy, but
this is just the surface. Scrapy provides a lot of powerful features for making
scraping easy and efficient, such as:

* Built-in support for :ref:`selecting and extracting <topics-selectors>` data
  from HTML/XML sources using extended CSS selectors and XPath expressions,
  with helper methods to extract using regular expressions.

* An :ref:`interactive shell console <topics-shell>` (IPython aware) for trying
  out the CSS and XPath expressions to scrape data, very useful when writing or
  debugging your spiders (see the short shell session sketched after this list).

* Built-in support for :ref:`generating feed exports <topics-feed-exports>` in
  multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP,
  S3, local filesystem).

* Robust encoding support and auto-detection, for dealing with foreign,
  non-standard and broken encoding declarations.

* :ref:`Strong extensibility support <extending-scrapy>`, allowing you to plug
  in your own functionality using :ref:`signals <topics-signals>` and a
  well-defined API (middlewares, :ref:`extensions <topics-extensions>`, and
  :ref:`pipelines <topics-item-pipeline>`).

* Wide range of built-in extensions and middlewares for handling:

  - cookies and session handling
  - HTTP features like compression, authentication, caching
  - user-agent spoofing
  - robots.txt
  - crawl depth restriction
  - and more

* A :ref:`Telnet console <topics-telnetconsole>` for hooking into a Python
  console running inside your Scrapy process, to introspect and debug your
  crawler.

* Plus other goodies like reusable spiders to crawl sites from `Sitemaps`_ and
  XML/CSV feeds, a media pipeline for :ref:`automatically downloading images
  <topics-media-pipeline>` (or any other media) associated with the scraped
  items, a caching DNS resolver, and much more!
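
To get a taste of that interactive shell, a quick session against the example
site might look like this (output abridged and illustrative; the exact text
depends on the live page)::

    $ scrapy shell 'http://quotes.toscrape.com/tag/humor/'
    ...
    >>> response.css('div.quote span.text::text').extract_first()
    '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'
    >>> response.xpath('//div[@class="quote"]/span/small/text()').extract_first()
    'Jane Austen'
    >>> fetch('http://quotes.toscrape.com/page/2/')  # load another page into the same session
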
What's next?
============

The next steps for you are to :ref:`install Scrapy <intro-install>`,
:ref:`follow through the tutorial <intro-tutorial>` to learn how to create
a full-blown Scrapy project and `join the community`_. Thanks for your
interest!

.. _join the community: http://scrapy.org/community/
.. _web scraping: https://en.wikipedia.org/wiki/Web_scraping
.. _Amazon Associates Web Services: https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
.. _Amazon S3: https://aws.amazon.com/s3/
.. _Sitemaps: http://www.sitemaps.org