
addressing comments from the review plus further editing

Elias Dorneles 2015-03-26 14:26:20 -03:00
parent 8f4a268f37
commit 76e3bf1250
2 changed files with 56 additions and 29 deletions


@@ -39,15 +39,11 @@ voted questions on StackOverflow and scrapes some data from each page::
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        title = response.css('h1 a::text').extract_first()
        votes = response.css('.question .vote-count-post::text').extract_first()
        tags = response.css('.question .post-tag::text').extract()
        body = response.css('.question .post-text').extract_first()
        yield {
            'title': title,
            'votes': votes,
            'body': body,
            'tags': tags,
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
@@ -66,28 +62,36 @@ title, link, number of upvotes, a list of the tags and the question content in H
What just happened?
-------------------
When you ran the command ``scrapy runspider somefile.py``, Scrapy looked
for a Spider definition inside it and ran it through its crawler engine.
When you ran the command ``scrapy runspider somefile.py``, Scrapy looked for a
Spider definition inside it and ran it through its crawler engine.
The crawl started by making requests to the URLs defined in the ``start_urls``
attribute (in this case, only the URL for the StackOverflow top questions page),
and then called the default callback method ``parse`` passing the response
object as an argument.
and called the default callback method ``parse``, passing the response object as
an argument. In the ``parse`` callback, we extract the links to the
question pages using a CSS Selector with a custom extension that lets us get
the value of an attribute. Then, we yield a few more requests to be sent,
registering the method ``parse_question`` as the callback to be called for each
of them as they finish.
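A minimal sketch of what such a ``parse`` callback can look like (the exact
CSS selector here is just an example; ``response.urljoin`` builds an absolute
URL from the extracted link)::

    def parse(self, response):
        # the ::attr(href) extension extracts the value of the href attribute
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            # register parse_question as the callback for the new request
            yield scrapy.Request(full_url, callback=self.parse_question)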
Here you notice one of the main advantages of Scrapy: requests are
scheduled and processed asynchronously. This means that Scrapy doesn't
need to wait for a request to be finished and processed; it can send
another request or do other things in the meantime, which results in much
faster crawls.
:ref:`scheduled and processed asynchronously <topics-architecture>`. This
means that Scrapy doesn't need to wait for a request to be finished and
processed; it can send another request or do other things in the meantime. This
also means that other requests can keep going even if a request fails or an
error happens while handling it.
So, in the ``parse`` callback, we scrape the links to the questions and
yield a few more requests, registering the method ``parse_question`` as the
callback to be called for each of them as they finish.
While this enables you to do very fast crawls, sending multiple concurrent
requests at the same time in a fault-tolerant way, Scrapy also gives you
control over the politeness of the crawl through :ref:`a few settings
<topics-settings-ref>`. You can do things like setting a download delay between
each request, limiting the amount of concurrent requests per domain or per IP,
and even :ref:`using an auto-throttling extension <topics-autothrottle>` that
tries to figure these out automatically.
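For example, a few of the relevant settings could look like this in a
project's ``settings.py`` (the values below are purely illustrative)::

    DOWNLOAD_DELAY = 2                   # seconds to wait between requests to the same site
    CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap on parallel requests per domain
    CONCURRENT_REQUESTS_PER_IP = 0       # 0 means the per-domain limit applies instead
    AUTOTHROTTLE_ENABLED = True          # let the AutoThrottle extension adjust delays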
Finally, the ``parse_question`` callback scrapes the question data
for each page yielding a dict, which Scrapy then collects and
writes to a JSON file as requested in the command line.
Finally, the ``parse_question`` callback scrapes the question data for each
page, yielding a dict, which Scrapy then collects and writes to a JSON file as
requested in the command line.
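For instance, assuming the spider lives in ``somefile.py``, the whole run
could be kicked off with something along these lines (the output file name is
arbitrary)::

    scrapy runspider somefile.py -o top-questions.json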
.. note::
@@ -96,6 +100,25 @@ writes to a JSON file as requested in the command line.
storage backend (FTP or `Amazon S3`_, for example). You can also write an
:ref:`item pipeline <topics-item-pipeline>` to store the items in a database.
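As a rough sketch of the latter (the ``QuestionsPipeline`` name and the SQLite
schema are made up, and the pipeline would still have to be enabled through
the ``ITEM_PIPELINES`` setting)::

    import sqlite3

    class QuestionsPipeline(object):

        def open_spider(self, spider):
            # called when the spider is opened: set up the database
            self.db = sqlite3.connect('questions.db')
            self.db.execute('CREATE TABLE IF NOT EXISTS questions '
                            '(title TEXT, votes TEXT, link TEXT)')

        def close_spider(self, spider):
            # called when the spider is closed: persist and clean up
            self.db.commit()
            self.db.close()

        def process_item(self, item, spider):
            # called for every item the spider yields
            self.db.execute('INSERT INTO questions VALUES (?, ?, ?)',
                            (item['title'], item['votes'], item['link']))
            return item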
The data in the file will look like this (note: formatted for easier reading)::
[{
"body": "... LONG HTML HERE ...",
"link": "http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array",
"tags": ["java", "c++", "performance", "optimization"],
"title": "Why is processing a sorted array faster than an unsorted array?",
"votes": "9924"
},
{
"body": "... LONG HTML HERE ...",
"link": "http://stackoverflow.com/questions/1260748/how-do-i-remove-a-git-submodule",
"tags": ["git", "git-submodules"],
"title": "How do I remove a Git submodule?",
"votes": "1764"
},
...]
.. _topics-whatelse:
@@ -106,6 +129,10 @@ You've seen how to extract and store items from a website using Scrapy, but
this is just the surface. Scrapy provides a lot of powerful features for making
scraping easy and efficient, such as:
* Built-in support for :ref:`selecting and extracting <topics-selectors>` data
from HTML/XML sources using extended CSS selectors and XPath expressions,
with helper methods to extract using regular expressions (a small example
follows this list).
* An :ref:`interactive shell console <topics-shell>` (IPython aware) for trying
out the CSS and XPath expressions to scrape data, very useful when writing or
debugging your spiders.
@@ -126,12 +153,10 @@ scraping easy and efficient, such as:
console running inside your Scrapy process, to introspect and debug your
crawler
* A caching DNS resolver
* Support for crawling based on URLs discovered through `Sitemaps`_
* A media pipeline for :ref:`automatically downloading images <topics-images>`
(or any other media) associated with the scraped items
* Plus other goodies like reusable spiders to crawl sites from `Sitemaps`_ and
XML/CSV feeds, a media pipeline for :ref:`automatically downloading images <topics-images>`
(or any other media) associated with the scraped items, a caching DNS resolver,
and much more!
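As a tiny illustration of the selector helpers mentioned in the first bullet
above (the HTML snippet and the regular expression are made up)::

    from scrapy import Selector

    sel = Selector(text='<h1><a href="/q/1">Why is the sky blue? [closed]</a></h1>')
    # ::text selects the link text; .re() keeps only what the regular
    # expression captures
    title = sel.css('h1 a::text').re(r'^(.*?)\s*\[')
    # title == [u'Why is the sky blue?']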
What's next?
============


@@ -1,3 +1,5 @@
.. _topics-autothrottle:
======================
AutoThrottle extension
======================