
addressing comments from the review plus further editing

Elias Dorneles 2015-03-26 14:26:20 -03:00
parent 8f4a268f37
commit 76e3bf1250
2 changed files with 56 additions and 29 deletions


@@ -39,15 +39,11 @@ voted questions on StackOverflow and scrapes some data from each page::
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        title = response.css('h1 a::text').extract_first()
        votes = response.css('.question .vote-count-post::text').extract_first()
        tags = response.css('.question .post-tag::text').extract()
        body = response.css('.question .post-text').extract_first()
        yield {
            'title': title,
            'votes': votes,
            'body': body,
            'tags': tags,
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
@@ -66,28 +62,36 @@ title, link, number of upvotes, a list of the tags and the question content in H
What just happened?
-------------------
When you ran the command ``scrapy runspider somefile.py``, Scrapy looked
for a Spider definition inside it and ran it through its crawler engine.
When you ran the command ``scrapy runspider somefile.py``, Scrapy looked for a
Spider definition inside it and ran it through its crawler engine.
The crawl started by making requests to the URLs defined in the ``start_urls``
attribute (in this case, only the URL for the StackOverflow top questions page),
and then called the default callback method ``parse`` passing the response
object as an argument.
and called the default callback method ``parse``, passing the response object as
an argument. In the ``parse`` callback, we extract the links to the
question pages using a CSS Selector with a custom extension that lets us get
the value of an attribute. Then, we yield a few more requests to be sent,
registering the method ``parse_question`` as the callback to be called for each
of them as they finish.
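A minimal sketch of what such a ``parse`` callback can look like (the exact
CSS selector here is just an example; ``response.urljoin`` builds an absolute
URL from the extracted link)::

    def parse(self, response):
        # the ::attr(href) extension extracts the value of the href attribute
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            # register parse_question as the callback for the new request
            yield scrapy.Request(full_url, callback=self.parse_question)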
Here you notice one of the main advantages of Scrapy: requests are
scheduled and processed asynchronously. This means that Scrapy doesn't
need to wait for a request to be finished and processed; it can send
another request or do other things in the meantime, which results in much
faster crawls.
:ref:`scheduled and processed asynchronously <topics-architecture>`. This
means that Scrapy doesn't need to wait for a request to be finished and
processed; it can send another request or do other things in the meantime. This
also means that other requests can keep going even if a request fails or an
error happens while handling it.
So, in the ``parse`` callback, we scrape the links to the questions and
yield a few more requests, registering the method ``parse_question`` as the
callback to be called for each of them as they finish.
While this enables you to do very fast crawls, sending multiple concurrent
requests at the same time in a fault-tolerant way, Scrapy also gives you
control over the politeness of the crawl through :ref:`a few settings
<topics-settings-ref>`. You can do things like setting a download delay between
each request, limiting the amount of concurrent requests per domain or per IP,
and even :ref:`using an auto-throttling extension <topics-autothrottle>` that
tries to figure these out automatically.
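For example, a few of the relevant settings could look like this in a
project's ``settings.py`` (the values below are purely illustrative)::

    DOWNLOAD_DELAY = 2                   # seconds to wait between requests to the same site
    CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap on parallel requests per domain
    CONCURRENT_REQUESTS_PER_IP = 0       # 0 means the per-domain limit applies instead
    AUTOTHROTTLE_ENABLED = True          # let the AutoThrottle extension adjust delays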
Finally, the ``parse_question`` callback scrapes the question data
for each page yielding a dict, which Scrapy then collects and
writes to a JSON file as requested in the command line.
Finally, the ``parse_question`` callback scrapes the question data for each
page, yielding a dict, which Scrapy then collects and writes to a JSON file as
requested in the command line.
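For instance, assuming the spider lives in ``somefile.py``, the whole run
could be kicked off with something along these lines (the output file name is
arbitrary)::

    scrapy runspider somefile.py -o top-questions.json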
.. note::
@@ -96,6 +100,25 @@ writes to a JSON file as requested in the command line.
storage backend (FTP or `Amazon S3`_, for example). You can also write an
:ref:`item pipeline <topics-item-pipeline>` to store the items in a database.
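As a rough sketch of the latter (the ``QuestionsPipeline`` name and the SQLite
schema are made up, and the pipeline would still have to be enabled through
the ``ITEM_PIPELINES`` setting)::

    import sqlite3

    class QuestionsPipeline(object):

        def open_spider(self, spider):
            # called when the spider is opened: set up the database
            self.db = sqlite3.connect('questions.db')
            self.db.execute('CREATE TABLE IF NOT EXISTS questions '
                            '(title TEXT, votes TEXT, link TEXT)')

        def close_spider(self, spider):
            # called when the spider is closed: persist and clean up
            self.db.commit()
            self.db.close()

        def process_item(self, item, spider):
            # called for every item the spider yields
            self.db.execute('INSERT INTO questions VALUES (?, ?, ?)',
                            (item['title'], item['votes'], item['link']))
            return item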
The data in the file will look like this (note: formatted for easier reading)::
[{
"body": "... LONG HTML HERE ...",
"link": "http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array",
"tags": ["java", "c++", "performance", "optimization"],
"title": "Why is processing a sorted array faster than an unsorted array?",
"votes": "9924"
},
{
"body": "... LONG HTML HERE ...",
"link": "http://stackoverflow.com/questions/1260748/how-do-i-remove-a-git-submodule",
"tags": ["git", "git-submodules"],
"title": "How do I remove a Git submodule?",
"votes": "1764"
},
...]
.. _topics-whatelse:
@@ -106,6 +129,10 @@ You've seen how to extract and store items from a website using Scrapy, but
this is just the surface. Scrapy provides a lot of powerful features for making
scraping easy and efficient, such as:
* Built-in support for :ref:`selecting and extracting <topics-selectors>` data
from HTML/XML sources using extended CSS selectors and XPath expressions,
with helper methods to extract using regular expressions (a small example
follows this list).
* An :ref:`interactive shell console <topics-shell>` (IPython aware) for trying
out the CSS and XPath expressions to scrape data, very useful when writing or
debugging your spiders.
@@ -126,12 +153,10 @@ scraping easy and efficient, such as:
console running inside your Scrapy process, to introspect and debug your
crawler
* A caching DNS resolver
* Support for crawling based on URLs discovered through `Sitemaps`_
* A media pipeline for :ref:`automatically downloading images <topics-images>`
(or any other media) associated with the scraped items
* Plus other goodies like reusable spiders to crawl sites from `Sitemaps`_ and
XML/CSV feeds, a media pipeline for :ref:`automatically downloading images <topics-images>`
(or any other media) associated with the scraped items, a caching DNS resolver,
and much more!
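As a tiny illustration of the selector helpers mentioned in the first bullet
above (the HTML snippet and the regular expression are made up)::

    from scrapy import Selector

    sel = Selector(text='<h1><a href="/q/1">Why is the sky blue? [closed]</a></h1>')
    # ::text selects the link text; .re() keeps only what the regular
    # expression captures
    title = sel.css('h1 a::text').re(r'^(.*?)\s*\[')
    # title == [u'Why is the sky blue?']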
What's next?
============


@@ -1,3 +1,5 @@
.. _topics-autothrottle:
======================
AutoThrottle extension
======================