From 0fc73a9d558158f1686f9cc9c289fe364b5df536 Mon Sep 17 00:00:00 2001
From: Mikhail Korobov
Date: Fri, 16 Dec 2016 21:47:58 +0500
Subject: [PATCH] DOC update examples with longer logger names

---
 docs/intro/tutorial.rst               | 24 +++----
 docs/topics/benchmarking.rst          | 92 +++++++++++++++++----------
 docs/topics/downloader-middleware.rst |  8 +--
 docs/topics/settings.rst              |  2 +-
 docs/topics/shell.rst                 | 10 +--
 5 files changed, 81 insertions(+), 55 deletions(-)

diff --git a/docs/intro/tutorial.rst b/docs/intro/tutorial.rst
index 0941eb1e5..8e14d1b7c 100644
--- a/docs/intro/tutorial.rst
+++ b/docs/intro/tutorial.rst
@@ -130,15 +130,15 @@ will send some requests for the ``quotes.toscrape.com`` domain. You will get an
 similar to this::

     ... (omitted for brevity)
-    2016-09-20 14:48:00 [scrapy] INFO: Spider opened
-    2016-09-20 14:48:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
-    2016-09-20 14:48:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
-    2016-09-20 14:48:00 [scrapy] DEBUG: Crawled (404) (referer: None)
-    2016-09-20 14:48:00 [scrapy] DEBUG: Crawled (200) (referer: None)
-    2016-09-20 14:48:01 [quotes] DEBUG: Saved file quotes-1.html
-    2016-09-20 14:48:01 [scrapy] DEBUG: Crawled (200) (referer: None)
-    2016-09-20 14:48:01 [quotes] DEBUG: Saved file quotes-2.html
-    2016-09-20 14:48:01 [scrapy] INFO: Closing spider (finished)
+    2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
+    2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
+    2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
+    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) (referer: None)
+    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
+    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
+    2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
+    2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
+    2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
     ...

 Now, check the files in the current directory. You should notice that two new
@@ -212,7 +212,7 @@ using the shell :ref:`Scrapy shell <topics-shell>`. Run::
 You will see something like::

     [ ... Scrapy log here ... ]
-    2016-09-19 12:09:27 [scrapy] DEBUG: Crawled (200) (referer: None)
+    2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
     [s] Available Scrapy objects:
     [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
     [s]   crawler
@@ -429,9 +429,9 @@ in the callback, as you can see below::
 If you run this spider, it will output the extracted data with the log::

-    2016-09-19 18:57:19 [scrapy] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
+    2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
     {'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
-    2016-09-19 18:57:19 [scrapy] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
+    2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
     {'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
diff --git a/docs/topics/benchmarking.rst b/docs/topics/benchmarking.rst
index 632190067..99469ebf1 100644
--- a/docs/topics/benchmarking.rst
+++ b/docs/topics/benchmarking.rst
@@ -18,40 +18,66 @@ To run it use::
 You should see an output like this::

-    2013-05-16 13:08:46-0300 [scrapy] INFO: Scrapy 0.17.0 started (bot: scrapybot)
-    2013-05-16 13:08:47-0300 [scrapy] INFO: Spider opened
-    2013-05-16 13:08:47-0300 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
-    2013-05-16 13:08:48-0300 [scrapy] INFO: Crawled 74 pages (at 4440 pages/min), scraped 0 items (at 0 items/min)
-    2013-05-16 13:08:49-0300 [scrapy] INFO: Crawled 143 pages (at 4140 pages/min), scraped 0 items (at 0 items/min)
-    2013-05-16 13:08:50-0300 [scrapy] INFO: Crawled 210 pages (at 4020 pages/min), scraped 0 items (at 0 items/min)
-    2013-05-16 13:08:51-0300 [scrapy] INFO: Crawled 274 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
-    2013-05-16 13:08:52-0300 [scrapy] INFO: Crawled 343 pages (at 4140 pages/min), scraped 0 items (at 0 items/min)
-    2013-05-16 13:08:53-0300 [scrapy] INFO: Crawled 410 pages (at 4020 pages/min), scraped 0 items (at 0 items/min)
-    2013-05-16 13:08:54-0300 [scrapy] INFO: Crawled 474 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
-    2013-05-16 13:08:55-0300 [scrapy] INFO: Crawled 538 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
-    2013-05-16 13:08:56-0300 [scrapy] INFO: Crawled 602 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
-    2013-05-16 13:08:57-0300 [scrapy] INFO: Closing spider (closespider_timeout)
-    2013-05-16 13:08:57-0300 [scrapy] INFO: Crawled 666 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
-    2013-05-16 13:08:57-0300 [scrapy] INFO: Dumping Scrapy stats:
-    {'downloader/request_bytes': 231508,
-     'downloader/request_count': 682,
-     'downloader/request_method_count/GET': 682,
-     'downloader/response_bytes': 1172802,
-     'downloader/response_count': 682,
-     'downloader/response_status_count/200': 682,
-     'finish_reason': 'closespider_timeout',
-     'finish_time': datetime.datetime(2013, 5, 16, 16, 8, 57, 985539),
-     'log_count/INFO': 14,
-     'request_depth_max': 34,
-     'response_received_count': 682,
-     'scheduler/dequeued': 682,
-     'scheduler/dequeued/memory': 682,
-     'scheduler/enqueued': 12767,
-     'scheduler/enqueued/memory': 12767,
-     'start_time': datetime.datetime(2013, 5, 16, 16, 8, 47, 676539)}
-    2013-05-16 13:08:57-0300 [scrapy] INFO: Spider closed (closespider_timeout)
+    2016-12-16 21:18:48 [scrapy.utils.log] INFO: Scrapy 1.2.2 started (bot: quotesbot)
+    2016-12-16 21:18:48 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['quotesbot.spiders'], 'LOGSTATS_INTERVAL': 1, 'BOT_NAME': 'quotesbot', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'quotesbot.spiders'}
+    2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled extensions:
+    ['scrapy.extensions.closespider.CloseSpider',
+     'scrapy.extensions.logstats.LogStats',
+     'scrapy.extensions.telnet.TelnetConsole',
+     'scrapy.extensions.corestats.CoreStats']
+    2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
+    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
+     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
+     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
+     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
+     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
+     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
+     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
+     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
+     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
+     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
+     'scrapy.downloadermiddlewares.stats.DownloaderStats']
+    2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled spider middlewares:
+    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
+     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
+     'scrapy.spidermiddlewares.referer.RefererMiddleware',
+     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
+     'scrapy.spidermiddlewares.depth.DepthMiddleware']
+    2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled item pipelines:
+    []
+    2016-12-16 21:18:49 [scrapy.core.engine] INFO: Spider opened
+    2016-12-16 21:18:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
+    2016-12-16 21:18:50 [scrapy.extensions.logstats] INFO: Crawled 70 pages (at 4200 pages/min), scraped 0 items (at 0 items/min)
+    2016-12-16 21:18:51 [scrapy.extensions.logstats] INFO: Crawled 134 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
+    2016-12-16 21:18:52 [scrapy.extensions.logstats] INFO: Crawled 198 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
+    2016-12-16 21:18:53 [scrapy.extensions.logstats] INFO: Crawled 254 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
+    2016-12-16 21:18:54 [scrapy.extensions.logstats] INFO: Crawled 302 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
+    2016-12-16 21:18:55 [scrapy.extensions.logstats] INFO: Crawled 358 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
+    2016-12-16 21:18:56 [scrapy.extensions.logstats] INFO: Crawled 406 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
+    2016-12-16 21:18:57 [scrapy.extensions.logstats] INFO: Crawled 438 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
+    2016-12-16 21:18:58 [scrapy.extensions.logstats] INFO: Crawled 470 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
+    2016-12-16 21:18:59 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
+    2016-12-16 21:18:59 [scrapy.extensions.logstats] INFO: Crawled 518 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
+    2016-12-16 21:19:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
+    {'downloader/request_bytes': 229995,
+     'downloader/request_count': 534,
+     'downloader/request_method_count/GET': 534,
+     'downloader/response_bytes': 1565504,
+     'downloader/response_count': 534,
+     'downloader/response_status_count/200': 534,
+     'finish_reason': 'closespider_timeout',
+     'finish_time': datetime.datetime(2016, 12, 16, 16, 19, 0, 647725),
+     'log_count/INFO': 17,
+     'request_depth_max': 19,
+     'response_received_count': 534,
+     'scheduler/dequeued': 533,
+     'scheduler/dequeued/memory': 533,
+     'scheduler/enqueued': 10661,
+     'scheduler/enqueued/memory': 10661,
+     'start_time': datetime.datetime(2016, 12, 16, 16, 18, 49, 799869)}
+    2016-12-16 21:19:00 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

-That tells you that Scrapy is able to crawl about 3900 pages per minute in the
+That tells you that Scrapy is able to crawl about 3000 pages per minute in the
 hardware where you run it. Note that this is a very simple spider intended to
 follow links, any custom spider you write will probably do more stuff which
 results in slower crawl rates. How slower depends on how much your spider does
diff --git a/docs/topics/downloader-middleware.rst b/docs/topics/downloader-middleware.rst
index 29d9b0298..3b9a5335a 100644
--- a/docs/topics/downloader-middleware.rst
+++ b/docs/topics/downloader-middleware.rst
@@ -238,14 +238,14 @@ header) and all cookies received in responses (ie. ``Set-Cookie`` header).
 Here's an example of a log with :setting:`COOKIES_DEBUG` enabled::

-        2011-04-06 14:35:10-0300 [scrapy] INFO: Spider opened
-        2011-04-06 14:35:10-0300 [scrapy] DEBUG: Sending cookies to:
+        2011-04-06 14:35:10-0300 [scrapy.core.engine] INFO: Spider opened
+        2011-04-06 14:35:10-0300 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to:
                 Cookie: clientlanguage_nl=en_EN
-        2011-04-06 14:35:14-0300 [scrapy] DEBUG: Received cookies from: <200 http://www.diningcity.com/netherlands/index.html>
+        2011-04-06 14:35:14-0300 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 http://www.diningcity.com/netherlands/index.html>
                 Set-Cookie: JSESSIONID=B~FA4DC0C496C8762AE4F1A620EAB34F38; Path=/
                 Set-Cookie: ip_isocode=US
                 Set-Cookie: clientlanguage_nl=en_EN; Expires=Thu, 07-Apr-2011 21:21:34 GMT; Path=/
-        2011-04-06 14:49:50-0300 [scrapy] DEBUG: Crawled (200) (referer: None)
+        2011-04-06 14:49:50-0300 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
         [...]
diff --git a/docs/topics/settings.rst b/docs/topics/settings.rst
index 503f4afb1..0515a9e0d 100644
--- a/docs/topics/settings.rst
+++ b/docs/topics/settings.rst
@@ -1037,7 +1037,7 @@ Stats counter (``scheduler/unserializable``) tracks the number of times this hap
 Example entry in logs::

-    1956-01-31 00:00:00+0800 [scrapy] ERROR: Unable to serialize request:
+    1956-01-31 00:00:00+0800 [scrapy.core.scheduler] ERROR: Unable to serialize request:
     - reason: cannot serialize
     (type Request)> - no more unserializable requests will be logged
     (see 'scheduler/unserializable' stats counter)
diff --git a/docs/topics/shell.rst b/docs/topics/shell.rst
index 322c3ddfa..da91108b2 100644
--- a/docs/topics/shell.rst
+++ b/docs/topics/shell.rst
@@ -173,7 +173,7 @@ all start with the ``[s]`` prefix)::
 After that, we can start playing with the objects::

     >>> response.xpath('//title/text()').extract_first()
-    u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
+    'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

     >>> fetch("http://reddit.com")
     [s] Available Scrapy objects:
@@ -189,7 +189,7 @@ After that, we can start playing with the objects::
     [s]   view(response)    View response in a browser

     >>> response.xpath('//title/text()').extract()
-    [u'reddit: the front page of the internet']
+    ['reddit: the front page of the internet']

     >>> request = request.replace(method="POST")
@@ -234,8 +234,8 @@ Here's an example of how you would call it from your spider::
 When you run the spider, you will get something similar to this::

-    2014-01-23 17:48:31-0400 [scrapy] DEBUG: Crawled (200) (referer: None)
-    2014-01-23 17:48:31-0400 [scrapy] DEBUG: Crawled (200) (referer: None)
+    2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
+    2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
     [s] Available Scrapy objects:
     [s]   crawler
     ...
@@ -258,7 +258,7 @@ Finally you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the
 crawling::

     >>> ^D
-    2014-01-23 17:50:03-0400 [scrapy] DEBUG: Crawled (200) (referer: None)
+    2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
     ...

 Note that you can't use the ``fetch`` shortcut here since the Scrapy engine is
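Not part of the patch above, just an illustrative aside: because each Scrapy component logs under its full module path, the longer names shown in the updated examples (``scrapy.core.engine``, ``scrapy.extensions.logstats``, ``scrapy.downloadermiddlewares.cookies``, ...) can be tuned individually with the standard ``logging`` module. A minimal sketch, assuming it is placed somewhere that runs inside a Scrapy project, such as ``settings.py``::

    import logging

    # Assumed example: silence the once-per-second "Crawled N pages" counters
    # emitted by the LogStats extension, leaving other components untouched.
    logging.getLogger('scrapy.extensions.logstats').setLevel(logging.WARNING)

    # Assumed example: turn down a whole subsystem at once via its parent
    # logger, here every downloader middleware (cookies, retry, redirect, ...).
    logging.getLogger('scrapy.downloadermiddlewares').setLevel(logging.ERROR)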