Mirror of https://github.com/scrapy/scrapy.git (synced 2025-02-21 04:53:19 +00:00)

Merge branch 'master' into backward

Commit 5dc94db847
@@ -55,7 +55,7 @@ guidelines when you're going to report a new bug.

* search the `scrapy-users`_ list and `Scrapy subreddit`_ to see if it has
  been discussed there, or if you're not sure if what you're seeing is a bug.
-  You can also ask in the `#scrapy` IRC channel.
+  You can also ask in the ``#scrapy`` IRC channel.

* write **complete, reproducible, specific bug reports**. The smaller the test
  case, the better. Remember that other developers won't have your project to
@@ -4,7 +4,13 @@
Scrapy |version| documentation
==============================

-This documentation contains everything you need to know about Scrapy.
+Scrapy is a fast high-level `web crawling`_ and `web scraping`_ framework, used
+to crawl websites and extract structured data from their pages. It can be used
+for a wide range of purposes, from data mining to monitoring and automated
+testing.
+
+.. _web crawling: https://en.wikipedia.org/wiki/Web_crawler
+.. _web scraping: https://en.wikipedia.org/wiki/Web_scraping

Getting help
============
docs/news.rst (128 changed lines)
@@ -149,7 +149,7 @@ Documentation improvements
* improved links to beginner resources in the tutorial
  (:issue:`3367`, :issue:`3468`);
* fixed :setting:`RETRY_HTTP_CODES` default values in docs (:issue:`3335`);
-* remove unused `DEPTH_STATS` option from docs (:issue:`3245`);
+* remove unused ``DEPTH_STATS`` option from docs (:issue:`3245`);
* other cleanups (:issue:`3347`, :issue:`3350`, :issue:`3445`, :issue:`3544`,
  :issue:`3605`).

@@ -1313,7 +1313,7 @@ Module Relocations

There’s been a large rearrangement of modules trying to improve the general
structure of Scrapy. Main changes were separating various subpackages into
-new projects and dissolving both `scrapy.contrib` and `scrapy.contrib_exp`
+new projects and dissolving both ``scrapy.contrib`` and ``scrapy.contrib_exp``
into top level packages. Backward compatibility was kept among internal
relocations, while importing deprecated modules expect warnings indicating
their new place.

@@ -1344,7 +1344,7 @@ Outsourced packages
|                                     | /scrapy-plugins/scrapy-jsonrpc>`_   |
+-------------------------------------+-------------------------------------+

-`scrapy.contrib_exp` and `scrapy.contrib` dissolutions
+``scrapy.contrib_exp`` and ``scrapy.contrib`` dissolutions

+-------------------------------------+-------------------------------------+
| Old location                        | New location                        |

@@ -1556,7 +1556,7 @@ Code refactoring
  (:issue:`1078`)
- Pydispatch pep8 (:issue:`992`)
- Removed unused 'load=False' parameter from walk_modules() (:issue:`871`)
-- For consistency, use `job_dir` helper in `SpiderState` extension.
+- For consistency, use ``job_dir`` helper in ``SpiderState`` extension.
  (:issue:`805`)
- rename "sflo" local variables to less cryptic "log_observer" (:issue:`775`)
@@ -1669,10 +1669,10 @@ Enhancements
  cache middleware (:issue:`541`, :issue:`500`, :issue:`571`)
- Expose current crawler in Scrapy shell (:issue:`557`)
- Improve testsuite comparing CSV and XML exporters (:issue:`570`)
-- New `offsite/filtered` and `offsite/domains` stats (:issue:`566`)
+- New ``offsite/filtered`` and ``offsite/domains`` stats (:issue:`566`)
- Support process_links as generator in CrawlSpider (:issue:`555`)
- Verbose logging and new stats counters for DupeFilter (:issue:`553`)
-- Add a mimetype parameter to `MailSender.send()` (:issue:`602`)
+- Add a mimetype parameter to ``MailSender.send()`` (:issue:`602`)
- Generalize file pipeline log messages (:issue:`622`)
- Replace unencodeable codepoints with html entities in SGMLLinkExtractor (:issue:`565`)
- Converted SEP documents to rst format (:issue:`629`, :issue:`630`,

@@ -1691,20 +1691,20 @@ Enhancements
- Make scrapy.version_info a tuple of integers (:issue:`681`, :issue:`692`)
- Infer exporter's output format from filename extensions
  (:issue:`546`, :issue:`659`, :issue:`760`)
-- Support case-insensitive domains in `url_is_from_any_domain()` (:issue:`693`)
+- Support case-insensitive domains in ``url_is_from_any_domain()`` (:issue:`693`)
- Remove pep8 warnings in project and spider templates (:issue:`698`)
-- Tests and docs for `request_fingerprint` function (:issue:`597`)
-- Update SEP-19 for GSoC project `per-spider settings` (:issue:`705`)
+- Tests and docs for ``request_fingerprint`` function (:issue:`597`)
+- Update SEP-19 for GSoC project ``per-spider settings`` (:issue:`705`)
- Set exit code to non-zero when contracts fails (:issue:`727`)
- Add a setting to control what class is instanciated as Downloader component
  (:issue:`738`)
-- Pass response in `item_dropped` signal (:issue:`724`)
-- Improve `scrapy check` contracts command (:issue:`733`, :issue:`752`)
-- Document `spider.closed()` shortcut (:issue:`719`)
-- Document `request_scheduled` signal (:issue:`746`)
+- Pass response in ``item_dropped`` signal (:issue:`724`)
+- Improve ``scrapy check`` contracts command (:issue:`733`, :issue:`752`)
+- Document ``spider.closed()`` shortcut (:issue:`719`)
+- Document ``request_scheduled`` signal (:issue:`746`)
- Add a note about reporting security issues (:issue:`697`)
- Add LevelDB http cache storage backend (:issue:`626`, :issue:`500`)
-- Sort spider list output of `scrapy list` command (:issue:`742`)
+- Sort spider list output of ``scrapy list`` command (:issue:`742`)
- Multiple documentation enhancemens and fixes
  (:issue:`575`, :issue:`587`, :issue:`590`, :issue:`596`, :issue:`610`,
  :issue:`617`, :issue:`618`, :issue:`627`, :issue:`613`, :issue:`643`,
@@ -1773,22 +1773,22 @@ Enhancements
~~~~~~~~~~~~

- [**Backward incompatible**] Switched HTTPCacheMiddleware backend to filesystem (:issue:`541`)
-  To restore old backend set `HTTPCACHE_STORAGE` to `scrapy.contrib.httpcache.DbmCacheStorage`
+  To restore old backend set ``HTTPCACHE_STORAGE`` to ``scrapy.contrib.httpcache.DbmCacheStorage``
- Proxy \https:// urls using CONNECT method (:issue:`392`, :issue:`397`)
- Add a middleware to crawl ajax crawleable pages as defined by google (:issue:`343`)
- Rename scrapy.spider.BaseSpider to scrapy.spider.Spider (:issue:`510`, :issue:`519`)
- Selectors register EXSLT namespaces by default (:issue:`472`)
- Unify item loaders similar to selectors renaming (:issue:`461`)
-- Make `RFPDupeFilter` class easily subclassable (:issue:`533`)
+- Make ``RFPDupeFilter`` class easily subclassable (:issue:`533`)
- Improve test coverage and forthcoming Python 3 support (:issue:`525`)
- Promote startup info on settings and middleware to INFO level (:issue:`520`)
-- Support partials in `get_func_args` util (:issue:`506`, issue:`504`)
+- Support partials in ``get_func_args`` util (:issue:`506`, issue:`504`)
- Allow running indiviual tests via tox (:issue:`503`)
- Update extensions ignored by link extractors (:issue:`498`)
- Add middleware methods to get files/images/thumbs paths (:issue:`490`)
- Improve offsite middleware tests (:issue:`478`)
- Add a way to skip default Referer header set by RefererMiddleware (:issue:`475`)
-- Do not send `x-gzip` in default `Accept-Encoding` header (:issue:`469`)
+- Do not send ``x-gzip`` in default ``Accept-Encoding`` header (:issue:`469`)
- Support defining http error handling using settings (:issue:`466`)
- Use modern python idioms wherever you find legacies (:issue:`497`)
- Improve and correct documentation

@@ -1799,14 +1799,14 @@ Fixes
~~~~~

- Update Selector class imports in CrawlSpider template (:issue:`484`)
-- Fix unexistent reference to `engine.slots` (:issue:`464`)
-- Do not try to call `body_as_unicode()` on a non-TextResponse instance (:issue:`462`)
+- Fix unexistent reference to ``engine.slots`` (:issue:`464`)
+- Do not try to call ``body_as_unicode()`` on a non-TextResponse instance (:issue:`462`)
- Warn when subclassing XPathItemLoader, previously it only warned on
  instantiation. (:issue:`523`)
- Warn when subclassing XPathSelector, previously it only warned on
  instantiation. (:issue:`537`)
- Multiple fixes to memory stats (:issue:`531`, :issue:`530`, :issue:`529`)
-- Fix overriding url in `FormRequest.from_response()` (:issue:`507`)
+- Fix overriding url in ``FormRequest.from_response()`` (:issue:`507`)
- Fix tests runner under pip 1.5 (:issue:`513`)
- Fix logging error when spider name is unicode (:issue:`479`)
@@ -1833,7 +1833,7 @@ Enhancements
  (modifying them had been deprecated for a long time)
- :setting:`ITEM_PIPELINES` is now defined as a dict (instead of a list)
- Sitemap spider can fetch alternate URLs (:issue:`360`)
-- `Selector.remove_namespaces()` now remove namespaces from element's attributes. (:issue:`416`)
+- ``Selector.remove_namespaces()`` now remove namespaces from element's attributes. (:issue:`416`)
- Paved the road for Python 3.3+ (:issue:`435`, :issue:`436`, :issue:`431`, :issue:`452`)
- New item exporter using native python types with nesting support (:issue:`366`)
- Tune HTTP1.1 pool size so it matches concurrency defined by settings (:commit:`b43b5f575`)

@@ -1844,13 +1844,13 @@ Enhancements
- Mock server (used for tests) can listen for HTTPS requests (:issue:`410`)
- Remove multi spider support from multiple core components
  (:issue:`422`, :issue:`421`, :issue:`420`, :issue:`419`, :issue:`423`, :issue:`418`)
-- Travis-CI now tests Scrapy changes against development versions of `w3lib` and `queuelib` python packages.
+- Travis-CI now tests Scrapy changes against development versions of ``w3lib`` and ``queuelib`` python packages.
- Add pypy 2.1 to continuous integration tests (:commit:`ecfa7431`)
- Pylinted, pep8 and removed old-style exceptions from source (:issue:`430`, :issue:`432`)
- Use importlib for parametric imports (:issue:`445`)
- Handle a regression introduced in Python 2.7.5 that affects XmlItemExporter (:issue:`372`)
- Bugfix crawling shutdown on SIGINT (:issue:`450`)
-- Do not submit `reset` type inputs in FormRequest.from_response (:commit:`b326b87`)
+- Do not submit ``reset`` type inputs in FormRequest.from_response (:commit:`b326b87`)
- Do not silence download errors when request errback raises an exception (:commit:`684cfc0`)

Bugfixes

@@ -1865,8 +1865,8 @@ Bugfixes
- Improve request-response docs (:issue:`391`)
- Improve best practices docs (:issue:`399`, :issue:`400`, :issue:`401`, :issue:`402`)
- Improve django integration docs (:issue:`404`)
-- Document `bindaddress` request meta (:commit:`37c24e01d7`)
-- Improve `Request` class documentation (:issue:`226`)
+- Document ``bindaddress`` request meta (:commit:`37c24e01d7`)
+- Improve ``Request`` class documentation (:issue:`226`)

Other
~~~~~

@@ -1875,7 +1875,7 @@ Other
- Add `cssselect`_ python package as install dependency
- Drop libxml2 and multi selector's backend support, `lxml`_ is required from now on.
- Minimum Twisted version increased to 10.0.0, dropped Twisted 8.0 support.
-- Running test suite now requires `mock` python library (:issue:`390`)
+- Running test suite now requires ``mock`` python library (:issue:`390`)


Thanks
@@ -1929,7 +1929,7 @@ Scrapy 0.18.3 (released 2013-10-03)
Scrapy 0.18.2 (released 2013-09-03)
-----------------------------------

-- Backport `scrapy check` command fixes and backward compatible multi
+- Backport ``scrapy check`` command fixes and backward compatible multi
  crawler process(:issue:`339`)

Scrapy 0.18.1 (released 2013-08-27)

@@ -1958,31 +1958,31 @@ Scrapy 0.18.0 (released 2013-08-09)
- Handle GET parameters for AJAX crawleable urls (:commit:`3fe2a32`)
- Use lxml recover option to parse sitemaps (:issue:`347`)
- Bugfix cookie merging by hostname and not by netloc (:issue:`352`)
-- Support disabling `HttpCompressionMiddleware` using a flag setting (:issue:`359`)
-- Support xml namespaces using `iternodes` parser in `XMLFeedSpider` (:issue:`12`)
-- Support `dont_cache` request meta flag (:issue:`19`)
-- Bugfix `scrapy.utils.gz.gunzip` broken by changes in python 2.7.4 (:commit:`4dc76e`)
-- Bugfix url encoding on `SgmlLinkExtractor` (:issue:`24`)
-- Bugfix `TakeFirst` processor shouldn't discard zero (0) value (:issue:`59`)
+- Support disabling ``HttpCompressionMiddleware`` using a flag setting (:issue:`359`)
+- Support xml namespaces using ``iternodes`` parser in ``XMLFeedSpider`` (:issue:`12`)
+- Support ``dont_cache`` request meta flag (:issue:`19`)
+- Bugfix ``scrapy.utils.gz.gunzip`` broken by changes in python 2.7.4 (:commit:`4dc76e`)
+- Bugfix url encoding on ``SgmlLinkExtractor`` (:issue:`24`)
+- Bugfix ``TakeFirst`` processor shouldn't discard zero (0) value (:issue:`59`)
- Support nested items in xml exporter (:issue:`66`)
- Improve cookies handling performance (:issue:`77`)
- Log dupe filtered requests once (:issue:`105`)
- Split redirection middleware into status and meta based middlewares (:issue:`78`)
- Use HTTP1.1 as default downloader handler (:issue:`109` and :issue:`318`)
-- Support xpath form selection on `FormRequest.from_response` (:issue:`185`)
-- Bugfix unicode decoding error on `SgmlLinkExtractor` (:issue:`199`)
+- Support xpath form selection on ``FormRequest.from_response`` (:issue:`185`)
+- Bugfix unicode decoding error on ``SgmlLinkExtractor`` (:issue:`199`)
- Bugfix signal dispatching on pypi interpreter (:issue:`205`)
- Improve request delay and concurrency handling (:issue:`206`)
-- Add RFC2616 cache policy to `HttpCacheMiddleware` (:issue:`212`)
+- Add RFC2616 cache policy to ``HttpCacheMiddleware`` (:issue:`212`)
- Allow customization of messages logged by engine (:issue:`214`)
-- Multiples improvements to `DjangoItem` (:issue:`217`, :issue:`218`, :issue:`221`)
+- Multiples improvements to ``DjangoItem`` (:issue:`217`, :issue:`218`, :issue:`221`)
- Extend Scrapy commands using setuptools entry points (:issue:`260`)
-- Allow spider `allowed_domains` value to be set/tuple (:issue:`261`)
-- Support `settings.getdict` (:issue:`269`)
-- Simplify internal `scrapy.core.scraper` slot handling (:issue:`271`)
-- Added `Item.copy` (:issue:`290`)
+- Allow spider ``allowed_domains`` value to be set/tuple (:issue:`261`)
+- Support ``settings.getdict`` (:issue:`269`)
+- Simplify internal ``scrapy.core.scraper`` slot handling (:issue:`271`)
+- Added ``Item.copy`` (:issue:`290`)
- Collect idle downloader slots (:issue:`297`)
-- Add `ftp://` scheme downloader handler (:issue:`329`)
+- Add ``ftp://`` scheme downloader handler (:issue:`329`)
- Added downloader benchmark webserver and spider tools :ref:`benchmarking`
- Moved persistent (on disk) queues to a separate project (queuelib_) which scrapy now depends on
- Add scrapy commands using external libraries (:issue:`260`)

@@ -2113,7 +2113,7 @@ Scrapy changes:
- dropped Signals singleton. Signals should now be accesed through the Crawler.signals attribute. See the signals documentation for more info.
- dropped Stats Collector singleton. Stats can now be accessed through the Crawler.stats attribute. See the stats collection documentation for more info.
- documented :ref:`topics-api`
-- `lxml` is now the default selectors backend instead of `libxml2`
+- ``lxml`` is now the default selectors backend instead of ``libxml2``
- ported FormRequest.from_response() to use `lxml`_ instead of `ClientForm`_
- removed modules: ``scrapy.xlib.BeautifulSoup`` and ``scrapy.xlib.ClientForm``
- SitemapSpider: added support for sitemap urls ending in .xml and .xml.gz, even if they advertise a wrong content type (:commit:`10ed28b`)
@@ -2206,16 +2206,16 @@ New features and settings
- New ``ChunkedTransferMiddleware`` (enabled by default) to support `chunked transfer encoding`_ (:rev:`2769`)
- Add boto 2.0 support for S3 downloader handler (:rev:`2763`)
- Added `marshal`_ to formats supported by feed exports (:rev:`2744`)
-- In request errbacks, offending requests are now received in `failure.request` attribute (:rev:`2738`)
+- In request errbacks, offending requests are now received in ``failure.request`` attribute (:rev:`2738`)
- Big downloader refactoring to support per domain/ip concurrency limits (:rev:`2732`)
  - ``CONCURRENT_REQUESTS_PER_SPIDER`` setting has been deprecated and replaced by:
    - :setting:`CONCURRENT_REQUESTS`, :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`, :setting:`CONCURRENT_REQUESTS_PER_IP`
  - check the documentation for more details
- Added builtin caching DNS resolver (:rev:`2728`)
- Moved Amazon AWS-related components/extensions (SQS spider queue, SimpleDB stats collector) to a separate project: [scaws](https://github.com/scrapinghub/scaws) (:rev:`2706`, :rev:`2714`)
-- Moved spider queues to scrapyd: `scrapy.spiderqueue` -> `scrapyd.spiderqueue` (:rev:`2708`)
-- Moved sqlite utils to scrapyd: `scrapy.utils.sqlite` -> `scrapyd.sqlite` (:rev:`2781`)
-- Real support for returning iterators on `start_requests()` method. The iterator is now consumed during the crawl when the spider is getting idle (:rev:`2704`)
+- Moved spider queues to scrapyd: ``scrapy.spiderqueue`` -> ``scrapyd.spiderqueue`` (:rev:`2708`)
+- Moved sqlite utils to scrapyd: ``scrapy.utils.sqlite`` -> ``scrapyd.sqlite`` (:rev:`2781`)
+- Real support for returning iterators on ``start_requests()`` method. The iterator is now consumed during the crawl when the spider is getting idle (:rev:`2704`)
- Added :setting:`REDIRECT_ENABLED` setting to quickly enable/disable the redirect middleware (:rev:`2697`)
- Added :setting:`RETRY_ENABLED` setting to quickly enable/disable the retry middleware (:rev:`2694`)
- Added ``CloseSpider`` exception to manually close spiders (:rev:`2691`)

@@ -2223,19 +2223,19 @@ New features and settings
- Refactored close spider behavior to wait for all downloads to finish and be processed by spiders, before closing the spider (:rev:`2688`)
- Added ``SitemapSpider`` (see documentation in Spiders page) (:rev:`2658`)
- Added ``LogStats`` extension for periodically logging basic stats (like crawled pages and scraped items) (:rev:`2657`)
-- Make handling of gzipped responses more robust (#319, :rev:`2643`). Now Scrapy will try and decompress as much as possible from a gzipped response, instead of failing with an `IOError`.
+- Make handling of gzipped responses more robust (#319, :rev:`2643`). Now Scrapy will try and decompress as much as possible from a gzipped response, instead of failing with an ``IOError``.
- Simplified !MemoryDebugger extension to use stats for dumping memory debugging info (:rev:`2639`)
-- Added new command to edit spiders: ``scrapy edit`` (:rev:`2636`) and `-e` flag to `genspider` command that uses it (:rev:`2653`)
+- Added new command to edit spiders: ``scrapy edit`` (:rev:`2636`) and ``-e`` flag to ``genspider`` command that uses it (:rev:`2653`)
- Changed default representation of items to pretty-printed dicts. (:rev:`2631`). This improves default logging by making log more readable in the default case, for both Scraped and Dropped lines.
- Added :signal:`spider_error` signal (:rev:`2628`)
- Added :setting:`COOKIES_ENABLED` setting (:rev:`2625`)
-- Stats are now dumped to Scrapy log (default value of :setting:`STATS_DUMP` setting has been changed to `True`). This is to make Scrapy users more aware of Scrapy stats and the data that is collected there.
+- Stats are now dumped to Scrapy log (default value of :setting:`STATS_DUMP` setting has been changed to ``True``). This is to make Scrapy users more aware of Scrapy stats and the data that is collected there.
- Added support for dynamically adjusting download delay and maximum concurrent requests (:rev:`2599`)
- Added new DBM HTTP cache storage backend (:rev:`2576`)
- Added ``listjobs.json`` API to Scrapyd (:rev:`2571`)
- ``CsvItemExporter``: added ``join_multivalued`` parameter (:rev:`2578`)
- Added namespace support to ``xmliter_lxml`` (:rev:`2552`)
-- Improved cookies middleware by making `COOKIES_DEBUG` nicer and documenting it (:rev:`2579`)
+- Improved cookies middleware by making ``COOKIES_DEBUG`` nicer and documenting it (:rev:`2579`)
- Several improvements to Scrapyd and Link extractors

Code rearranged and removed

@@ -2249,11 +2249,11 @@ Code rearranged and removed
- Reduced Scrapy codebase by striping part of Scrapy code into two new libraries:
  - `w3lib`_ (several functions from ``scrapy.utils.{http,markup,multipart,response,url}``, done in :rev:`2584`)
  - `scrapely`_ (was ``scrapy.contrib.ibl``, done in :rev:`2586`)
-- Removed unused function: `scrapy.utils.request.request_info()` (:rev:`2577`)
-- Removed googledir project from `examples/googledir`. There's now a new example project called `dirbot` available on github: https://github.com/scrapy/dirbot
+- Removed unused function: ``scrapy.utils.request.request_info()`` (:rev:`2577`)
+- Removed googledir project from ``examples/googledir``. There's now a new example project called ``dirbot`` available on github: https://github.com/scrapy/dirbot
- Removed support for default field values in Scrapy items (:rev:`2616`)
- Removed experimental crawlspider v2 (:rev:`2632`)
-- Removed scheduler middleware to simplify architecture. Duplicates filter is now done in the scheduler itself, using the same dupe fltering class as before (`DUPEFILTER_CLASS` setting) (:rev:`2640`)
+- Removed scheduler middleware to simplify architecture. Duplicates filter is now done in the scheduler itself, using the same dupe fltering class as before (``DUPEFILTER_CLASS`` setting) (:rev:`2640`)
- Removed support for passing urls to ``scrapy crawl`` command (use ``scrapy parse`` instead) (:rev:`2704`)
- Removed deprecated Execution Queue (:rev:`2704`)
- Removed (undocumented) spider context extension (from scrapy.contrib.spidercontext) (:rev:`2780`)

@@ -2289,13 +2289,13 @@ Scrapyd changes
- Scrapyd now uses one process per spider
- It stores one log file per spider run, and rotate them keeping the lastest 5 logs per spider (by default)
- A minimal web ui was added, available at http://localhost:6800 by default
-- There is now a `scrapy server` command to start a Scrapyd server of the current project
+- There is now a ``scrapy server`` command to start a Scrapyd server of the current project

Changes to settings
~~~~~~~~~~~~~~~~~~~

-- added `HTTPCACHE_ENABLED` setting (False by default) to enable HTTP cache middleware
-- changed `HTTPCACHE_EXPIRATION_SECS` semantics: now zero means "never expire".
+- added ``HTTPCACHE_ENABLED`` setting (False by default) to enable HTTP cache middleware
+- changed ``HTTPCACHE_EXPIRATION_SECS`` semantics: now zero means "never expire".

Deprecated/obsoleted functionality
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -2326,17 +2326,17 @@ New features and improvements
- Splitted Debian package into two packages - the library and the service (#187)
- Scrapy log refactoring (#188)
- New extension for keeping persistent spider contexts among different runs (#203)
-- Added `dont_redirect` request.meta key for avoiding redirects (#233)
-- Added `dont_retry` request.meta key for avoiding retries (#234)
+- Added ``dont_redirect`` request.meta key for avoiding redirects (#233)
+- Added ``dont_retry`` request.meta key for avoiding retries (#234)

Command-line tool changes
~~~~~~~~~~~~~~~~~~~~~~~~~

-- New `scrapy` command which replaces the old `scrapy-ctl.py` (#199)
-  - there is only one global `scrapy` command now, instead of one `scrapy-ctl.py` per project
-  - Added `scrapy.bat` script for running more conveniently from Windows
+- New ``scrapy`` command which replaces the old ``scrapy-ctl.py`` (#199)
+  - there is only one global ``scrapy`` command now, instead of one ``scrapy-ctl.py`` per project
+  - Added ``scrapy.bat`` script for running more conveniently from Windows
- Added bash completion to command-line tool (#210)
-- Renamed command `start` to `runserver` (#209)
+- Renamed command ``start`` to ``runserver`` (#209)

API changes
~~~~~~~~~~~
@@ -94,7 +94,7 @@ how you :ref:`configure the downloader middlewares
.. method:: crawl(\*args, \**kwargs)

    Starts the crawler by instantiating its spider class with the given
-    `args` and `kwargs` arguments, while setting the execution engine in
+    ``args`` and ``kwargs`` arguments, while setting the execution engine in
    motion.

    Returns a deferred that is fired when the crawl is finished.
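For context, ``Crawler.crawl()`` is usually driven from a small script like the sketch below; ``MySpider``, its import path and the ``category`` argument are illustrative placeholders, not part of the diff::

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.utils.project import get_project_settings

    from myproject.spiders import MySpider  # hypothetical project spider

    crawler = Crawler(MySpider, get_project_settings())
    d = crawler.crawl(category='books')   # args/kwargs are passed to MySpider.__init__
    d.addBoth(lambda _: reactor.stop())   # stop the reactor once the returned deferred fires
    reactor.run()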
@@ -180,7 +180,7 @@ SpiderLoader API
.. method:: load(spider_name)

    Get the Spider class with the given name. It'll look into the previously
-    loaded spiders for a spider class with name `spider_name` and will raise
+    loaded spiders for a spider class with name ``spider_name`` and will raise
    a KeyError if not found.

    :param spider_name: spider class name
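For orientation, a small sketch of how ``load()`` is reached through the spider loader; the ``'quotes'`` spider name is a placeholder::

    from scrapy.spiderloader import SpiderLoader
    from scrapy.utils.project import get_project_settings

    loader = SpiderLoader.from_settings(get_project_settings())
    spider_cls = loader.load('quotes')   # raises KeyError if no spider has that name
    print(loader.list())                 # names of all spiders found in the project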
@@ -172,5 +172,5 @@ links:

.. _Twisted: https://twistedmatrix.com/trac/
.. _Introduction to Deferreds in Twisted: https://twistedmatrix.com/documents/current/core/howto/defer-intro.html
-.. _Twisted - hello, asynchronous programming: http://jessenoller.com/2009/02/11/twisted-hello-asynchronous-programming/
+.. _Twisted - hello, asynchronous programming: http://jessenoller.com/blog/2009/02/11/twisted-hello-asynchronous-programming/
.. _Twisted Introduction - Krondo: http://krondo.com/an-introduction-to-asynchronous-programming-and-twisted/
@@ -233,7 +233,7 @@ also request each page to get every quote on the site::
        name = 'quote'
        allowed_domains = ['quotes.toscrape.com']
        page = 1
-        start_urls = ['http://quotes.toscrape.com/api/quotes?page=1]
+        start_urls = ['http://quotes.toscrape.com/api/quotes?page=1']

        def parse(self, response):
            data = json.loads(response.text)
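The hunk above only restores a missing closing quote. For readers skimming the diff, a self-contained sketch of the same paginated spider follows; the JSON field names (``quotes``, ``has_next``, ``author``) are assumptions about the quotes API response, not taken from the diff::

    import json

    import scrapy


    class QuoteSpider(scrapy.Spider):
        name = 'quote'
        allowed_domains = ['quotes.toscrape.com']
        page = 1
        start_urls = ['http://quotes.toscrape.com/api/quotes?page=1']

        def parse(self, response):
            data = json.loads(response.text)
            for quote in data.get('quotes', []):
                yield {'text': quote['text'], 'author': quote['author']['name']}
            if data.get('has_next'):
                self.page += 1
                yield scrapy.Request(
                    'http://quotes.toscrape.com/api/quotes?page=%d' % self.page,
                    callback=self.parse,
                )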
@@ -41,7 +41,7 @@ previous (or subsequent) middleware being applied.

If you want to disable a built-in middleware (the ones defined in
:setting:`DOWNLOADER_MIDDLEWARES_BASE` and enabled by default) you must define it
-in your project's :setting:`DOWNLOADER_MIDDLEWARES` setting and assign `None`
+in your project's :setting:`DOWNLOADER_MIDDLEWARES` setting and assign ``None``
as its value. For example, if you want to disable the user-agent middleware::

    DOWNLOADER_MIDDLEWARES = {
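The literal block is cut short by the hunk; assuming only the stock user-agent middleware path, the completed example would read roughly::

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    }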
@@ -357,7 +357,7 @@ HttpCacheMiddleware

.. reqmeta:: dont_cache

-You can also avoid caching a response on every policy using :reqmeta:`dont_cache` meta key equals `True`.
+You can also avoid caching a response on every policy using :reqmeta:`dont_cache` meta key equals ``True``.

.. _httpcache-policy-dummy:
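Inside a spider callback, using that meta key looks like the following sketch (URL and callback name are illustrative)::

    def parse(self, response):
        # ask the HTTP cache middleware to neither store nor serve this request
        yield scrapy.Request(
            'http://www.example.com/latest',
            meta={'dont_cache': True},
            callback=self.parse_latest,
        )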
@@ -390,17 +390,17 @@ runs to avoid downloading unmodified data (to save bandwidth and speed up crawls

what is implemented:

-* Do not attempt to store responses/requests with `no-store` cache-control directive set
-* Do not serve responses from cache if `no-cache` cache-control directive is set even for fresh responses
-* Compute freshness lifetime from `max-age` cache-control directive
-* Compute freshness lifetime from `Expires` response header
-* Compute freshness lifetime from `Last-Modified` response header (heuristic used by Firefox)
-* Compute current age from `Age` response header
-* Compute current age from `Date` header
-* Revalidate stale responses based on `Last-Modified` response header
-* Revalidate stale responses based on `ETag` response header
-* Set `Date` header for any received response missing it
-* Support `max-stale` cache-control directive in requests
+* Do not attempt to store responses/requests with ``no-store`` cache-control directive set
+* Do not serve responses from cache if ``no-cache`` cache-control directive is set even for fresh responses
+* Compute freshness lifetime from ``max-age`` cache-control directive
+* Compute freshness lifetime from ``Expires`` response header
+* Compute freshness lifetime from ``Last-Modified`` response header (heuristic used by Firefox)
+* Compute current age from ``Age`` response header
+* Compute current age from ``Date`` header
+* Revalidate stale responses based on ``Last-Modified`` response header
+* Revalidate stale responses based on ``ETag`` response header
+* Set ``Date`` header for any received response missing it
+* Support ``max-stale`` cache-control directive in requests

This allows spiders to be configured with the full RFC2616 cache policy,
but avoid revalidation on a request-by-request basis, while remaining
@@ -408,15 +408,15 @@ what is implemented:

Example:

-Add `Cache-Control: max-stale=600` to Request headers to accept responses that
+Add ``Cache-Control: max-stale=600`` to Request headers to accept responses that
have exceeded their expiration time by no more than 600 seconds.

See also: RFC2616, 14.9.3

what is missing:

-* `Pragma: no-cache` support https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.1
-* `Vary` header support https://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.6
+* ``Pragma: no-cache`` support https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.1
+* ``Vary`` header support https://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.6
* Invalidation after updates or deletes https://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.10
* ... probably others ..
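One way to send that header on an individual request, sketched inside a spider callback with an illustrative URL and callback::

    yield scrapy.Request(
        'http://www.example.com/some/page',
        # accept cached copies up to 600 seconds past their expiration time
        headers={'Cache-Control': 'max-stale=600'},
        callback=self.parse_page,
    )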
@@ -626,12 +626,12 @@ Default: ``False``
If enabled, will cache pages unconditionally.

A spider may wish to have all responses available in the cache, for
-future use with `Cache-Control: max-stale`, for instance. The
+future use with ``Cache-Control: max-stale``, for instance. The
DummyPolicy caches all responses but never revalidates them, and
sometimes a more nuanced policy is desirable.

-This setting still respects `Cache-Control: no-store` directives in responses.
-If you don't want that, filter `no-store` out of the Cache-Control headers in
+This setting still respects ``Cache-Control: no-store`` directives in responses.
+If you don't want that, filter ``no-store`` out of the Cache-Control headers in
responses you feedto the cache middleware.

.. setting:: HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS
@@ -834,8 +834,6 @@ RetryMiddleware

Failed pages are collected on the scraping process and rescheduled at the
end, once the spider has finished crawling all regular (non failed) pages.
-Once there are no more failed pages to retry, this middleware sends a signal
-(retry_complete), so other extensions could connect to that signal.

The :class:`RetryMiddleware` can be configured through the following
settings (see the settings documentation for more info):
@@ -940,7 +938,7 @@ UserAgentMiddleware

Middleware that allows spiders to override the default user agent.

-In order for a spider to override the default user agent, its `user_agent`
+In order for a spider to override the default user agent, its ``user_agent``
attribute must be set.

.. _ajaxcrawl-middleware:
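A spider-level override might look like this sketch (the agent string is made up)::

    import scrapy


    class MySpider(scrapy.Spider):
        name = 'example'
        # used by UserAgentMiddleware instead of the USER_AGENT setting
        user_agent = 'my-crawler/1.0 (+http://www.example.com/bot)'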
@@ -303,7 +303,7 @@ CsvItemExporter

    The additional keyword arguments of this constructor are passed to the
    :class:`BaseItemExporter` constructor, and the leftover arguments to the
-    `csv.writer`_ constructor, so you can use any `csv.writer` constructor
+    `csv.writer`_ constructor, so you can use any ``csv.writer`` constructor
    argument to customize this exporter.

    A typical output of this exporter would be::
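A quick sketch of passing a ``csv.writer`` argument through, here ``delimiter``; the output file name and the item are arbitrary::

    from scrapy.exporters import CsvItemExporter

    f = open('items.csv', 'wb')
    exporter = CsvItemExporter(f, delimiter=';')   # ';' is forwarded to csv.writer
    exporter.start_exporting()
    exporter.export_item({'name': 'Color TV', 'price': '1200'})
    exporter.finish_exporting()
    f.close()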
@@ -19,7 +19,7 @@ settings, just like any other Scrapy code.
It is customary for extensions to prefix their settings with their own name, to
avoid collision with existing (and future) extensions. For example, a
hypothetic extension to handle `Google Sitemaps`_ would use settings like
-`GOOGLESITEMAP_ENABLED`, `GOOGLESITEMAP_DEPTH`, and so on.
+``GOOGLESITEMAP_ENABLED``, ``GOOGLESITEMAP_DEPTH``, and so on.

.. _Google Sitemaps: https://en.wikipedia.org/wiki/Sitemaps
@@ -368,7 +368,7 @@ Invokes a `Python debugger`_ inside a running Scrapy process when a `SIGUSR2`_
signal is received. After the debugger is exited, the Scrapy process continues
running normally.

-For more info see `Debugging in Python`.
+For more info see `Debugging in Python`_.

This extension only works on POSIX-compliant platforms (ie. not Windows).
@@ -71,7 +71,7 @@ on cookies.
Request serialization
---------------------

-Requests must be serializable by the `pickle` module, in order for persistence
+Requests must be serializable by the ``pickle`` module, in order for persistence
to work, so you should make sure that your requests are serializable.

The most common issue here is to use ``lambda`` functions on request callbacks that
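To illustrate the lambda pitfall mentioned above, compare the two request constructions below inside a spider callback; ``parse_page`` and the ``section`` value are made-up names::

    # breaks persistence: a lambda callback cannot be pickled with the request
    yield scrapy.Request(url, callback=lambda r: self.parse_page(r, section='news'))

    # persistence-friendly: pass the extra value through meta and use a plain method
    yield scrapy.Request(url, meta={'section': 'news'}, callback=self.parse_page)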
@@ -286,7 +286,7 @@ ItemLoader objects
    given, one is instantiated automatically using the class in
    :attr:`default_item_class`.

-    When instantiated with a `selector` or a `response` parameters
+    When instantiated with a ``selector`` or a ``response`` parameters
    the :class:`ItemLoader` class provides convenient mechanisms for extracting
    data from web pages using :ref:`selectors <topics-selectors>`.
@@ -243,7 +243,7 @@ scrapy.utils.log module
case, its usage is not required but it's recommended.

If you plan on configuring the handlers yourself is still recommended you
-call this function, passing `install_root_handler=False`. Bear in mind
+call this function, passing ``install_root_handler=False``. Bear in mind
there won't be any log output set by default in that case.

To get you started on manually configuring logging's output, you can use
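A minimal sketch of that manual setup; the file name and format string are arbitrary choices::

    import logging

    from scrapy.utils.log import configure_logging

    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename='log.txt',
        format='%(levelname)s: %(message)s',
        level=logging.INFO,
    )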
@@ -132,7 +132,7 @@ For example, the following image URL::

    http://www.example.com/image.jpg

-Whose `SHA1 hash` is::
+Whose ``SHA1 hash`` is::

    3afec3b4765f8f0a07b78f98c07b83f013567a0a
@@ -80,7 +80,7 @@ returned by the :meth:`CrawlerRunner.crawl
<scrapy.crawler.CrawlerRunner.crawl>` method.

Here's an example of its usage, along with a callback to manually stop the
-reactor after `MySpider` has finished running.
+reactor after ``MySpider`` has finished running.

::
@@ -50,7 +50,7 @@ Request objects
    :type meta: dict

    :param body: the request body. If a ``unicode`` is passed, then it's encoded to
-        ``str`` using the `encoding` passed (which defaults to ``utf-8``). If
+        ``str`` using the ``encoding`` passed (which defaults to ``utf-8``). If
        ``body`` is not given, an empty string is stored. Regardless of the
        type of this argument, the final value stored will be a ``str`` (never
        ``unicode`` or ``None``).
@@ -610,7 +610,7 @@ Response objects
.. attribute:: Response.flags

    A list that contains flags for this response. Flags are labels used for
-    tagging Responses. For example: `'cached'`, `'redirected`', etc. And
+    tagging Responses. For example: ``'cached'``, ``'redirected``', etc. And
    they're shown on the string representation of the Response (`__str__`
    method) which is used by the engine for logging.
@@ -682,7 +682,7 @@ TextResponse objects

    ``unicode(response.body)`` is not a correct way to convert response
    body to unicode: you would be using the system default encoding
-    (typically `ascii`) instead of the response encoding.
+    (typically ``ascii``) instead of the response encoding.


.. attribute:: TextResponse.encoding
@@ -690,7 +690,7 @@ TextResponse objects
    A string with the encoding of this response. The encoding is resolved by
    trying the following mechanisms, in order:

-    1. the encoding passed in the constructor `encoding` argument
+    1. the encoding passed in the constructor ``encoding`` argument

    2. the encoding declared in the Content-Type HTTP header. If this
       encoding is not valid (ie. unknown), it is ignored and the next
@@ -96,7 +96,7 @@ Constructing from response - :class:`~scrapy.http.HtmlResponse` is one of
Using selectors
---------------

-To explain how to use the selectors we'll use the `Scrapy shell` (which
+To explain how to use the selectors we'll use the ``Scrapy shell`` (which
provides interactive testing) and an example page located in the Scrapy
documentation server:
@@ -331,16 +331,16 @@ Default: ``0``

Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

-An integer that is used to adjust the request priority based on its depth:
+An integer that is used to adjust the :attr:`~scrapy.http.Request.priority` of
+a :class:`~scrapy.http.Request` based on its depth.

-- if zero (default), no priority adjustment is made from depth
-- **a positive value will decrease the priority, i.e. higher depth
-  requests will be processed later** ; this is commonly used when doing
-  breadth-first crawls (BFO)
-- a negative value will increase priority, i.e., higher depth requests
-  will be processed sooner (DFO)
+The priority of a request is adjusted as follows::

-See also: :ref:`faq-bfo-dfo` about tuning Scrapy for BFO or DFO.
+    request.priority = request.priority - ( depth * DEPTH_PRIORITY )

+As depth increases, positive values of ``DEPTH_PRIORITY`` decrease request
+priority (BFO), while negative values increase request priority (DFO). See
+also :ref:`faq-bfo-dfo`.

.. note::
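As a concrete illustration of the breadth-first direction, a settings sketch; the FIFO queue classes from ``scrapy.squeues`` are the usual companions but are an assumption here, since the hunk itself only covers ``DEPTH_PRIORITY``::

    # process shallower requests first (breadth-first order)
    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'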
@@ -599,7 +599,7 @@ The amount of time (in secs) that the downloader will wait before timing out.
DOWNLOAD_MAXSIZE
----------------

-Default: `1073741824` (1024MB)
+Default: ``1073741824`` (1024MB)

The maximum response size (in bytes) that downloader will download.

@@ -620,7 +620,7 @@ If you want to disable it set to 0.
DOWNLOAD_WARNSIZE
-----------------

-Default: `33554432` (32MB)
+Default: ``33554432`` (32MB)

The response size (in bytes) that downloader will start to warn.
@@ -43,7 +43,7 @@ previous (or subsequent) middleware being applied.

If you want to disable a builtin middleware (the ones defined in
:setting:`SPIDER_MIDDLEWARES_BASE`, and enabled by default) you must define it
-in your project :setting:`SPIDER_MIDDLEWARES` setting and assign `None` as its
+in your project :setting:`SPIDER_MIDDLEWARES` setting and assign ``None`` as its
value. For example, if you want to disable the off-site middleware::

    SPIDER_MIDDLEWARES = {
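The literal block is cut short by the hunk; assuming only the stock off-site middleware path, the completed example would read roughly::

    SPIDER_MIDDLEWARES = {
        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
    }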
@@ -200,7 +200,7 @@ DepthMiddleware
.. class:: DepthMiddleware

    DepthMiddleware is used for tracking the depth of each Request inside the
-    site being scraped. It works by setting `request.meta['depth'] = 0` whenever
+    site being scraped. It works by setting ``request.meta['depth'] = 0`` whenever
    there is no value previously set (usually just the first Request) and
    incrementing it by 1 otherwise.
@@ -129,7 +129,7 @@ scrapy.Spider

    You probably won't need to override this directly because the default
    implementation acts as a proxy to the :meth:`__init__` method, calling
-    it with the given arguments `args` and named arguments `kwargs`.
+    it with the given arguments ``args`` and named arguments ``kwargs``.

    Nonetheless, this method sets the :attr:`crawler` and :attr:`settings`
    attributes in the new instance so they can be accessed later inside the
@@ -298,13 +298,13 @@ The above example can also be written as follows::

Keep in mind that spider arguments are only strings.
The spider will not do any parsing on its own.
-If you were to set the `start_urls` attribute from the command line,
+If you were to set the ``start_urls`` attribute from the command line,
you would have to parse it on your own into a list
using something like
`ast.literal_eval <https://docs.python.org/library/ast.html#ast.literal_eval>`_
or `json.loads <https://docs.python.org/library/json.html#json.loads>`_
and then set it as an attribute.
-Otherwise, you would cause iteration over a `start_urls` string
+Otherwise, you would cause iteration over a ``start_urls`` string
(a very common python pitfall)
resulting in each character being seen as a separate url.
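A sketch of that parsing step; the spider name, argument format and the choice of ``json.loads`` are illustrative::

    import json

    import scrapy


    class MySpider(scrapy.Spider):
        name = 'example'

        def __init__(self, start_urls=None, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            if start_urls:
                # e.g. scrapy crawl example -a start_urls='["http://example.com/a", "http://example.com/b"]'
                self.start_urls = json.loads(start_urls)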
@@ -22,7 +22,7 @@ To use the packages:

    sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7

-2. Create `/etc/apt/sources.list.d/scrapy.list` file using the following command::
+2. Create ``/etc/apt/sources.list.d/scrapy.list`` file using the following command::

    echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list

@@ -34,7 +34,7 @@ To use the packages:

.. note:: Repeat step 3 if you are trying to upgrade Scrapy.

-.. warning:: `python-scrapy` is a different package provided by official debian
+.. warning:: ``python-scrapy`` is a different package provided by official debian
   repositories, it's very outdated and it isn't supported by Scrapy team.

.. _Scrapinghub: https://scrapinghub.com/
@@ -153,7 +153,7 @@ class CrawlerRunner(object):
        It will call the given Crawler's :meth:`~Crawler.crawl` method, while
        keeping track of it so it can be stopped later.

-        If `crawler_or_spidercls` isn't a :class:`~scrapy.crawler.Crawler`
+        If ``crawler_or_spidercls`` isn't a :class:`~scrapy.crawler.Crawler`
        instance, this method will try to create one using this parameter as
        the spider class given to it.

@@ -188,10 +188,10 @@ class CrawlerRunner(object):
        """
        Return a :class:`~scrapy.crawler.Crawler` object.

-        * If `crawler_or_spidercls` is a Crawler, it is returned as-is.
-        * If `crawler_or_spidercls` is a Spider subclass, a new Crawler
+        * If ``crawler_or_spidercls`` is a Crawler, it is returned as-is.
+        * If ``crawler_or_spidercls`` is a Spider subclass, a new Crawler
          is constructed for it.
-        * If `crawler_or_spidercls` is a string, this function finds
+        * If ``crawler_or_spidercls`` is a string, this function finds
          a spider with this name in a Scrapy project (using spider loader),
          then creates a Crawler instance for it.
        """
@@ -273,7 +273,7 @@ class CrawlerProcess(CrawlerRunner):
        :setting:`REACTOR_THREADPOOL_MAXSIZE`, and installs a DNS cache based
        on :setting:`DNSCACHE_ENABLED` and :setting:`DNSCACHE_SIZE`.

-        If `stop_after_crawl` is True, the reactor will be stopped after all
+        If ``stop_after_crawl`` is True, the reactor will be stopped after all
        crawlers have finished, using :meth:`join`.

        :param boolean stop_after_crawl: stop or not the reactor when all
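For orientation, the usual calling pattern for this class looks roughly like the sketch below; the spider name is a placeholder::

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    process.crawl('myspider')              # a spider name, Spider subclass or Crawler
    process.start(stop_after_crawl=True)   # blocks until all crawls are finished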
@@ -7,9 +7,7 @@ RETRY_TIMES - how many times to retry a failed page
RETRY_HTTP_CODES - which HTTP response codes to retry

Failed pages are collected on the scraping process and rescheduled at the end,
-once the spider has finished crawling all regular (non failed) pages. Once
-there is no more failed pages to retry this middleware sends a signal
-(retry_complete), so other extensions could connect to that signal.
+once the spider has finished crawling all regular (non failed) pages.
"""
import logging
@@ -13,21 +13,21 @@ CRAWLEDMSG = u"Crawled (%(status)s) %(request)s%(request_flags)s (referer: %(ref
class LogFormatter(object):
    """Class for generating log messages for different actions.

-    All methods must return a dictionary listing the parameters `level`, `msg`
-    and `args` which are going to be used for constructing the log message when
-    calling logging.log.
+    All methods must return a dictionary listing the parameters ``level``,
+    ``msg`` and ``args`` which are going to be used for constructing the log
+    message when calling logging.log.

    Dictionary keys for the method outputs:
-    * `level` should be the log level for that action, you can use those
+    * ``level`` should be the log level for that action, you can use those
      from the python logging library: logging.DEBUG, logging.INFO,
      logging.WARNING, logging.ERROR and logging.CRITICAL.

-    * `msg` should be a string that can contain different formatting
-      placeholders. This string, formatted with the provided `args`, is going
-      to be the log message for that action.
+    * ``msg`` should be a string that can contain different formatting
+      placeholders. This string, formatted with the provided ``args``, is
+      going to be the log message for that action.

-    * `args` should be a tuple or dict with the formatting placeholders for
-      `msg`. The final log message is computed as output['msg'] %
+    * ``args`` should be a tuple or dict with the formatting placeholders
+      for ``msg``. The final log message is computed as output['msg'] %
      output['args'].
    """
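A sketch of a custom formatter built on that contract; it assumes the stock ``dropped()`` hook signature and would be enabled through the ``LOG_FORMATTER`` setting, with whatever module path the project uses::

    import logging

    from scrapy import logformatter


    class QuietDropFormatter(logformatter.LogFormatter):
        """Log dropped items at DEBUG instead of WARNING."""

        def dropped(self, item, exception, response, spider):
            return {
                'level': logging.DEBUG,
                'msg': 'Dropped: %(exception)s',
                'args': {'exception': exception},
            }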
@@ -255,13 +255,13 @@ class FilesPipeline(MediaPipeline):
    doing stat of the files and determining if file is new, uptodate or
    expired.

-    `new` files are those that pipeline never processed and needs to be
+    ``new`` files are those that pipeline never processed and needs to be
    downloaded from supplier site the first time.

-    `uptodate` files are the ones that the pipeline processed and are still
+    ``uptodate`` files are the ones that the pipeline processed and are still
    valid files.

-    `expired` files are those that pipeline already processed but the last
+    ``expired`` files are those that pipeline already processed but the last
    modification was made long time ago, so a reprocessing is recommended to
    refresh it in case of change.
@@ -2,7 +2,7 @@ from ftplib import error_perm
from posixpath import dirname


def ftp_makedirs_cwd(ftp, path, first_call=True):
-    """Set the current directory of the FTP connection given in the `ftp`
+    """Set the current directory of the FTP connection given in the ``ftp``
    argument (as a ftplib.FTP object), creating all parent directories if they
    don't exist. The ftplib.FTP object must be already connected and logged in.
    """
@@ -32,7 +32,7 @@ class TopLevelFormatter(logging.Filter):

    Since it can't be set for just one logger (it won't propagate for its
    children), it's going to be set in the root handler, with a parametrized
-    `loggers` list where it should act.
+    ``loggers`` list where it should act.
    """

    def __init__(self, loggers=None):
@@ -97,8 +97,8 @@ def unicode_to_str(text, encoding=None, errors='strict'):


def to_unicode(text, encoding=None, errors='strict'):
-    """Return the unicode representation of a bytes object `text`. If `text`
-    is already an unicode object, return it as-is."""
+    """Return the unicode representation of a bytes object ``text``. If
+    ``text`` is already an unicode object, return it as-is."""
    if isinstance(text, six.text_type):
        return text
    if not isinstance(text, (bytes, six.text_type)):

@@ -110,7 +110,7 @@ def to_unicode(text, encoding=None, errors='strict'):


def to_bytes(text, encoding=None, errors='strict'):
-    """Return the binary representation of `text`. If `text`
+    """Return the binary representation of ``text``. If ``text``
    is already a bytes object, return it as-is."""
    if isinstance(text, bytes):
        return text

@@ -123,7 +123,7 @@ def to_bytes(text, encoding=None, errors='strict'):


def to_native_str(text, encoding=None, errors='strict'):
-    """ Return str representation of `text`
+    """ Return str representation of ``text``
    (bytes in Python 2.x and unicode in Python 3.x). """
    if six.PY2:
        return to_bytes(text, encoding, errors)

@@ -189,7 +189,7 @@ def isbinarytext(text):


def binary_is_text(data):
-    """ Returns `True` if the given ``data`` argument (a ``bytes`` object)
+    """ Returns ``True`` if the given ``data`` argument (a ``bytes`` object)
    does not contain unprintable control characters.
    """
    if not isinstance(data, bytes):

@@ -314,7 +314,7 @@ class WeakKeyCache(object):
@deprecated
def stringify_dict(dct_or_tuples, encoding='utf-8', keys_only=True):
    """Return a (new) dict with unicode keys (and values when "keys_only" is
-    False) of the given dict converted to strings. `dct_or_tuples` can be a
+    False) of the given dict converted to strings. ``dct_or_tuples`` can be a
    dict or a list of tuples, like any dict constructor supports.
    """
    d = {}

@@ -357,10 +357,10 @@ def retry_on_eintr(function, *args, **kw):


def without_none_values(iterable):
-    """Return a copy of `iterable` with all `None` entries removed.
+    """Return a copy of ``iterable`` with all ``None`` entries removed.

-    If `iterable` is a mapping, return a dictionary where all pairs that have
-    value `None` have been removed.
+    If ``iterable`` is a mapping, return a dictionary where all pairs that have
+    value ``None`` have been removed.
    """
    try:
        return {k: v for k, v in six.iteritems(iterable) if v is not None}
@@ -109,12 +109,12 @@ def strip_url(url, strip_credentials=True, strip_default_port=True, origin_only=

    """Strip URL string from some of its components:

-    - `strip_credentials` removes "user:password@"
-    - `strip_default_port` removes ":80" (resp. ":443", ":21")
+    - ``strip_credentials`` removes "user:password@"
+    - ``strip_default_port`` removes ":80" (resp. ":443", ":21")
      from http:// (resp. https://, ftp://) URLs
-    - `origin_only` replaces path component with "/", also dropping
+    - ``origin_only`` replaces path component with "/", also dropping
      query and fragment components ; it also strips credentials
-    - `strip_fragment` drops any #fragment component
+    - ``strip_fragment`` drops any #fragment component
    """

    parsed_url = urlparse(url)
@@ -10,7 +10,8 @@ Status Obsolete (discarded)
SEP-006: Rename of Selectors to Extractors
==========================================

-This SEP proposes a more meaningful naming of XPathSelectors or "Selectors" and their `x` method.
+This SEP proposes a more meaningful naming of XPathSelectors or "Selectors" and
+their ``x`` method.

Motivation
==========

@@ -57,7 +58,7 @@ Additional changes
As the name of the method for performing selection (the ``x`` method) is not
descriptive nor mnemotechnic enough and clearly clashes with ``extract`` method
(x sounds like a short for extract in english), we propose to rename it to
-`select`, `sel` (is shortness if required), or `xpath` after `lxml's
+``select``, ``sel`` (is shortness if required), or ``xpath`` after `lxml's
<http://lxml.de/xpathxslt.html>`_ ``xpath`` method.

Bonus (ItemBuilder)
|
@ -16,7 +16,7 @@ _DATABASES = collections.defaultdict(DummyDB)
|
||||
def open(file, flag='r', mode=0o666):
|
||||
"""Open or create a dummy database compatible.
|
||||
|
||||
Arguments `flag` and `mode` are ignored.
|
||||
Arguments ``flag`` and ``mode`` are ignored.
|
||||
"""
|
||||
# return same instance for same file argument
|
||||
return _DATABASES[file]
|
||||
|
@ -61,7 +61,7 @@ class ShellTest(ProcessTest, SiteTest, unittest.TestCase):
|
||||
|
||||
@defer.inlineCallbacks
|
||||
def test_fetch_redirect_follow_302(self):
|
||||
"""Test that calling `fetch(url)` follows HTTP redirects by default."""
|
||||
"""Test that calling ``fetch(url)`` follows HTTP redirects by default."""
|
||||
url = self.url('/redirect-no-meta-refresh')
|
||||
code = "fetch('{0}')"
|
||||
errcode, out, errout = yield self.execute(['-c', code.format(url)])
|
||||
@ -71,7 +71,7 @@ class ShellTest(ProcessTest, SiteTest, unittest.TestCase):
|
||||
|
||||
@defer.inlineCallbacks
|
||||
def test_fetch_redirect_not_follow_302(self):
|
||||
"""Test that calling `fetch(url, redirect=False)` disables automatic redirects."""
|
||||
"""Test that calling ``fetch(url, redirect=False)`` disables automatic redirects."""
|
||||
url = self.url('/redirect-no-meta-refresh')
|
||||
code = "fetch('{0}', redirect=False)"
|
||||
errcode, out, errout = yield self.execute(['-c', code.format(url)])
|
||||
|