.. _news:

Release notes
=============

0.22.2 (released 2014-02-14)
----------------------------

- fix a reference to nonexistent engine.slots. closes #593 (:commit:`13c099a`)
- downloaderMW doc typo (spiderMW doc copy remnant) (:commit:`8ae11bf`)
- Correct typos (:commit:`1346037`)

0.22.1 (released 2014-02-08)
----------------------------

- localhost666 can resolve under certain circumstances (:commit:`2ec2279`)
- test inspect.stack failure (:commit:`cc3eda3`)
- Handle cases when inspect.stack() fails (:commit:`8cb44f9`)
- Fix wrong checks on subclassing of deprecated classes. closes #581 (:commit:`46d98d6`)
- Docs: 4-space indent for final spider example (:commit:`13846de`)
- Fix HtmlParserLinkExtractor and tests after #485 merge (:commit:`368a946`)
- BaseSgmlLinkExtractor: Fixed the missing space when the link has an inner tag (:commit:`b566388`)
- BaseSgmlLinkExtractor: Added unit test of a link with an inner tag (:commit:`c1cb418`)
- BaseSgmlLinkExtractor: Fixed unknown_endtag() so that it only sets current_link=None when the end tag matches the opening tag (:commit:`7e4d627`)
- Fix tests for Travis-CI build (:commit:`76c7e20`)
- replace unencodeable codepoints with html entities. fixes #562 and #285 (:commit:`5f87b17`)
- RegexLinkExtractor: encode URL unicode value when creating Links (:commit:`d0ee545`)
- Updated the tutorial crawl output with latest output. (:commit:`8da65de`)
- Updated shell docs with the crawler reference and fixed the actual shell output. (:commit:`875b9ab`)
- PEP8 minor edits. (:commit:`f89efaf`)
- Expose current crawler in the scrapy shell. (:commit:`5349cec`)
- Unused re import and PEP8 minor edits. (:commit:`387f414`)
- Ignore None values when using the ItemLoader. (:commit:`0632546`)
- DOC Fixed HTTPCACHE_STORAGE typo in the default value which is now Filesystem instead of Dbm.
  (:commit:`cde9a8c`)
- show ubuntu setup instructions as literal code (:commit:`fb5c9c5`)
- Update Ubuntu installation instructions (:commit:`70fb105`)
- Merge pull request #550 from stray-leone/patch-1 (:commit:`6f70b6a`)
- modify the version of scrapy ubuntu package (:commit:`725900d`)
- fix 0.22.0 release date (:commit:`af0219a`)
- fix typos in news.rst and remove (not released yet) header (:commit:`b7f58f4`)

0.22.0 (released 2014-01-17)
----------------------------

Enhancements
~~~~~~~~~~~~

- [**Backwards incompatible**] Switched HTTPCacheMiddleware backend to filesystem (:issue:`541`).
  To restore the old backend, set `HTTPCACHE_STORAGE` to `scrapy.contrib.httpcache.DbmCacheStorage`
- Proxy \https:// urls using CONNECT method (:issue:`392`, :issue:`397`)
- Add a middleware to crawl AJAX crawlable pages as defined by Google (:issue:`343`)
- Rename scrapy.spider.BaseSpider to scrapy.spider.Spider (:issue:`510`, :issue:`519`)
- Selectors register EXSLT namespaces by default (:issue:`472`)
- Unify item loaders similar to selectors renaming (:issue:`461`)
- Make `RFPDupeFilter` class easily subclassable (:issue:`533`)
- Improve test coverage and forthcoming Python 3 support (:issue:`525`)
- Promote startup info on settings and middleware to INFO level (:issue:`520`)
- Support partials in `get_func_args` util (:issue:`506`, :issue:`504`)
- Allow running individual tests via tox (:issue:`503`)
- Update extensions ignored by link extractors (:issue:`498`)
- Add middleware methods to get files/images/thumbs paths (:issue:`490`)
- Improve offsite middleware tests (:issue:`478`)
- Add a way to skip default Referer header set by RefererMiddleware (:issue:`475`)
- Do not send `x-gzip` in default `Accept-Encoding` header (:issue:`469`)
- Support defining http error handling using settings (:issue:`466`)
- Use modern python idioms wherever you find legacies (:issue:`497`)
- Improve and correct documentation
  (:issue:`527`, :issue:`524`, :issue:`521`, :issue:`517`, :issue:`512`,
  :issue:`505`, :issue:`502`, :issue:`489`, :issue:`465`, :issue:`460`, :issue:`425`, :issue:`536`)

Fixes
~~~~~

- Update Selector class imports in CrawlSpider template (:issue:`484`)
- Fix nonexistent reference to `engine.slots` (:issue:`464`)
- Do not try to call `body_as_unicode()` on a non-TextResponse instance (:issue:`462`)
- Warn when subclassing XPathItemLoader, previously it only warned on instantiation. (:issue:`523`)
- Warn when subclassing XPathSelector, previously it only warned on instantiation. (:issue:`537`)
- Multiple fixes to memory stats (:issue:`531`, :issue:`530`, :issue:`529`)
- Fix overriding url in `FormRequest.from_response()` (:issue:`507`)
- Fix tests runner under pip 1.5 (:issue:`513`)
- Fix logging error when spider name is unicode (:issue:`479`)

0.20.2 (released 2013-12-09)
----------------------------

- Update CrawlSpider Template with Selector changes (:commit:`6d1457d`)
- fix method name in tutorial. closes GH-480 (:commit:`b4fc359`)

0.20.1 (released 2013-11-28)
----------------------------

- include_package_data is required to build wheels from published sources (:commit:`5ba1ad5`)
- process_parallel was leaking the failures on its internal deferreds. closes #458 (:commit:`419a780`)

0.20.0 (released 2013-11-08)
----------------------------

Enhancements
~~~~~~~~~~~~

- New Selector's API including CSS selectors (:issue:`395` and :issue:`426`)
- Request/Response url/body attributes are now immutable (modifying them had been deprecated for a long time)
- :setting:`ITEM_PIPELINES` is now defined as a dict (instead of a list)
- Sitemap spider can fetch alternate URLs (:issue:`360`)
- `Selector.remove_namespaces()` now removes namespaces from element's attributes.
  (:issue:`416`)
- Paved the road for Python 3.3+ (:issue:`435`, :issue:`436`, :issue:`431`, :issue:`452`)
- New item exporter using native python types with nesting support (:issue:`366`)
- Tune HTTP1.1 pool size so it matches concurrency defined by settings (:commit:`b43b5f575`)
- scrapy.mail.MailSender now can connect over TLS or upgrade using STARTTLS (:issue:`327`)
- New FilesPipeline with functionality factored out from ImagesPipeline (:issue:`370`, :issue:`409`)
- Recommend Pillow instead of PIL for image handling (:issue:`317`)
- Added debian packages for Ubuntu quantal and raring (:commit:`86230c0`)
- Mock server (used for tests) can listen for HTTPS requests (:issue:`410`)
- Remove multi spider support from multiple core components
  (:issue:`422`, :issue:`421`, :issue:`420`, :issue:`419`, :issue:`423`, :issue:`418`)
- Travis-CI now tests Scrapy changes against development versions of `w3lib` and `queuelib` python packages.
- Add pypy 2.1 to continuous integration tests (:commit:`ecfa7431`)
- Pylinted, pep8 and removed old-style exceptions from source (:issue:`430`, :issue:`432`)
- Use importlib for parametric imports (:issue:`445`)
- Handle a regression introduced in Python 2.7.5 that affects XmlItemExporter (:issue:`372`)
- Bugfix crawling shutdown on SIGINT (:issue:`450`)
- Do not submit `reset` type inputs in FormRequest.from_response (:commit:`b326b87`)
- Do not silence download errors when request errback raises an exception (:commit:`684cfc0`)

Bugfixes
~~~~~~~~

- Fix tests under Django 1.6 (:commit:`b6bed44c`)
- Lots of bugfixes to retry middleware under disconnections using HTTP 1.1 download handler
- Fix inconsistencies among Twisted releases (:issue:`406`)
- Fix scrapy shell bugs (:issue:`418`, :issue:`407`)
- Fix invalid variable name in setup.py (:issue:`429`)
- Fix tutorial references (:issue:`387`)
- Improve request-response docs (:issue:`391`)
- Improve best practices docs (:issue:`399`, :issue:`400`, :issue:`401`, :issue:`402`)
- Improve django
  integration docs (:issue:`404`)
- Document `bindaddress` request meta (:commit:`37c24e01d7`)
- Improve `Request` class documentation (:issue:`226`)

Other
~~~~~

- Dropped Python 2.6 support (:issue:`448`)
- Add `cssselect`_ python package as install dependency
- Drop libxml2 and multi selector's backend support, `lxml`_ is required from now on.
- Minimum Twisted version increased to 10.0.0, dropped Twisted 8.0 support.
- Running test suite now requires `mock` python library (:issue:`390`)

Thanks
~~~~~~

Thanks to everyone who contributed to this release!

List of contributors sorted by number of commits::

    69 Daniel Graña
    37 Pablo Hoffman
    13 Mikhail Korobov
     9 Alex Cepoi
     9 alexanderlukanin13
     8 Rolando Espinoza La fuente
     8 Lukasz Biedrycki
     6 Nicolas Ramirez
     3 Paul Tremberth
     2 Martin Olveyra
     2 Stefan
     2 Rolando Espinoza
     2 Loren Davie
     2 irgmedeiros
     1 Stefan Koch
     1 Stefan
     1 scraperdragon
     1 Kumara Tharmalingam
     1 Francesco Piccinno
     1 Marcos Campal
     1 Dragon Dave
     1 Capi Etheriel
     1 cacovsky
     1 Berend Iwema

0.18.4 (released 2013-10-10)
----------------------------

- IPython refuses to update the namespace. fix #396 (:commit:`3d32c4f`)
- Fix AlreadyCalledError replacing a request in shell command.
  closes #407 (:commit:`b1d8919`)
- Fix start_requests laziness and early hangs (:commit:`89faf52`)

0.18.3 (released 2013-10-03)
----------------------------

- fix regression on lazy evaluation of start requests (:commit:`12693a5`)
- forms: do not submit reset inputs (:commit:`e429f63`)
- increase unittest timeouts to decrease travis false positive failures (:commit:`912202e`)
- backport master fixes to json exporter (:commit:`cfc2d46`)
- Fix permission and set umask before generating sdist tarball (:commit:`06149e0`)

0.18.2 (released 2013-09-03)
----------------------------

- Backport `scrapy check` command fixes and backward compatible multi crawler process (:issue:`339`)

0.18.1 (released 2013-08-27)
----------------------------

- remove extra import added by cherry picked changes (:commit:`d20304e`)
- fix crawling tests under twisted pre 11.0.0 (:commit:`1994f38`)
- py26 can not format zero length fields {} (:commit:`abf756f`)
- test PotentialDataLoss errors on unbound responses (:commit:`b15470d`)
- Treat responses without content-length or Transfer-Encoding as good responses (:commit:`c4bf324`)
- do not include ResponseFailed if http11 handler is not enabled (:commit:`6cbe684`)
- New HTTP client wraps lost connections in ResponseFailed exception.
  fix #373 (:commit:`1a20bba`)
- limit travis-ci build matrix (:commit:`3b01bb8`)
- Merge pull request #375 from peterarenot/patch-1 (:commit:`fa766d7`)
- Fixed so it refers to the correct folder (:commit:`3283809`)
- added quantal & raring to support ubuntu releases (:commit:`1411923`)
- fix retry middleware which didn't retry certain connection errors after the upgrade to http1 client, closes GH-373 (:commit:`bb35ed0`)
- fix XmlItemExporter in Python 2.7.4 and 2.7.5 (:commit:`de3e451`)
- minor updates to 0.18 release notes (:commit:`c45e5f1`)
- fix contributors list format (:commit:`0b60031`)

0.18.0 (released 2013-08-09)
----------------------------

- Lots of improvements to testsuite run using Tox, including a way to test on pypi
- Handle GET parameters for AJAX crawlable urls (:commit:`3fe2a32`)
- Use lxml recover option to parse sitemaps (:issue:`347`)
- Bugfix cookie merging by hostname and not by netloc (:issue:`352`)
- Support disabling `HttpCompressionMiddleware` using a flag setting (:issue:`359`)
- Support xml namespaces using `iternodes` parser in `XMLFeedSpider` (:issue:`12`)
- Support `dont_cache` request meta flag (:issue:`19`)
- Bugfix `scrapy.utils.gz.gunzip` broken by changes in python 2.7.4 (:commit:`4dc76e`)
- Bugfix url encoding on `SgmlLinkExtractor` (:issue:`24`)
- Bugfix `TakeFirst` processor shouldn't discard zero (0) value (:issue:`59`)
- Support nested items in xml exporter (:issue:`66`)
- Improve cookies handling performance (:issue:`77`)
- Log dupe filtered requests once (:issue:`105`)
- Split redirection middleware into status and meta based middlewares (:issue:`78`)
- Use HTTP1.1 as default downloader handler (:issue:`109` and :issue:`318`)
- Support xpath form selection on `FormRequest.from_response` (:issue:`185`)
- Bugfix unicode decoding error on `SgmlLinkExtractor` (:issue:`199`)
- Bugfix signal dispatching on pypy interpreter (:issue:`205`)
- Improve request delay and concurrency handling (:issue:`206`)
- Add RFC2616 cache policy
  to `HttpCacheMiddleware` (:issue:`212`)
- Allow customization of messages logged by engine (:issue:`214`)
- Multiple improvements to `DjangoItem` (:issue:`217`, :issue:`218`, :issue:`221`)
- Extend Scrapy commands using setuptools entry points (:issue:`260`)
- Allow spider `allowed_domains` value to be set/tuple (:issue:`261`)
- Support `settings.getdict` (:issue:`269`)
- Simplify internal `scrapy.core.scraper` slot handling (:issue:`271`)
- Added `Item.copy` (:issue:`290`)
- Collect idle downloader slots (:issue:`297`)
- Add `ftp://` scheme downloader handler (:issue:`329`)
- Added downloader benchmark webserver and spider tools :ref:`benchmarking`
- Moved persistent (on disk) queues to a separate project (queuelib_) which scrapy now depends on
- Add scrapy commands using external libraries (:issue:`260`)
- Added ``--pdb`` option to ``scrapy`` command line tool
- Added :meth:`XPathSelector.remove_namespaces`, which removes all namespaces from XML documents for convenience (to work with namespace-less XPaths). Documented in :ref:`topics-selectors`.
- Several improvements to spider contracts
- New default middleware named MetaRefreshMiddleware that handles meta-refresh html tag redirections
- MetaRefreshMiddleware and RedirectMiddleware have different priorities to address #62
- added from_crawler method to spiders
- added system tests with mock server
- more improvements to Mac OS compatibility (thanks Alex Cepoi)
- several more cleanups to singletons and multi-spider support (thanks Nicolas Ramirez)
- support custom download slots
- added --spider option to "shell" command.
- log overridden settings when scrapy starts

Thanks to everyone who contributed to this release.
Here is a list of contributors sorted by number of commits::

    130 Pablo Hoffman
     97 Daniel Graña
     20 Nicolás Ramírez
     13 Mikhail Korobov
     12 Pedro Faustino
     11 Steven Almeroth
      5 Rolando Espinoza La fuente
      4 Michal Danilak
      4 Alex Cepoi
      4 Alexandr N Zamaraev (aka tonal)
      3 paul
      3 Martin Olveyra
      3 Jordi Llonch
      3 arijitchakraborty
      2 Shane Evans
      2 joehillen
      2 Hart
      2 Dan
      1 Zuhao Wan
      1 whodatninja
      1 vkrest
      1 tpeng
      1 Tom Mortimer-Jones
      1 Rocio Aramberri
      1 Pedro
      1 notsobad
      1 Natan L
      1 Mark Grey
      1 Luan
      1 Libor Nenadál
      1 Juan M Uys
      1 Jonas Brunsgaard
      1 Ilya Baryshev
      1 Hasnain Lakhani
      1 Emanuel Schorsch
      1 Chris Tilden
      1 Capi Etheriel
      1 cacovsky
      1 Berend Iwema

0.16.5 (released 2013-05-30)
----------------------------

- obey request method when scrapy deploy is redirected to a new endpoint (:commit:`8c4fcee`)
- fix inaccurate downloader middleware documentation. refs #280 (:commit:`40667cb`)
- doc: remove links to diveintopython.org, which is no longer available. closes #246 (:commit:`bd58bfa`)
- Find form nodes in invalid html5 documents (:commit:`e3d6945`)
- Fix typo labeling attrs type bool instead of list (:commit:`a274276`)

0.16.4 (released 2013-01-23)
----------------------------

- fixes spelling errors in documentation (:commit:`6d2b3aa`)
- add doc about disabling an extension. refs #132 (:commit:`c90de33`)
- Fixed error message formatting. log.err() doesn't support cool formatting and when error occurred, the message was: "ERROR: Error processing %(item)s" (:commit:`c16150c`)
- lint and improve images pipeline error logging (:commit:`56b45fc`)
- fixed doc typos (:commit:`243be84`)
- add documentation topics: Broad Crawls & Common Practices (:commit:`1fbb715`)
- fix bug in scrapy parse command when spider is not specified explicitly.
  closes #209 (:commit:`c72e682`)
- Update docs/topics/commands.rst (:commit:`28eac7a`)

0.16.3 (released 2012-12-07)
----------------------------

- Remove concurrency limitation when using download delays and still ensure inter-request delays are enforced (:commit:`487b9b5`)
- add error details when image pipeline fails (:commit:`8232569`)
- improve mac os compatibility (:commit:`8dcf8aa`)
- setup.py: use README.rst to populate long_description (:commit:`7b5310d`)
- doc: removed obsolete references to ClientForm (:commit:`80f9bb6`)
- correct docs for default storage backend (:commit:`2aa491b`)
- doc: removed broken proxyhub link from FAQ (:commit:`bdf61c4`)
- Fixed docs typo in SpiderOpenCloseLogging example (:commit:`7184094`)

0.16.2 (released 2012-11-09)
----------------------------

- scrapy contracts: python2.6 compat (:commit:`a4a9199`)
- scrapy contracts verbose option (:commit:`ec41673`)
- proper unittest-like output for scrapy contracts (:commit:`86635e4`)
- added open_in_browser to debugging doc (:commit:`c9b690d`)
- removed reference to global scrapy stats from settings doc (:commit:`dd55067`)
- Fix SpiderState bug in Windows platforms (:commit:`58998f4`)

0.16.1 (released 2012-10-26)
----------------------------

- fixed LogStats extension, which got broken after a wrong merge before the 0.16 release (:commit:`8c780fd`)
- better backwards compatibility for scrapy.conf.settings (:commit:`3403089`)
- extended documentation on how to access crawler stats from extensions (:commit:`c4da0b5`)
- removed .hgtags (no longer needed now that scrapy uses git) (:commit:`d52c188`)
- fix dashes under rst headers (:commit:`fa4f7f9`)
- set release date for 0.16.0 in news (:commit:`e292246`)

0.16.0 (released 2012-10-18)
----------------------------

Scrapy changes:

- added :ref:`topics-contracts`, a mechanism for testing spiders in a formal/reproducible way
- added options ``-o`` and ``-t`` to the :command:`runspider` command
- documented :doc:`topics/autothrottle` and added to
  extensions installed by default. You still need to enable it with :setting:`AUTOTHROTTLE_ENABLED`
- major Stats Collection refactoring: removed separation of global/per-spider stats, removed stats-related signals (``stats_spider_opened``, etc). Stats are much simpler now, backwards compatibility is kept on the Stats Collector API and signals.
- added :meth:`~scrapy.contrib.spidermiddleware.SpiderMiddleware.process_start_requests` method to spider middlewares
- dropped Signals singleton. Signals should now be accessed through the Crawler.signals attribute. See the signals documentation for more info.
- dropped Stats Collector singleton. Stats can now be accessed through the Crawler.stats attribute. See the stats collection documentation for more info.
- documented :ref:`topics-api`
- `lxml` is now the default selectors backend instead of `libxml2`
- ported FormRequest.from_response() to use `lxml`_ instead of `ClientForm`_
- removed modules: ``scrapy.xlib.BeautifulSoup`` and ``scrapy.xlib.ClientForm``
- SitemapSpider: added support for sitemap urls ending in .xml and .xml.gz, even if they advertise a wrong content type (:commit:`10ed28b`)
- StackTraceDump extension: also dump trackref live references (:commit:`fe2ce93`)
- nested items now fully supported in JSON and JSONLines exporters
- added :reqmeta:`cookiejar` Request meta key to support multiple cookie sessions per spider
- decoupled encoding detection code to `w3lib.encoding`_, and ported Scrapy code to use that module
- dropped support for Python 2.5.
  See http://blog.scrapinghub.com/2012/02/27/scrapy-0-15-dropping-support-for-python-2-5/
- dropped support for Twisted 2.5
- added :setting:`REFERER_ENABLED` setting, to control referer middleware
- changed default user agent to: ``Scrapy/VERSION (+http://scrapy.org)``
- removed (undocumented) ``HTMLImageLinkExtractor`` class from ``scrapy.contrib.linkextractors.image``
- removed per-spider settings (to be replaced by instantiating multiple crawler objects)
- ``USER_AGENT`` spider attribute will no longer work, use ``user_agent`` attribute instead
- ``DOWNLOAD_TIMEOUT`` spider attribute will no longer work, use ``download_timeout`` attribute instead
- removed ``ENCODING_ALIASES`` setting, as encoding auto-detection has been moved to the `w3lib`_ library
- promoted :ref:`topics-djangoitem` to main contrib
- LogFormatter methods now return dicts (instead of strings) to support lazy formatting (:issue:`164`, :commit:`dcef7b0`)
- downloader handlers (:setting:`DOWNLOAD_HANDLERS` setting) now receive settings as the first argument of the constructor
- replaced memory usage accounting with (more portable) `resource`_ module, removed ``scrapy.utils.memory`` module
- removed signal: ``scrapy.mail.mail_sent``
- removed ``TRACK_REFS`` setting, now :ref:`trackrefs ` is always enabled
- DBM is now the default storage backend for HTTP cache middleware
- number of log messages (per level) are now tracked through Scrapy stats (stat name: ``log_count/LEVEL``)
- number of received responses is now tracked through Scrapy stats (stat name: ``response_received_count``)
- removed ``scrapy.log.started`` attribute

0.14.4
------

- added precise to supported ubuntu distros (:commit:`b7e46df`)
- fixed bug in json-rpc webservice reported in https://groups.google.com/d/topic/scrapy-users/qgVBmFybNAQ/discussion. also removed no longer supported 'run' command from extras/scrapy-ws.py (:commit:`340fbdb`)
- meta tag attributes for content-type http equiv can be in any order.
  #123 (:commit:`0cb68af`)
- replace "import Image" by more standard "from PIL import Image". closes #88 (:commit:`4d17048`)
- return trial status as bin/runtests.sh exit value. #118 (:commit:`b7b2e7f`)

0.14.3
------

- forgot to include pydispatch license. #118 (:commit:`fd85f9c`)
- include egg files used by testsuite in source distribution. #118 (:commit:`c897793`)
- update docstring in project template to avoid confusion with genspider command, which may be considered as an advanced feature. refs #107 (:commit:`2548dcc`)
- added note to docs/topics/firebug.rst about google directory being shut down (:commit:`668e352`)
- dont discard slot when empty, just save in another dict in order to recycle if needed again. (:commit:`8e9f607`)
- do not fail handling unicode xpaths in libxml2 backed selectors (:commit:`b830e95`)
- fixed minor mistake in Request objects documentation (:commit:`bf3c9ee`)
- fixed minor defect in link extractors documentation (:commit:`ba14f38`)
- removed some obsolete remaining code related to sqlite support in scrapy (:commit:`0665175`)

0.14.2
------

- move buffer pointing to start of file before computing checksum. refs #92 (:commit:`6a5bef2`)
- Compute image checksum before persisting images. closes #92 (:commit:`9817df1`)
- remove leaking references in cached failures (:commit:`673a120`)
- fixed bug in MemoryUsage extension: get_engine_status() takes exactly 1 argument (0 given) (:commit:`11133e9`)
- fixed struct.error on http compression middleware. closes #87 (:commit:`1423140`)
- ajax crawling wasn't expanding for unicode urls (:commit:`0de3fb4`)
- Catch start_requests iterator errors.
  refs #83 (:commit:`454a21d`)
- Speed-up libxml2 XPathSelector (:commit:`2fbd662`)
- updated versioning doc according to recent changes (:commit:`0a070f5`)
- scrapyd: fixed documentation link (:commit:`2b4e4c3`)
- extras/makedeb.py: no longer obtaining version from git (:commit:`caffe0e`)

0.14.1
------

- extras/makedeb.py: no longer obtaining version from git (:commit:`caffe0e`)
- bumped version to 0.14.1 (:commit:`6cb9e1c`)
- fixed reference to tutorial directory (:commit:`4b86bd6`)
- doc: removed duplicated callback argument from Request.replace() (:commit:`1aeccdd`)
- fixed formatting of scrapyd doc (:commit:`8bf19e6`)
- Dump stacks for all running threads and fix engine status dumped by StackTraceDump extension (:commit:`14a8e6e`)
- added comment about why we disable ssl on boto images upload (:commit:`5223575`)
- SSL handshaking hangs when doing too many parallel connections to S3 (:commit:`63d583d`)
- change tutorial to follow changes on dmoz site (:commit:`bcb3198`)
- Avoid _disconnectedDeferred AttributeError exception in Twisted>=11.1.0 (:commit:`98f3f87`)
- allow spider to set autothrottle max concurrency (:commit:`175a4b5`)

0.14
----

New features and settings
~~~~~~~~~~~~~~~~~~~~~~~~~

- Support for `AJAX crawleable urls`_
- New persistent scheduler that stores requests on disk, allowing to suspend and resume crawls (:rev:`2737`)
- added ``-o`` option to ``scrapy crawl``, a shortcut for dumping scraped items into a file (or standard output using ``-``)
- Added support for passing custom settings to Scrapyd ``schedule.json`` api (:rev:`2779`, :rev:`2783`)
- New ``ChunkedTransferMiddleware`` (enabled by default) to support `chunked transfer encoding`_ (:rev:`2769`)
- Add boto 2.0 support for S3 downloader handler (:rev:`2763`)
- Added `marshal`_ to formats supported by feed exports (:rev:`2744`)
- In request errbacks, offending requests are now received in `failure.request` attribute (:rev:`2738`)
- Big downloader refactoring to support per domain/ip
  concurrency limits (:rev:`2732`)

  - ``CONCURRENT_REQUESTS_PER_SPIDER`` setting has been deprecated and replaced by:

    - :setting:`CONCURRENT_REQUESTS`, :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`, :setting:`CONCURRENT_REQUESTS_PER_IP`
    - check the documentation for more details

- Added builtin caching DNS resolver (:rev:`2728`)
- Moved Amazon AWS-related components/extensions (SQS spider queue, SimpleDB stats collector) to a separate project: `scaws <https://github.com/scrapinghub/scaws>`_ (:rev:`2706`, :rev:`2714`)
- Moved spider queues to scrapyd: `scrapy.spiderqueue` -> `scrapyd.spiderqueue` (:rev:`2708`)
- Moved sqlite utils to scrapyd: `scrapy.utils.sqlite` -> `scrapyd.sqlite` (:rev:`2781`)
- Real support for returning iterators on `start_requests()` method. The iterator is now consumed during the crawl when the spider is getting idle (:rev:`2704`)
- Added :setting:`REDIRECT_ENABLED` setting to quickly enable/disable the redirect middleware (:rev:`2697`)
- Added :setting:`RETRY_ENABLED` setting to quickly enable/disable the retry middleware (:rev:`2694`)
- Added ``CloseSpider`` exception to manually close spiders (:rev:`2691`)
- Improved encoding detection by adding support for HTML5 meta charset declaration (:rev:`2690`)
- Refactored close spider behavior to wait for all downloads to finish and be processed by spiders, before closing the spider (:rev:`2688`)
- Added ``SitemapSpider`` (see documentation in Spiders page) (:rev:`2658`)
- Added ``LogStats`` extension for periodically logging basic stats (like crawled pages and scraped items) (:rev:`2657`)
- Make handling of gzipped responses more robust (#319, :rev:`2643`). Now Scrapy will try and decompress as much as possible from a gzipped response, instead of failing with an `IOError`.
- Simplified MemoryDebugger extension to use stats for dumping memory debugging info (:rev:`2639`)
- Added new command to edit spiders: ``scrapy edit`` (:rev:`2636`) and `-e` flag to `genspider` command that uses it (:rev:`2653`)
- Changed default representation of items to pretty-printed dicts. (:rev:`2631`). This improves default logging by making log more readable in the default case, for both Scraped and Dropped lines.
- Added :signal:`spider_error` signal (:rev:`2628`)
- Added :setting:`COOKIES_ENABLED` setting (:rev:`2625`)
- Stats are now dumped to Scrapy log (default value of :setting:`STATS_DUMP` setting has been changed to `True`). This is to make Scrapy users more aware of Scrapy stats and the data that is collected there.
- Added support for dynamically adjusting download delay and maximum concurrent requests (:rev:`2599`)
- Added new DBM HTTP cache storage backend (:rev:`2576`)
- Added ``listjobs.json`` API to Scrapyd (:rev:`2571`)
- ``CsvItemExporter``: added ``join_multivalued`` parameter (:rev:`2578`)
- Added namespace support to ``xmliter_lxml`` (:rev:`2552`)
- Improved cookies middleware by making `COOKIES_DEBUG` nicer and documenting it (:rev:`2579`)
- Several improvements to Scrapyd and Link extractors

Code rearranged and removed
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Merged item passed and item scraped concepts, as they have often proved confusing in the past.
  This means: (:rev:`2630`)

  - original item_scraped signal was removed
  - original item_passed signal was renamed to item_scraped
  - old log lines ``Scraped Item...`` were removed
  - old log lines ``Passed Item...`` were renamed to ``Scraped Item...`` lines and downgraded to ``DEBUG`` level

- Reduced Scrapy codebase by stripping part of Scrapy code into two new libraries:

  - `w3lib`_ (several functions from ``scrapy.utils.{http,markup,multipart,response,url}``, done in :rev:`2584`)
  - `scrapely`_ (was ``scrapy.contrib.ibl``, done in :rev:`2586`)

- Removed unused function: `scrapy.utils.request.request_info()` (:rev:`2577`)
- Removed googledir project from `examples/googledir`. There's now a new example project called `dirbot` available on github: https://github.com/scrapy/dirbot
- Removed support for default field values in Scrapy items (:rev:`2616`)
- Removed experimental crawlspider v2 (:rev:`2632`)
- Removed scheduler middleware to simplify architecture. Duplicates filter is now done in the scheduler itself, using the same dupe filtering class as before (`DUPEFILTER_CLASS` setting) (:rev:`2640`)
- Removed support for passing urls to ``scrapy crawl`` command (use ``scrapy parse`` instead) (:rev:`2704`)
- Removed deprecated Execution Queue (:rev:`2704`)
- Removed (undocumented) spider context extension (from scrapy.contrib.spidercontext) (:rev:`2780`)
- removed ``CONCURRENT_SPIDERS`` setting (use scrapyd maxproc instead) (:rev:`2789`)
- Renamed attributes of core components: downloader.sites -> downloader.slots, scraper.sites -> scraper.slots (:rev:`2717`, :rev:`2718`)
- Renamed setting ``CLOSESPIDER_ITEMPASSED`` to :setting:`CLOSESPIDER_ITEMCOUNT` (:rev:`2655`). Backwards compatibility kept.

0.12
----

The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

New features and improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Passed item is now sent in the ``item`` argument of the :signal:`item_passed` (#273)
- Added verbose option to ``scrapy version`` command, useful for bug reports (#298)
- HTTP cache now stored by default in the project data dir (#279)
- Added project data storage directory (#276, #277)
- Documented file structure of Scrapy projects (see command-line tool doc)
- New lxml backend for XPath selectors (#147)
- Per-spider settings (#245)
- Support exit codes to signal errors in Scrapy commands (#248)
- Added ``-c`` argument to ``scrapy shell`` command
- Made ``libxml2`` optional (#260)
- New ``deploy`` command (#261)
- Added :setting:`CLOSESPIDER_PAGECOUNT` setting (#253)
- Added :setting:`CLOSESPIDER_ERRORCOUNT` setting (#254)

Scrapyd changes
~~~~~~~~~~~~~~~

- Scrapyd now uses one process per spider
- It stores one log file per spider run, and rotates them keeping the latest 5 logs per spider (by default)
- A minimal web ui was added, available at http://localhost:6800 by default
- There is now a `scrapy server` command to start a Scrapyd server of the current project

Changes to settings
~~~~~~~~~~~~~~~~~~~

- added `HTTPCACHE_ENABLED` setting (False by default) to enable HTTP cache middleware
- changed `HTTPCACHE_EXPIRATION_SECS` semantics: now zero means "never expire".

Deprecated/obsoleted functionality
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Deprecated ``runserver`` command in favor of ``server`` command which starts a Scrapyd server. See also: Scrapyd changes
- Deprecated ``queue`` command in favor of using Scrapyd ``schedule.json`` API. See also: Scrapyd changes
- Removed the LxmlItemLoader (experimental contrib which never graduated to main contrib)

0.10
----

The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

New features and improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- New Scrapy service called ``scrapyd`` for deploying Scrapy crawlers in production (#218) (documentation available)
- Simplified Images pipeline usage which doesn't require subclassing your own images pipeline now (#217)
- Scrapy shell now shows the Scrapy log by default (#206)
- Refactored execution queue in a common base code and pluggable backends called "spider queues" (#220)
- New persistent spider queue (based on SQLite) (#198), available by default, which allows to start Scrapy in server mode and then schedule spiders to run.
- Added documentation for Scrapy command-line tool and all its available sub-commands. (documentation available)
- Feed exporters with pluggable backends (#197) (documentation available)
- Deferred signals (#193)
- Added two new methods to item pipeline open_spider(), close_spider() with deferred support (#195)
- Support for overriding default request headers per spider (#181)
- Replaced default Spider Manager with one with similar functionality but not depending on Twisted Plugins (#186)
- Split Debian package into two packages - the library and the service (#187)
- Scrapy log refactoring (#188)
- New extension for keeping persistent spider contexts among different runs (#203)
- Added `dont_redirect` request.meta key for avoiding redirects (#233)
- Added `dont_retry` request.meta key for avoiding retries (#234)

Command-line tool changes
~~~~~~~~~~~~~~~~~~~~~~~~~

- New `scrapy` command which replaces the old `scrapy-ctl.py` (#199)

  - there is only one global `scrapy` command now, instead of one `scrapy-ctl.py` per project
  - Added `scrapy.bat` script for running more conveniently from Windows

- Added bash completion to command-line tool (#210)
- Renamed command `start` to `runserver` (#209)

API changes
~~~~~~~~~~~

- ``url`` and ``body`` attributes of Request objects are now read-only (#230)
- ``Request.copy()`` and ``Request.replace()`` now also copy their ``callback`` and
  ``errback`` attributes (#231)
- Removed ``UrlFilterMiddleware`` from ``scrapy.contrib`` (already disabled by default)
- Offsite middleware doesn't filter out any request coming from a spider that doesn't have an ``allowed_domains`` attribute (#225)
- Removed Spider Manager ``load()`` method. Now spiders are loaded in the constructor itself.
- Changes to Scrapy Manager (now called "Crawler"):

  - ``scrapy.core.manager.ScrapyManager`` class renamed to ``scrapy.crawler.Crawler``
  - ``scrapy.core.manager.scrapymanager`` singleton moved to ``scrapy.project.crawler``

- Moved module: ``scrapy.contrib.spidermanager`` to ``scrapy.spidermanager``
- Spider Manager singleton moved from ``scrapy.spider.spiders`` to the ``spiders`` attribute of ``scrapy.project.crawler`` singleton.
- moved Stats Collector classes: (#204)

  - ``scrapy.stats.collector.StatsCollector`` to ``scrapy.statscol.StatsCollector``
  - ``scrapy.stats.collector.SimpledbStatsCollector`` to ``scrapy.contrib.statscol.SimpledbStatsCollector``

- default per-command settings are now specified in the ``default_settings`` attribute of command object class (#201)
- changed arguments of Item pipeline ``process_item()`` method from ``(spider, item)`` to ``(item, spider)``

  - backwards compatibility kept (with deprecation warning)

- moved ``scrapy.core.signals`` module to ``scrapy.signals``

  - backwards compatibility kept (with deprecation warning)

- moved ``scrapy.core.exceptions`` module to ``scrapy.exceptions``

  - backwards compatibility kept (with deprecation warning)

- added ``handles_request()`` class method to ``BaseSpider``
- dropped ``scrapy.log.exc()`` function (use ``scrapy.log.err()`` instead)
- dropped ``component`` argument of ``scrapy.log.msg()`` function
- dropped ``scrapy.log.log_level`` attribute
- Added ``from_settings()`` class methods to Spider Manager and Item Pipeline Manager

Changes to settings
~~~~~~~~~~~~~~~~~~~

- Added ``HTTPCACHE_IGNORE_SCHEMES`` setting to ignore certain schemes on ``HttpCacheMiddleware``
  (#225)
- Added ``SPIDER_QUEUE_CLASS`` setting which defines the spider queue to use (#220)
- Added ``KEEP_ALIVE`` setting (#220)
- Removed ``SERVICE_QUEUE`` setting (#220)
- Removed ``COMMANDS_SETTINGS_MODULE`` setting (#201)
- Renamed ``REQUEST_HANDLERS`` to ``DOWNLOAD_HANDLERS`` and made download handlers classes (instead of functions)

0.9
---

The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

New features and improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Added SMTP-AUTH support to ``scrapy.mail``
- New settings added: ``MAIL_USER``, ``MAIL_PASS`` (:rev:`2065` | #149)
- Added new ``scrapy-ctl view`` command to view a URL in the browser, as seen by Scrapy (:rev:`2039`)
- Added web service for controlling Scrapy process (this also deprecates the web console). (:rev:`2053` | #167)
- Support for running Scrapy as a service, for production systems (:rev:`1988`, :rev:`2054`, :rev:`2055`, :rev:`2056`, :rev:`2057` | #168)
- Added wrapper induction library (documentation only available in source code for now).
  (:rev:`2011`)
- Simplified and improved response encoding support (:rev:`1961`, :rev:`1969`)
- Added ``LOG_ENCODING`` setting (:rev:`1956`, documentation available)
- Added ``RANDOMIZE_DOWNLOAD_DELAY`` setting (enabled by default) (:rev:`1923`, doc available)
- ``MailSender`` is no longer IO-blocking (:rev:`1955` | #146)
- Link extractors and the new CrawlSpider now handle relative base tag URLs (:rev:`1960` | #148)
- Several improvements to Item Loaders and processors (:rev:`2022`, :rev:`2023`, :rev:`2024`, :rev:`2025`, :rev:`2026`, :rev:`2027`, :rev:`2028`, :rev:`2029`, :rev:`2030`)
- Added support for adding variables to telnet console (:rev:`2047` | #165)
- Support for requests without callbacks (:rev:`2050` | #166)

API changes
~~~~~~~~~~~

- Change ``Spider.domain_name`` to ``Spider.name`` (SEP-012, :rev:`1975`)
- ``Response.encoding`` is now the detected encoding (:rev:`1961`)
- ``HttpErrorMiddleware`` now returns None or raises an exception (:rev:`2006` | #157)
- ``scrapy.command`` modules relocation (:rev:`2035`, :rev:`2036`, :rev:`2037`)
- Added ``ExecutionQueue`` for feeding spiders to scrape (:rev:`2034`)
- Removed ``ExecutionEngine`` singleton (:rev:`2039`)
- Ported ``S3ImagesStore`` (images pipeline) to use boto and threads (:rev:`2033`)
- Moved module: ``scrapy.management.telnet`` to ``scrapy.telnet`` (:rev:`2047`)

Changes to default settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Changed default ``SCHEDULER_ORDER`` to ``DFO`` (:rev:`1939`)

0.8
---

The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

New features
~~~~~~~~~~~~

- Added ``DEFAULT_RESPONSE_ENCODING`` setting (:rev:`1809`)
- Added ``dont_click`` argument to ``FormRequest.from_response()`` method (:rev:`1813`, :rev:`1816`)
- Added ``clickdata`` argument to ``FormRequest.from_response()`` method (:rev:`1802`, :rev:`1803`)
- Added support for HTTP proxies (``HttpProxyMiddleware``) (:rev:`1781`, :rev:`1785`)
- Offsite spider middleware now logs messages when filtering out requests (:rev:`1841`)

Backwards-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Changed ``scrapy.utils.response.get_meta_refresh()`` signature (:rev:`1804`)
- Removed deprecated ``scrapy.item.ScrapedItem`` class - use ``scrapy.item.Item`` instead (:rev:`1838`)
- Removed deprecated ``scrapy.xpath`` module - use ``scrapy.selector`` instead. (:rev:`1836`)
- Removed deprecated ``core.signals.domain_open`` signal - use ``core.signals.domain_opened`` instead (:rev:`1822`)
- ``log.msg()`` now receives a ``spider`` argument (:rev:`1822`)

  - Old domain argument has been deprecated and will be removed in 0.9. For spiders, you should always use the ``spider`` argument and pass spider references. If you really want to pass a string, use the ``component`` argument instead.

- Changed core signals ``domain_opened``, ``domain_closed``, ``domain_idle``
- Changed Item pipeline to use spiders instead of domains

  - The ``domain`` argument of ``process_item()`` item pipeline method was changed to ``spider``, the new signature is: ``process_item(spider, item)`` (:rev:`1827` | #105)
  - To quickly port your code (to work with Scrapy 0.8) just use ``spider.domain_name`` where you previously used ``domain``.

- Changed Stats API to use spiders instead of domains (:rev:`1849` | #113)

  - ``StatsCollector`` was changed to receive spider references (instead of domains) in its methods (``set_value``, ``inc_value``, etc).
  - added ``StatsCollector.iter_spider_stats()`` method
  - removed ``StatsCollector.list_domains()`` method
  - Also, Stats signals were renamed and now pass around spider references (instead of domains). Here's a summary of the changes:
  - To quickly port your code (to work with Scrapy 0.8) just use ``spider.domain_name`` where you previously used ``domain``. ``spider_stats`` contains exactly the same data as ``domain_stats``.

- ``CloseDomain`` extension moved to ``scrapy.contrib.closespider.CloseSpider`` (:rev:`1833`)

  - Its settings were also renamed:

    - ``CLOSEDOMAIN_TIMEOUT`` to ``CLOSESPIDER_TIMEOUT``
    - ``CLOSEDOMAIN_ITEMCOUNT`` to ``CLOSESPIDER_ITEMCOUNT``

- Removed deprecated ``SCRAPYSETTINGS_MODULE`` environment variable - use ``SCRAPY_SETTINGS_MODULE`` instead (:rev:`1840`)
- Renamed setting: ``REQUESTS_PER_DOMAIN`` to ``CONCURRENT_REQUESTS_PER_SPIDER`` (:rev:`1830`, :rev:`1844`)
- Renamed setting: ``CONCURRENT_DOMAINS`` to ``CONCURRENT_SPIDERS`` (:rev:`1830`)
- Refactored HTTP Cache middleware

  - HTTP Cache middleware has been heavily refactored, retaining the same functionality except for the domain sectorization which was removed. (:rev:`1843`)

- Renamed exception: ``DontCloseDomain`` to ``DontCloseSpider`` (:rev:`1859` | #120)
- Renamed extension: ``DelayedCloseDomain`` to ``SpiderCloseDelay`` (:rev:`1861` | #121)
- Removed obsolete ``scrapy.utils.markup.remove_escape_chars`` function - use ``scrapy.utils.markup.replace_escape_chars`` instead (:rev:`1865`)

0.7
---

First release of Scrapy.

.. _AJAX crawleable urls: http://code.google.com/web/ajaxcrawling/docs/getting-started.html
.. _chunked transfer encoding: http://en.wikipedia.org/wiki/Chunked_transfer_encoding
.. _w3lib: https://github.com/scrapy/w3lib
.. _scrapely: https://github.com/scrapy/scrapely
.. _marshal: http://docs.python.org/library/marshal.html
.. _w3lib.encoding: https://github.com/scrapy/w3lib/blob/master/w3lib/encoding.py
.. _lxml: http://lxml.de/
.. _ClientForm: http://wwwsearch.sourceforge.net/old/ClientForm/
.. _resource: http://docs.python.org/library/resource.html
.. _queuelib: https://github.com/scrapy/queuelib
.. _cssselect: https://github.com/SimonSapin/cssselect