crawls.
* requests are serialized (using marshal by default) and stored on disk, using
one queue per priority
* request priorities must be integers now
* breadth-first and depth-first crawling orders can now be configured
through a new DEPTH_PRIORITY setting (see doc). backwards compatibility with
SCHEDULER_ORDER was kept.
* requests that can't be serialized (for example, those with non-serializable
callbacks) are always kept in memory queues
* adapted crawl spider to work with persistent scheduler
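The queueing scheme described above can be sketched roughly as follows. This is a toy illustration, not Scrapy's scheduler: the class name SketchScheduler, the file naming, the use of plain dicts as requests, FIFO order within a priority and "higher priority first" are all assumptions made for the example.

```python
import marshal
import os
import tempfile
from collections import defaultdict, deque


class SketchScheduler:
    """Toy scheduler: one on-disk queue file per integer priority, plus
    in-memory queues for requests that marshal cannot serialize."""

    def __init__(self, directory):
        self.directory = directory
        self.memory = defaultdict(deque)  # priority -> unserializable requests

    def _path(self, priority):
        return os.path.join(self.directory, "p%d.queue" % priority)

    def _load(self, priority):
        path = self._path(priority)
        if not os.path.exists(path):
            return []
        with open(path, "rb") as f:
            return marshal.load(f)

    def _store(self, priority, items):
        if items:
            with open(self._path(priority), "wb") as f:
                marshal.dump(items, f)
        elif os.path.exists(self._path(priority)):
            os.remove(self._path(priority))  # drop empty queue files

    def push(self, request, priority=0):
        if not isinstance(priority, int):
            raise TypeError("request priorities must be integers")
        try:
            marshal.dumps(request)  # probe: ValueError if not serializable
        except ValueError:
            self.memory[priority].append(request)  # memory-only fallback
            return
        self._store(priority, self._load(priority) + [request])

    def pop(self):
        on_disk = {int(name[1:-len(".queue")])
                   for name in os.listdir(self.directory)
                   if name.startswith("p") and name.endswith(".queue")}
        in_memory = {p for p, q in self.memory.items() if q}
        candidates = on_disk | in_memory
        if not candidates:
            return None
        priority = max(candidates)  # assumption: higher values go first
        if priority in in_memory:
            return self.memory[priority].popleft()
        items = self._load(priority)
        request = items.pop(0)  # FIFO within one priority (assumption)
        self._store(priority, items)
        return request


# Usage: serializable requests survive a "restart", the callback one does not.
workdir = tempfile.mkdtemp()
scheduler = SketchScheduler(workdir)
scheduler.push({"url": "http://example.com/a"}, priority=0)
scheduler.push({"url": "http://example.com/b"}, priority=5)
scheduler.push({"url": "http://example.com/c", "callback": print}, priority=1)
resumed = SketchScheduler(workdir)  # new instance, same directory
```

Rewriting a whole queue file on every push and pop is only acceptable in a sketch; a real disk queue would append and truncate instead.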
instead of global limits per spider which were a bit useless.
This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new
settings:
* CONCURRENT_REQUESTS
* CONCURRENT_REQUESTS_PER_DOMAIN
* CONCURRENT_REQUESTS_PER_IP (overrides per domain)
The AutoThrottle extension had to be disabled, but will be ported and
re-enabled soon.
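As a rough illustration of how the three settings might be combined in a project's settings module: the values below are examples only, and the helper effective_slot_limit is hypothetical, written to spell out the "overrides per domain" note under the assumption that a per-IP value of 0 means no per-IP limit.

```python
# settings.py fragment; the values are illustrative, not recommendations.
CONCURRENT_REQUESTS = 16            # overall cap on requests in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap for any single domain
CONCURRENT_REQUESTS_PER_IP = 0      # assumed: 0 means "no per-IP limit"


def effective_slot_limit(per_domain, per_ip):
    """Hypothetical helper: a non-zero per-IP limit replaces the
    per-domain limit, otherwise the per-domain limit applies."""
    return per_ip if per_ip else per_domain


limit = effective_slot_limit(CONCURRENT_REQUESTS_PER_DOMAIN,
                             CONCURRENT_REQUESTS_PER_IP)
```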
* moved scrapy.stats.collector.__init__ module to scrapy.statscol
* moved scrapy.stats.collector.simpledb module to scrapy.contrib.statscol
* moved signals from scrapy.stats.signals to scrapy.signals
* moved scrapy/stats/__init__.py to scrapy/stats.py
* updated documentation and tests accordingly
--HG--
rename : scrapy/stats/collector/simpledb.py => scrapy/contrib/statscol.py
rename : scrapy/stats/__init__.py => scrapy/stats.py
rename : scrapy/stats/collector/__init__.py => scrapy/statscol.py
* added encoding aliases, configurable through a new ENCODING_ALIASES setting
* Response.encoding now returns the real encoding detected for the body
* simplified TextResponse API by removing body_encoding() and
headers_encoding() methods
* Response.encoding now always tries to infer the encoding from the body
(previously this was done only on HtmlResponse and TextResponse)
* removed scrapy.utils.encoding.add_encoding_alias() function
* updated implementation of scrapy.utils.response function to reflect these API
changes
* updated documentation to reflect API changes
* simplified code
* performance improvements
* removed awkward/unused domain sectorization
* it can now receive Settings on constructor
* added unit tests
* added documentation about filesystem storage structure
Also made scrapy.conf.Settings objects instantiable with a dict which is used to override default settings.
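A minimal sketch of the dict-override idea, assuming lookups check the passed dict first and fall back to defaults. SketchSettings and DEFAULTS are hypothetical stand-ins; this is not the scrapy.conf.Settings implementation.

```python
# Hypothetical defaults, for illustration only.
DEFAULTS = {"BOT_NAME": "scrapybot", "CONCURRENT_REQUESTS": 16}


class SketchSettings:
    """Stand-in settings object: values passed to the constructor
    take precedence over the defaults."""

    def __init__(self, values=None):
        self.overrides = dict(values or {})

    def __getitem__(self, name):
        if name in self.overrides:
            return self.overrides[name]
        return DEFAULTS.get(name)  # None when the name is unknown

    def getint(self, name, default=0):
        value = self[name]
        return int(value) if value is not None else default


settings = SketchSettings({"CONCURRENT_REQUESTS": "4"})
```

Being able to build such an object from a plain dict makes components that receive settings on their constructor easy to exercise in tests.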
* add REQUEST_HANDLERS setting with defaults for file, http and https schemes
* add documentation of new setting
* add unit tests for all the builtin handlers
* remove unused getPage function
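The scheme-to-handler mapping can be pictured with a small sketch. It assumes the setting maps URL schemes to handlers; plain functions stand in for whatever the real setting holds, and handle_file, handle_http and dispatch are hypothetical names.

```python
from urllib.parse import urlparse


def handle_file(url):
    """Hypothetical handler for file:// URLs."""
    return "file handler got " + url


def handle_http(url):
    """Hypothetical handler shared by http and https."""
    return "http handler got " + url


# One entry per scheme, mirroring the defaults named above.
REQUEST_HANDLERS = {
    "file": handle_file,
    "http": handle_http,
    "https": handle_http,
}


def dispatch(url, handlers=REQUEST_HANDLERS):
    """Pick the handler registered for the URL's scheme and call it."""
    scheme = urlparse(url).scheme
    try:
        handler = handlers[scheme]
    except KeyError:
        raise ValueError("unsupported URL scheme: %r" % scheme)
    return handler(url)


result = dispatch("https://example.com/index.html")
```

Keeping the mapping in a setting means a project could register a handler for another scheme without touching the dispatch code.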
This is to avoid accessing the scrapy.spider.spiders singleton for "resolving"
spiders, which is considered an "evil" practice because it ties us to the
singleton model for the spider resolver, which is a bad thing.
This change will also work as the foundation for the API cleanup that we'll
perform for 0.8. We decided to introduce this change now so that 0.7 and 0.8
share more of their codebase, which will allow us to better support 0.7 until
0.8 is released.
However, this change doesn't modify the stable/documented API, nor does it
change the core logic. Those changes will land on the 0.8 branch, after 0.7 is
released.
--HG--
rename : scrapy/contrib/domainsch.py => scrapy/contrib/spiderscheduler.py