crawls.
* requests are serialized (using marshal by default) and stored on disk, using
one queue per priority
* request priorities must be integers now
* breadth-first and depth-first crawling orders can now be configured
through a new DEPTH_PRIORITY setting (see doc); a settings sketch follows
this list. Backwards compatibility with SCHEDULER_ORDER was kept.
* requests that can't be serialized (for example, requests with
non-serializable callbacks) are always kept in memory queues
* adapted crawl spider to work with the persistent scheduler
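A minimal settings sketch for the new ordering knob (the value is only
illustrative; see the DEPTH_PRIORITY documentation for the full story):

    # settings.py -- illustrative value
    # A positive DEPTH_PRIORITY lowers the priority of deeper requests,
    # leaning towards breadth-first crawling; a negative value leans
    # towards depth-first; 0 leaves request priorities untouched.
    DEPTH_PRIORITY = 1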
instead of global limits per spider which were a bit useless.
This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new
settings (see the sketch after this note):
* CONCURRENT_REQUESTS
* CONCURRENT_REQUESTS_PER_DOMAIN
* CONCURRENT_REQUESTS_PER_IP (overrides per domain)
The AutoThrottle extension had to be disabled, but will be ported and
re-enabled soon.
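A settings sketch of the three new settings (the values below are only
examples, not recommendations):

    # settings.py -- example values only
    CONCURRENT_REQUESTS = 32             # global cap on concurrent requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap per target domain
    CONCURRENT_REQUESTS_PER_IP = 4       # if non-zero, overrides the per-domain cap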
* removed execution queue (replaced by newer spider queues)
* added real support for returning iterators in Spider.start_requests()
(a sketch follows this list)
* removed support for passing urls to 'scrapy crawl' command
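A minimal sketch of a spider returning an iterator from start_requests();
the spider name and URLs are made up, and the import paths shown are the
present-day ones, which may differ from this release:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"

        def start_requests(self):
            # start_requests() may now be a generator (any iterator),
            # so start requests don't have to be built as a list up front
            for n in range(3):
                yield scrapy.Request("http://example.com/page/%d" % n)

        def parse(self, response):
            pass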
confusing in the past.
This means:
* original item_scraped signal was removed
* original item_passed signal was renamed to item_scraped
* old log lines "Scraped Item..." removed
* old log lines "Passed Item..." renamed to "Scraped Item..."
* added support for ignoring links with common file extensions that should
not be followed (a sketch follows this list)
* fixed link extractor documentation issues
* slightly improved performance of applying filters
* added link to link extractors doc from documentation index
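A sketch of tweaking the ignore list on a link extractor (the import path
and the deny_extensions argument are the present-day ones; older releases
may spell them differently):

    from scrapy.linkextractors import IGNORED_EXTENSIONS, LinkExtractor

    # start from the built-in list of ignored extensions and add one more
    extractor = LinkExtractor(deny_extensions=list(IGNORED_EXTENSIONS) + ["apk"])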
- launcher no longer knows about egg storage
- removed the get_spider_list_from_eggifile() function and replaced it with a
simpler get_spider_list() which doesn't receive an egg file as argument
- changed "egg runner" name to just "runner" to reflect the fact that it
doesn't necessarily run eggs (though it does in the default case)
* added a minimal web UI
* return a unique id per job (spider scheduled)
* store one log per spider run (job) and rotate them, keeping the last N logs (where N is configurable through settings)
* Always start one process per spider
* Added max_proc_per_cpu option (defaults to 4); a config sketch follows this list
* Return the number of spiders (instead of a list of them) in schedule.json
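A hedged scrapyd.conf sketch for the new option (the value shown is just
the default mentioned above):

    [scrapyd]
    # upper bound on the number of concurrent Scrapy processes per CPU
    max_proc_per_cpu = 4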