mirror of https://github.com/scrapy/scrapy.git synced 2025-03-01 18:08:14 +00:00

754 Commits

Author SHA1 Message Date
Pablo Hoffman
ce7a787970 Big downloader refactoring to support real concurrency limits per domain/ip,
instead of global limits per spider which were a bit useless.

This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new
settings:

* CONCURRENT_REQUESTS
* CONCURRENT_REQUESTS_PER_DOMAIN
* CONCURRENT_REQUESTS_PER_IP (overrides per domain)

The AutoThrottle extension had to be disabled, but will be ported and
re-enabled soon.
2011-07-27 13:38:09 -03:00
Pablo Hoffman
2ac08a713d downloader: renamed SpiderInfo to Slot, for consistency with engine and scraper names 2011-07-22 02:06:10 -03:00
Pablo Hoffman
0e008268e1 removed SimpledbStatsCollector from scrapy code, it was moved to https://github.com/scrapinghub/scaws 2011-07-20 10:38:16 -03:00
Pablo Hoffman
84f518fc5e More core changes:
* removed execution queue (replaced by newer spider queues)
* added real support for returning iterators in Spider.start_requests()
* removed support for passing urls to 'scrapy crawl' command
2011-07-15 15:18:39 -03:00
Pablo Hoffman
dbad1373f1 Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12 2011-07-13 18:44:54 -03:00
Pablo Hoffman
18cb4ff1d8 added natty to list of supported ubuntu distros 2011-07-13 18:43:52 -03:00
Pablo Hoffman
39a2ea97c8 redirect mw: added REDIRECT_ENABLED setting and documented the other settings 2011-07-13 14:18:15 -03:00
Pablo Hoffman
541ed3913b retry middleware: added RETRY_ENABLED setting and documented the other settings more properly, also improved messages when no longer retrying requests 2011-07-13 11:55:05 -03:00
Pablo Hoffman
4fde1ef94d added CloseSpider exception, to manually close spiders 2011-07-12 14:24:10 -03:00
Pablo Hoffman
db5cae7c03 SitemapSpider: added support for filtering which sitemaps to follow (patch contributed by Rolando Espinoza). closes #330 2011-06-23 18:18:29 -03:00
Pablo Hoffman
57c43fdce6 added SitemapSpider, with tests and doc 2011-06-15 11:54:34 -03:00
Pablo Hoffman
91dc46539f added LogStats extension for periodically logging basic stats (like crawled pages and scraped items) 2011-06-14 00:50:05 -03:00
Pablo Hoffman
841e9913db renamed CLOSESPIDER_ITEMPASSED setting to CLOSESPIDER_ITEMCOUNT, to follow the refactoring done in r2630 2011-06-13 16:58:51 -03:00
Pablo Hoffman
474cba512c simplified MemoryDebugger extension to use stats for dumping memory debugging info 2011-06-06 03:13:28 -03:00
Pablo Hoffman
5fbc32c015 call stats collector engine_stopped() after the engine is closed (to make sure all data from extensions has been collected), and added that method to documented api 2011-06-06 03:12:40 -03:00
Pablo Hoffman
9d9c8877da added 'scrapy edit' command 2011-06-05 22:02:56 -03:00
Pablo Hoffman
1bc2339bb8 Merged item passed and item scraped concepts, as they have often proved
confusing in the past.

This means:

* original item_scraped signal was removed
* original item_passed signal was renamed to item_scraped
* old log lines "Scraped Item..." removed
* old log lines "Passed Item..." renamed to "Scraped Item..."
2011-06-03 01:13:00 -03:00
Pablo Hoffman
e6091df551 fixed doc typo 2011-05-30 09:04:31 -03:00
Pablo Hoffman
1d98fc8fb5 added spider_error signal 2011-05-29 22:38:17 -03:00
Pablo Hoffman
2fa0f75f2d added COOKIES_ENABLED setting to support disabling the cookies middleware 2011-05-27 00:35:34 -03:00
Pablo Hoffman
d72d3f4607 stack trace dump extension: also dump engine status, and support triggering it with SIGQUIT, besides SIGUSR2 2011-05-20 03:25:00 -03:00
Pablo Hoffman
951ba507f9 Removed support for default values in Scrapy items, which have proven confusing in the past 2011-05-19 21:42:46 -03:00
Pablo Hoffman
503f302010 removed remaining references to scheduler middleware from doc, as it will be removed on next release 2011-05-18 19:48:48 -03:00
Pablo Hoffman
3fd17432cf fixed outdated documentation 2011-05-18 14:46:20 -03:00
Pablo Hoffman
cd85c12c33 Some Link extractor improvements:
* added support for ignoring common file extensions that are not followed if
  they occur in links
* fixed link extractor documentation issues
* slightly improved performance of applying filters
* added link to link extractors doc from documentation index
2011-05-18 12:32:34 -03:00
Pablo Hoffman
495152bd50 disabled verbose depth stats collection by default, added DEPTH_STATS_VERBOSE setting to enable it 2011-05-18 11:04:48 -03:00
Pablo Hoffman
accb6ed830 dump stats to log by default (i.e. change default value of STATS_DUMP to True) 2011-05-17 22:42:05 -03:00
Pablo Hoffman
b12dd76bb8 Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12 2011-04-25 09:31:18 -03:00
Pablo Hoffman
678f08bc1b added warning about using 'parse' as callback in crawl spider rules 2011-04-25 09:30:42 -03:00
Pablo Hoffman
ad496eb3b6 Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12 2011-04-14 12:36:27 -03:00
Pablo Hoffman
ecb4f44cbc Added clarification on how to work with local settings and scrapy deploy 2011-04-14 12:36:09 -03:00
Pablo Hoffman
3ee2c94e93 Improved cookies middleware by making COOKIES_DEBUG nicer and documenting it 2011-04-06 14:54:48 -03:00
Pablo Hoffman
8a5c08a6bc added join_multivalued parameter to CsvItemExporter 2011-03-24 13:15:52 -03:00
Pablo Hoffman
3954e600ca added DBM storage backend for HTTP cache 2011-03-23 21:32:02 -03:00
Pablo Hoffman
cfd11df539 Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12 2011-02-24 15:28:57 -02:00
Pablo Hoffman
8f7e163b04 Fixed wrong method name in downloader middleware documentation 2011-02-24 15:26:32 -02:00
Pablo Hoffman
c91f0d9ea1 Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12 2011-02-04 13:39:54 -02:00
Pablo Hoffman
c5499ead73 Clarified behaviour when multiple rules match the same link in CrawlSpider 2011-02-04 13:39:12 -02:00
Pablo Hoffman
d7f193cbea bumped version to 0.13 in documentation 2011-01-02 17:29:43 -02:00
Pablo Hoffman
b56e933be9 bumped version to 0.12 in documentation 2011-01-02 17:28:33 -02:00
Pablo Hoffman
fa644f7a5e Some simplifications to Scrapyd architecture and internals:
- launcher no longer knows about egg storage
- removed get_spider_list_from_eggifile() file and replaced by simpler
  get_spider_list() which doesn't receive an egg file as argument
- changed "egg runner" name to just "runner" to reflect the fact that it
  doesn't necessarily run eggs (though it does in the default case)

--HG--
rename : scrapyd/eggrunner.py => scrapyd/runner.py
2010-12-27 16:22:32 -02:00
Pablo Hoffman
544308d6d0 updated ubuntu repos doc, in preparation for the 0.11 release 2010-12-21 11:02:56 -02:00
Pablo Hoffman
002abf204f Updated item_passed signal to send passed item in 'item' argument, instead of 'output' argument, keeping backwards compatibility for the 'output' argument. Closes #273 2010-12-13 14:05:47 -02:00
Pablo Hoffman
f984d438a0 updated docs to use scrapy version on aptitude install lines 2010-12-13 14:02:42 -02:00
Pablo Hoffman
119fd20e91 Added verbose option to 'version' command. Closes #298 2010-12-13 00:32:44 -02:00
Pablo Hoffman
6a1b69c93f renamed command 'scrapyd' to 'server', and deprecated 'runserver' and 'queue' commands
--HG--
rename : scrapy/commands/scrapyd.py => scrapy/commands/server.py
2010-11-30 20:23:27 -02:00
Pablo Hoffman
df54ed0041 Some Scrapyd enhancements:
* added minimal web ui
* return unique id per job (spider scheduled)
* store one log per spider run (job) and rotate them, keeping the last N logs (where N is configurable through settings)
2010-11-30 02:26:31 -02:00
Pablo Hoffman
bbffa59497 Some changes to Scrapyd:
* Always start one process per spider
* Added max_proc_per_cpu option (defaults to 4)
* Return the number of spiders (instead of a list of them) in schedule.json
2010-11-29 17:19:05 -02:00
Pablo Hoffman
2557777c39 Updated doc referring to HTTP cache middleware 2010-11-24 13:27:44 -02:00
Pablo Hoffman
91a7c25797 * Made Response.meta attribute map to Request.meta attribute. Closes #290
* Record redirected URLs in redirect middleware. Closes #291
2010-11-18 12:51:54 -02:00