scrapy

mirror of https://github.com/scrapy/scrapy.git synced 2025-02-24 19:24:12 +00:00

Author	SHA1	Message	Date
Pablo Hoffman	9f60c27612	added setting to support disabling DNS cache: DNSCACHE_ENABLED	2011-08-05 20:41:59 -03:00
Pablo Hoffman	cb95d7a5af	added marshal to formats supported by feed exports	2011-08-03 16:16:48 -03:00
Pablo Hoffman	549725215e	Initial support for a persistent scheduler, to support pausing and resuming crawls. * requests are serialized (using marshal by default) and stored on disk, using one queue per priority * request priorities must be integers now * breadh-first and depth-first crawling orders can now be configured through a new DEPTH_PRIORITY setting (see doc). backwards compatilibty with SCHEDULER_ORDER was kept. * requests that can't be serialized (for example, non serializable callbacks) are always kept in memory queues * adapted crawl spider to work with persitent scheduler	2011-08-02 11:57:55 -03:00
Pablo Hoffman	f354a49d0f	added FAQ about preventing bots getting banned	2011-07-28 00:40:30 -03:00
Pablo Hoffman	ce7a787970	Big downloader refactoring to support real concurrency limits per domain/ip, instead of global limits per spider which were a bit useless. This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds thre new settings: * CONCURRENT_REQUESTS * CONCURRENT_REQUESTS_PER_DOMAIN * CONCURRENT_REQUESTS_PER_IP (overrides per domain) The AutoThrottle extension had to be disabled, but will be ported and re-enabled soon.	2011-07-27 13:38:09 -03:00
Pablo Hoffman	c59340150f	Added cached DNS resolver based on old caching resolver extension from scrapy.contrib.resolver. This new one is not an extension, it comes builtin and always enabled.	2011-07-27 03:45:15 -03:00
Pablo Hoffman	2ac08a713d	downloader: renamed SpiderInfo to Slot, for consistency with engine and scraper names	2011-07-22 02:06:10 -03:00
Pablo Hoffman	0e008268e1	removed SimpledbStatsCollector from scrapy code, it was moved to https://github.com/scrapinghub/scaws	2011-07-20 10:38:16 -03:00
Pablo Hoffman	b6b0a54d9f	removed FAQ entry	2011-07-20 01:31:36 -03:00
Pablo Hoffman	e3f640c7bf	added FAQ entry about scrapy deploy issue on Mac + Python 2.5	2011-07-19 19:53:32 -03:00
Pablo Hoffman	84f518fc5e	More core changes: * removed execution queue (replaced by newer spider queues) * added real support for returning iterators in Spider.start_requests() * removed support for passing urls to 'scrapy crawl' command	2011-07-15 15:18:39 -03:00
Pablo Hoffman	dbad1373f1	Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12	2011-07-13 18:44:54 -03:00
Pablo Hoffman	18cb4ff1d8	added natty to list of supporte ubuntu distros	2011-07-13 18:43:52 -03:00
Pablo Hoffman	39a2ea97c8	redirect mw: added REDIRECT_ENABLED setting and documented the other settings	2011-07-13 14:18:15 -03:00
Pablo Hoffman	541ed3913b	retry middleware: added RETRY_ENABLED setting and documented the other settings more properly, also improved messages when no longer retrying requests	2011-07-13 11:55:05 -03:00
Pablo Hoffman	763f3dc628	minor update to doc	2011-07-12 19:56:39 -03:00
Pablo Hoffman	bfda9ec319	added clarification about scrapy versioning including the recently adopted odd/even versioning scheme --HG-- rename : docs/api-stability.rst => docs/versioning.rst	2011-07-12 19:53:23 -03:00
Pablo Hoffman	4fde1ef94d	added CloseSpider exception, to manually close spiders	2011-07-12 14:24:10 -03:00
Pablo Hoffman	db5cae7c03	SitemapSpider: added support for filtering which sitemaps to follow (patch contributed by Rolando Espinoza). closes #330	2011-06-23 18:18:29 -03:00
Pablo Hoffman	57c43fdce6	added SitemapSpider, with tests and doc	2011-06-15 11:54:34 -03:00
Pablo Hoffman	91dc46539f	added LogStats extension for periodically logging basic stats (like crawled pages and scraped items)	2011-06-14 00:50:05 -03:00
Pablo Hoffman	841e9913db	renamed CLOSESPIDER_ITEMPASSED setting to CLOSESPIDER_ITEMCOUNT, to follow the refactoring done in r2630	2011-06-13 16:58:51 -03:00
Pablo Hoffman	474cba512c	simplified MemoryDebugger extension to use stats for dumping memory debugging info	2011-06-06 03:13:28 -03:00
Pablo Hoffman	5fbc32c015	call stats collector engine_stopped() after the engine is closed (to make sure all data from extensions has been collected), and added that method to documented api	2011-06-06 03:12:40 -03:00
Pablo Hoffman	9d9c8877da	added 'scrapy edit' command	2011-06-05 22:02:56 -03:00
Pablo Hoffman	03ae481cad	removed experimental crawlspider v2	2011-06-03 18:23:23 -03:00
Pablo Hoffman	5bf733b6f6	Changed default representation of items to pretty-printed dicts. This improves default logging by making log more readable in the default case, for both Scraped and Dropped lines. Projects can still customize how items are represented by overriding the item's __str__ method, as usual.	2011-06-03 01:13:01 -03:00
Pablo Hoffman	1bc2339bb8	Merged item passed and item scraped concepts, as they have often proved confusing in the past. This means: * original item_scraped signal was removed * original item_passed signal was renamed to item_scraped * old log lines "Scraped Item..." removed * old log lines "Passed Item..." renamed to "Scraped Item..."	2011-06-03 01:13:00 -03:00
Pablo Hoffman	e6091df551	fixed doc typo	2011-05-30 09:04:31 -03:00
Pablo Hoffman	1d98fc8fb5	added spider_error signal	2011-05-29 22:38:17 -03:00
Pablo Hoffman	2fa0f75f2d	added COOKIES_ENABLED setting to support disabling the cookies middleware	2011-05-27 00:35:34 -03:00
Pablo Hoffman	d72d3f4607	stack trace dump extension: also dump engine status, and support triggering it with SIGQUIT, besides SIGUSR2	2011-05-20 03:25:00 -03:00
Pablo Hoffman	951ba507f9	Removed support for default values in Scrapy items, which have proven confusing in the past	2011-05-19 21:42:46 -03:00
Pablo Hoffman	503f302010	removed remaining references to scheduler middleware from doc, as it will be removed on next release	2011-05-18 19:48:48 -03:00
Pablo Hoffman	3fd17432cf	fixed outdated documentation	2011-05-18 14:46:20 -03:00
Pablo Hoffman	9016e7e993	added role to link to scrapy source code (not yet used)	2011-05-18 14:43:34 -03:00
Pablo Hoffman	cd85c12c33	Some Link extractor improvements: * added support for ignoring common file extensions that are not followed if they occur in links * fixed link extractor documentation issues * slighly improved performance of applying filters * added link to link extractors doc from documentation index	2011-05-18 12:32:34 -03:00
Pablo Hoffman	495152bd50	disabled verbose depth stats collection by default, added DEPTH_STATS_VERBOSE setting to enable it	2011-05-18 11:04:48 -03:00
Pablo Hoffman	accb6ed830	dump stats to log by default (ie. change default value of STATS_DUMP to True)	2011-05-17 22:42:05 -03:00
Pablo Hoffman	7f97259ba7	added w3lib to requirements, in installation guide	2011-05-01 11:14:57 -03:00
Pablo Hoffman	4a83167698	fixed small doc typo	2011-04-30 01:35:30 -03:00
Pablo Hoffman	bb2b67c862	updated tutorial to use 'dmoz' as the name of the spider instead of 'dmoz.org', so that it's more similar to the dirbot example project	2011-04-28 09:31:57 -03:00
Pablo Hoffman	bf73002428	removed googledir example, replaced by dirbot project on github. updated docs accordingly	2011-04-28 02:28:39 -03:00
Pablo Hoffman	b12dd76bb8	Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12	2011-04-25 09:31:18 -03:00
Pablo Hoffman	678f08bc1b	added warning about using 'parse' as callback in crawl spider rules	2011-04-25 09:30:42 -03:00
Pablo Hoffman	ad496eb3b6	Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12	2011-04-14 12:36:27 -03:00
Pablo Hoffman	ecb4f44cbc	Added clarification on how to work with local settings and scrapy deploy	2011-04-14 12:36:09 -03:00
Pablo Hoffman	3ee2c94e93	Improved cookies middleware by making COOKIES_DEBUG nicer and documenting it	2011-04-06 14:54:48 -03:00
Pablo Hoffman	8a5c08a6bc	added join_multivalued parameter to CsvItemExporter	2011-03-24 13:15:52 -03:00
Pablo Hoffman	3954e600ca	added DBM storage backend for HTTP cache	2011-03-23 21:32:02 -03:00

1 2 3 4 5 ...

496 Commits