scrapy

mirror of https://github.com/scrapy/scrapy.git synced 2025-02-26 00:03:57 +00:00

Author	SHA1	Message	Date
Pablo Hoffman	bff3d31469	scrapyd: updated schedule.json response format	2011-09-04 09:29:24 -03:00
Pablo Hoffman	a1dbc62b45	removed CONCURRENT_SPIDERS setting (use scrapyd maxproc instead)	2011-09-02 18:27:39 -03:00
Pablo Hoffman	40f7075f11	added initial documentation about suspend and resume crawls	2011-09-02 13:12:27 -03:00
Pablo Hoffman	27dd68a690	added SpiderState extension	2011-09-02 13:06:59 -03:00
Pablo Hoffman	6a31ab667d	minor fix to doc	2011-09-01 15:08:23 -03:00
Pablo Hoffman	d98b058c21	no longer recommend using lambdas in the doc, as they're not friendly with scheduler persistence	2011-09-01 15:06:49 -03:00
Pablo Hoffman	76af0cdd44	updated documentation and code to use -s instead of --set option	2011-09-01 14:35:37 -03:00
Pablo Hoffman	98b68ca89d	scrapyd: documented support for passing setting to spiders in schedule.json	2011-08-27 01:31:12 -03:00
Pablo Hoffman	5c6b0631e2	minor doc fix	2011-08-19 11:42:03 -03:00
Pablo Hoffman	9d97e73a24	fixed priority handling on the new scheduler so that it's backwards compatible (ie. bigger priorities are higher). also fixed a few documentation bugs related to requests priority	2011-08-19 08:26:41 -03:00
Pablo Hoffman	a3697421c0	some minor updates to documentation	2011-08-11 09:19:59 -03:00
Pablo Hoffman	5da6ffb57b	Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12	2011-08-11 09:11:19 -03:00
Pablo Hoffman	bc2d2183e9	fixed import in doc	2011-08-11 09:11:08 -03:00
Pablo Hoffman	19e6da59d8	added new downloader middleware: ChunkedTransferMiddleware	2011-08-09 03:03:25 -03:00
Pablo Hoffman	984be35461	Some telnet console changes: * renamed manager alias to crawler * added aliases: spider, slot * fixed est() function	2011-08-08 15:01:08 -03:00
Pablo Hoffman	f7c0aeccc6	added note about engine_started signal	2011-08-07 03:57:09 -03:00
Pablo Hoffman	9f60c27612	added setting to support disabling DNS cache: DNSCACHE_ENABLED	2011-08-05 20:41:59 -03:00
Pablo Hoffman	cb95d7a5af	added marshal to formats supported by feed exports	2011-08-03 16:16:48 -03:00
Pablo Hoffman	549725215e	Initial support for a persistent scheduler, to support pausing and resuming crawls. * requests are serialized (using marshal by default) and stored on disk, using one queue per priority * request priorities must be integers now * breadth-first and depth-first crawling orders can now be configured through a new DEPTH_PRIORITY setting (see doc). backwards compatibility with SCHEDULER_ORDER was kept. * requests that can't be serialized (for example, non serializable callbacks) are always kept in memory queues * adapted crawl spider to work with persistent scheduler	2011-08-02 11:57:55 -03:00
Pablo Hoffman	f354a49d0f	added FAQ about preventing bots getting banned	2011-07-28 00:40:30 -03:00
Pablo Hoffman	ce7a787970	Big downloader refactoring to support real concurrency limits per domain/ip, instead of global limits per spider which were a bit useless. This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new settings: * CONCURRENT_REQUESTS * CONCURRENT_REQUESTS_PER_DOMAIN * CONCURRENT_REQUESTS_PER_IP (overrides per domain) The AutoThrottle extension had to be disabled, but will be ported and re-enabled soon.	2011-07-27 13:38:09 -03:00
Pablo Hoffman	c59340150f	Added cached DNS resolver based on old caching resolver extension from scrapy.contrib.resolver. This new one is not an extension, it comes builtin and always enabled.	2011-07-27 03:45:15 -03:00
Pablo Hoffman	2ac08a713d	downloader: renamed SpiderInfo to Slot, for consistency with engine and scraper names	2011-07-22 02:06:10 -03:00
Pablo Hoffman	0e008268e1	removed SimpledbStatsCollector from scrapy code, it was moved to https://github.com/scrapinghub/scaws	2011-07-20 10:38:16 -03:00
Pablo Hoffman	b6b0a54d9f	removed FAQ entry	2011-07-20 01:31:36 -03:00
Pablo Hoffman	e3f640c7bf	added FAQ entry about scrapy deploy issue on Mac + Python 2.5	2011-07-19 19:53:32 -03:00
Pablo Hoffman	84f518fc5e	More core changes: * removed execution queue (replaced by newer spider queues) * added real support for returning iterators in Spider.start_requests() * removed support for passing urls to 'scrapy crawl' command	2011-07-15 15:18:39 -03:00
Pablo Hoffman	dbad1373f1	Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12	2011-07-13 18:44:54 -03:00
Pablo Hoffman	18cb4ff1d8	added natty to list of supported ubuntu distros	2011-07-13 18:43:52 -03:00
Pablo Hoffman	39a2ea97c8	redirect mw: added REDIRECT_ENABLED setting and documented the other settings	2011-07-13 14:18:15 -03:00
Pablo Hoffman	541ed3913b	retry middleware: added RETRY_ENABLED setting and documented the other settings more properly, also improved messages when no longer retrying requests	2011-07-13 11:55:05 -03:00
Pablo Hoffman	763f3dc628	minor update to doc	2011-07-12 19:56:39 -03:00
Pablo Hoffman	bfda9ec319	added clarification about scrapy versioning including the recently adopted odd/even versioning scheme --HG-- rename : docs/api-stability.rst => docs/versioning.rst	2011-07-12 19:53:23 -03:00
Pablo Hoffman	4fde1ef94d	added CloseSpider exception, to manually close spiders	2011-07-12 14:24:10 -03:00
Pablo Hoffman	db5cae7c03	SitemapSpider: added support for filtering which sitemaps to follow (patch contributed by Rolando Espinoza). closes #330	2011-06-23 18:18:29 -03:00
Pablo Hoffman	57c43fdce6	added SitemapSpider, with tests and doc	2011-06-15 11:54:34 -03:00
Pablo Hoffman	91dc46539f	added LogStats extension for periodically logging basic stats (like crawled pages and scraped items)	2011-06-14 00:50:05 -03:00
Pablo Hoffman	841e9913db	renamed CLOSESPIDER_ITEMPASSED setting to CLOSESPIDER_ITEMCOUNT, to follow the refactoring done in r2630	2011-06-13 16:58:51 -03:00
Pablo Hoffman	474cba512c	simplified MemoryDebugger extension to use stats for dumping memory debugging info	2011-06-06 03:13:28 -03:00
Pablo Hoffman	5fbc32c015	call stats collector engine_stopped() after the engine is closed (to make sure all data from extensions has been collected), and added that method to documented api	2011-06-06 03:12:40 -03:00
Pablo Hoffman	9d9c8877da	added 'scrapy edit' command	2011-06-05 22:02:56 -03:00
Pablo Hoffman	03ae481cad	removed experimental crawlspider v2	2011-06-03 18:23:23 -03:00
Pablo Hoffman	5bf733b6f6	Changed default representation of items to pretty-printed dicts. This improves default logging by making log more readable in the default case, for both Scraped and Dropped lines. Projects can still customize how items are represented by overriding the item's __str__ method, as usual.	2011-06-03 01:13:01 -03:00
Pablo Hoffman	1bc2339bb8	Merged item passed and item scraped concepts, as they have often proved confusing in the past. This means: * original item_scraped signal was removed * original item_passed signal was renamed to item_scraped * old log lines "Scraped Item..." removed * old log lines "Passed Item..." renamed to "Scraped Item..."	2011-06-03 01:13:00 -03:00
Pablo Hoffman	e6091df551	fixed doc typo	2011-05-30 09:04:31 -03:00
Pablo Hoffman	1d98fc8fb5	added spider_error signal	2011-05-29 22:38:17 -03:00
Pablo Hoffman	2fa0f75f2d	added COOKIES_ENABLED setting to support disabling the cookies middleware	2011-05-27 00:35:34 -03:00
Pablo Hoffman	d72d3f4607	stack trace dump extension: also dump engine status, and support triggering it with SIGQUIT, besides SIGUSR2	2011-05-20 03:25:00 -03:00
Pablo Hoffman	951ba507f9	Removed support for default values in Scrapy items, which have proven confusing in the past	2011-05-19 21:42:46 -03:00
Pablo Hoffman	503f302010	removed remaining references to scheduler middleware from doc, as it will be removed on next release	2011-05-18 19:48:48 -03:00


962 Commits