scrapy

mirror of https://github.com/scrapy/scrapy.git synced 2025-02-26 07:03:49 +00:00

Author	SHA1	Message	Date
Pablo Hoffman	f354a49d0f	added FAQ about preventing bots getting banned	2011-07-28 00:40:30 -03:00
Pablo Hoffman	ce38022665	restored support for download delays after downloader refactoring, also restored support for spider attributes: max_concurrent_requests and download_delay	2011-07-27 15:14:27 -03:00
Pablo Hoffman	ce7a787970	Big downloader refactoring to support real concurrency limits per domain/ip, instead of global limits per spider which were a bit useless. This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new settings: * CONCURRENT_REQUESTS * CONCURRENT_REQUESTS_PER_DOMAIN * CONCURRENT_REQUESTS_PER_IP (overrides per domain) The AutoThrottle extension had to be disabled, but will be ported and re-enabled soon.	2011-07-27 13:38:09 -03:00
Pablo Hoffman	a45dca32f5	removed deprecation warning (scheduled to be removed on Scrapy 0.11)	2011-07-27 13:21:58 -03:00
Pablo Hoffman	b47f5330fd	added fragment attribute to Link object	2011-07-27 12:28:36 -03:00
Pablo Hoffman	e9ca309b6d	minor comments adjustments	2011-07-27 03:48:38 -03:00
Pablo Hoffman	c59340150f	Added cached DNS resolver based on old caching resolver extension from scrapy.contrib.resolver. This new one is not an extension, it comes builtin and always enabled.	2011-07-27 03:45:15 -03:00
Pablo Hoffman	90b716f7e4	downloader: removed unneeded code, and some minor refactoring	2011-07-26 19:06:52 -03:00
Pablo Hoffman	549298e38d	spidermanager: more detailed error message now that scrapy crawl command will raise the exception directly	2011-07-26 19:05:29 -03:00
Pablo Hoffman	cb9c937f50	minor code rearrangement for consistency	2011-07-26 18:49:01 -03:00
Pablo Hoffman	dd020e184f	removed rather useless (and some deprecated) docstrings	2011-07-26 18:45:51 -03:00
Pablo Hoffman	70493c754d	retry middleware: added TCPTimedOutError to exceptions to retry	2011-07-25 14:52:24 -03:00
Pablo Hoffman	6f0e492390	fixed bug with scraper KeyError's on some ConnectionLost errors. closes #334	2011-07-25 12:24:26 -03:00
Pablo Hoffman	6e50f94406	engine: make it more explicit that we don't need to return the value of nextcall.schedule()	2011-07-25 10:47:54 -03:00
Pablo Hoffman	ea3bf6d95d	more core refactoring including moving engine next request call logic to a separate class	2011-07-25 10:46:00 -03:00
Pablo Hoffman	209ecdf471	updated settings.py for djangoitem tests to new django multi-db format	2011-07-25 00:49:54 -03:00
Pablo Hoffman	2ac08a713d	downloader: renamed SpiderInfo to Slot, for consistency with engine and scraper names	2011-07-22 02:06:10 -03:00
Pablo Hoffman	d6b83fee3e	scraper: renamed SpiderInfo to Slot, for consistency with engine names	2011-07-22 02:01:05 -03:00
Pablo Hoffman	f19442425a	forked UnicodeDammit from BeautifulSoup to explicitly disable usage of chardet library	2011-07-20 17:41:53 -03:00
Pablo Hoffman	7d18fe18e2	added missing import	2011-07-20 17:05:21 -03:00
Pablo Hoffman	0e008268e1	removed SimpledbStatsCollector from scrapy code, it was moved to https://github.com/scrapinghub/scaws	2011-07-20 10:38:16 -03:00
Pablo Hoffman	b6b0a54d9f	removed FAQ entry	2011-07-20 01:31:36 -03:00
Pablo Hoffman	cc6ef3beb2	engine: renamed slot.requests to slot.start_requests	2011-07-20 01:18:34 -03:00
Pablo Hoffman	de0bf22010	speed up consumption of spider start requests, while the engine has capacity to process them	2011-07-20 01:18:00 -03:00
Pablo Hoffman	9f742fc97c	removed unused import from 'crawl' spider template	2011-07-20 01:04:16 -03:00
Pablo Hoffman	e3f640c7bf	added FAQ entry about scrapy deploy issue on Mac + Python 2.5	2011-07-19 19:53:32 -03:00
Pablo Hoffman	75e2c3eb33	moved spider queues to scrapyd --HG-- rename : scrapy/spiderqueue.py => scrapyd/spiderqueue.py rename : scrapy/tests/test_spiderqueue.py => scrapyd/tests/test_spiderqueue.py	2011-07-19 19:39:27 -03:00
Pablo Hoffman	d97d6d20c6	removed no longer used settings	2011-07-19 19:31:19 -03:00
Pablo Hoffman	442c0bdc18	removed SQSSpiderQueue from base scrapy code, it was moved to https://github.com/scrapinghub/scaws	2011-07-19 14:11:55 -03:00
Daniel Grana	bdd627fe1d	allow overriding store_uri by extending ImagePipeline --HG-- extra : rebase_source : 5c561b8282f733ab0f26607059dd96d858154426	2011-07-15 15:17:38 -03:00
Pablo Hoffman	84f518fc5e	More core changes: * removed execution queue (replaced by newer spider queues) * added real support for returning iterators in Spider.start_requests() * removed support for passing urls to 'scrapy crawl' command	2011-07-15 15:18:39 -03:00
Daniel Grana	4dadeb7ccb	fix issue with responses preventing spiders to be idle in engine counts	2011-07-15 13:57:24 -03:00
Pablo Hoffman	d207c0afe4	fixed bug in engine.download() method	2011-07-15 12:55:07 -03:00
Pablo Hoffman	830255eea3	removed deprecated commands: queue, runserver	2011-07-14 01:41:24 -03:00
Pablo Hoffman	359129adf9	fixed python pass handling in cmdline/commands tests so that it works with new w3lib library	2011-07-14 01:40:31 -03:00
Pablo Hoffman	dbad1373f1	Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12	2011-07-13 18:44:54 -03:00
Pablo Hoffman	18cb4ff1d8	added natty to list of supported ubuntu distros	2011-07-13 18:43:52 -03:00
Pablo Hoffman	39a2ea97c8	redirect mw: added REDIRECT_ENABLED setting and documented the other settings	2011-07-13 14:18:15 -03:00
Pablo Hoffman	0b6c7ce9b8	improved download errors propagation to the spiders, and removed no longer needed code to simplify	2011-07-13 14:10:05 -03:00
Pablo Hoffman	804c0279ec	setup.py: only add lxml requirement if libxml2 is not available	2011-07-13 13:04:42 -03:00
Pablo Hoffman	541ed3913b	retry middleware: added RETRY_ENABLED setting and documented the other settings more properly, also improved messages when no longer retrying requests	2011-07-13 11:55:05 -03:00
Pablo Hoffman	763f3dc628	minor update to doc	2011-07-12 19:56:39 -03:00
Pablo Hoffman	bfda9ec319	added clarification about scrapy versioning including the recently adopted odd/even versioning scheme --HG-- rename : docs/api-stability.rst => docs/versioning.rst	2011-07-12 19:53:23 -03:00
Pablo Hoffman	4fde1ef94d	added CloseSpider exception, to manually close spiders	2011-07-12 14:24:10 -03:00
Pablo Hoffman	4bb409923c	improved encoding detection by adding support for HTML5 meta charset	2011-07-12 09:52:50 -03:00
Pablo Hoffman	67213ce673	logformatter: support non-ascii characters in custom implementations of Item.__str__()	2011-07-12 01:16:06 -03:00
Pablo Hoffman	31a375bde7	Close the scheduler after closing the scraper and downloader. This shouldn't have any real effect in practice, but it feels more appropriate to close the components in this order	2011-07-10 04:18:50 -03:00
Pablo Hoffman	90b1ae694c	get_engine_status(): preserve test order defined in code	2011-07-10 04:10:20 -03:00
Pablo Hoffman	409aaade0b	Refactored close spider behaviour so that the engine now waits for all downloading (and enqueued for download) requests to finish and their responses to be processed in the scraper/spiders, before closing the spider. This will be required in the future to avoid losing requests when we add scheduler persistence and it's also a more correct behaviour overall. The closing process has also been refactored to remove unneeded closing state from downloader and leave it only in the engine. Finally, some unused methods have been removed too, like spider_is_open() for engine and scheduler.	2011-07-08 11:40:19 -03:00
Pablo Hoffman	574b070bb4	fixed minor bug in sitemap parser	2011-07-08 09:33:56 -03:00


2735 Commits
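
Commit ce7a787970 above replaces CONCURRENT_REQUESTS_PER_SPIDER with three settings and notes that CONCURRENT_REQUESTS_PER_IP overrides the per-domain limit. A minimal sketch of that override rule follows. It is not the downloader's actual code: the numeric values and the helper `effective_slot_limit` are hypothetical, and it assumes that a non-zero per-IP value is what triggers the override, which the commit message does not spell out.

```python
# Illustrative values only; none of these numbers come from the commit log.
SETTINGS = {
    "CONCURRENT_REQUESTS": 16,            # global cap across all downloads
    "CONCURRENT_REQUESTS_PER_DOMAIN": 8,  # cap per target domain
    "CONCURRENT_REQUESTS_PER_IP": 0,      # cap per target IP; 0 assumed to mean "unset"
}


def effective_slot_limit(settings):
    """Hypothetical helper: return (key_type, limit) for a download slot.

    Assumption: when the per-IP setting is non-zero it takes precedence
    and requests are grouped by IP; otherwise they are grouped by domain.
    """
    per_ip = settings.get("CONCURRENT_REQUESTS_PER_IP", 0)
    if per_ip:
        return ("ip", per_ip)
    return ("domain", settings["CONCURRENT_REQUESTS_PER_DOMAIN"])


if __name__ == "__main__":
    print(effective_slot_limit(SETTINGS))
    print(effective_slot_limit(dict(SETTINGS, CONCURRENT_REQUESTS_PER_IP=2)))
```

The helper returns the grouping key alongside the limit because, under this assumption, switching to the per-IP setting changes how requests are grouped as well as how many are allowed at once.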