mirror of https://github.com/scrapy/scrapy.git synced 2025-02-26 07:03:49 +00:00

2735 Commits

Author SHA1 Message Date
Pablo Hoffman
f354a49d0f added FAQ about preventing bots getting banned 2011-07-28 00:40:30 -03:00
Pablo Hoffman
ce38022665 restored support for download delays after downloader refactoring, also restored support for spider attributes: max_concurrent_requests and download_delay 2011-07-27 15:14:27 -03:00
Pablo Hoffman
ce7a787970 Big downloader refactoring to support real concurrency limits per domain/ip,
instead of global limits per spider which were a bit useless.

This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new
settings:

* CONCURRENT_REQUESTS
* CONCURRENT_REQUESTS_PER_DOMAIN
* CONCURRENT_REQUESTS_PER_IP (overrides per domain)

The AutoThrottle extension had to be disabled, but will be ported and
re-enabled soon.
2011-07-27 13:38:09 -03:00
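The three settings named in commit ce7a787970 could be set in a project's settings.py roughly as follows. This is a minimal sketch: the numeric values are illustrative and not taken from the commit, the "0 disables the per-IP limit" convention is an assumption based on later Scrapy documentation, and effective_slot_limit is a hypothetical helper (not part of Scrapy) that only illustrates the "overrides per domain" note.

```python
# settings.py sketch; values are illustrative, not taken from the commit.
CONCURRENT_REQUESTS = 16            # global cap on simultaneous downloads
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap applied to each domain
CONCURRENT_REQUESTS_PER_IP = 0      # assumed: 0 leaves the per-IP limit off


def effective_slot_limit(per_domain: int, per_ip: int) -> int:
    """Hypothetical helper, not a Scrapy API.

    Returns the per-IP limit when it is non-zero, otherwise the
    per-domain limit, mirroring the "overrides per domain" note.
    """
    return per_ip if per_ip else per_domain


if __name__ == "__main__":
    limit = effective_slot_limit(CONCURRENT_REQUESTS_PER_DOMAIN,
                                 CONCURRENT_REQUESTS_PER_IP)
    print("effective per-slot limit:", limit)
```

With these values the per-domain limit applies; setting CONCURRENT_REQUESTS_PER_IP to a non-zero value would make that value the effective limit instead.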
Pablo Hoffman
a45dca32f5 removed deprecation warning (scheduled to be removed on Scrapy 0.11) 2011-07-27 13:21:58 -03:00
Pablo Hoffman
b47f5330fd added fragment attribute to Link object 2011-07-27 12:28:36 -03:00
Pablo Hoffman
e9ca309b6d minor comments adjustments 2011-07-27 03:48:38 -03:00
Pablo Hoffman
c59340150f Added cached DNS resolver based on old caching resolver extension from scrapy.contrib.resolver. This new one is *not* an extension, it comes builtin and always enabled. 2011-07-27 03:45:15 -03:00
Pablo Hoffman
90b716f7e4 downloader: removed unneeded code, and some minor refactoring 2011-07-26 19:06:52 -03:00
Pablo Hoffman
549298e38d spidermanager: more detailed error message now that scrapy crawl command will raise the exception directly 2011-07-26 19:05:29 -03:00
Pablo Hoffman
cb9c937f50 minor code rearrangement for consistency 2011-07-26 18:49:01 -03:00
Pablo Hoffman
dd020e184f removed rather useless (and some deprecated) docstrings 2011-07-26 18:45:51 -03:00
Pablo Hoffman
70493c754d retry middleware: added TCPTimedOutError to exceptions to retry 2011-07-25 14:52:24 -03:00
Pablo Hoffman
6f0e492390 fixed bug with scraper KeyError's on some ConnectionLost errors. closes #334 2011-07-25 12:24:26 -03:00
Pablo Hoffman
6e50f94406 engine: make it more explicit that we don't need to return the value of nextcall.schedule() 2011-07-25 10:47:54 -03:00
Pablo Hoffman
ea3bf6d95d more core refactoring including moving engine next request call logic to a separate class 2011-07-25 10:46:00 -03:00
Pablo Hoffman
209ecdf471 updated settings.py for djangoitem tests to new django multi-db format 2011-07-25 00:49:54 -03:00
Pablo Hoffman
2ac08a713d downloader: renamed SpiderInfo to Slot, for consistency with engine and scraper names 2011-07-22 02:06:10 -03:00
Pablo Hoffman
d6b83fee3e scraper: renamed SpiderInfo to Slot, for consistency with engine names 2011-07-22 02:01:05 -03:00
Pablo Hoffman
f19442425a forked UnicodeDammit from BeautifulSoup to explicitly disable usage of chardet library 2011-07-20 17:41:53 -03:00
Pablo Hoffman
7d18fe18e2 added missing import 2011-07-20 17:05:21 -03:00
Pablo Hoffman
0e008268e1 removed SimpledbStatsCollector from scrapy code, it was moved to https://github.com/scrapinghub/scaws 2011-07-20 10:38:16 -03:00
Pablo Hoffman
b6b0a54d9f removed FAQ entry 2011-07-20 01:31:36 -03:00
Pablo Hoffman
cc6ef3beb2 engine: renamed slot.requests to slot.start_requests 2011-07-20 01:18:34 -03:00
Pablo Hoffman
de0bf22010 speed up consumption of spider start requests, while the engine has capacity to process them 2011-07-20 01:18:00 -03:00
Pablo Hoffman
9f742fc97c removed unused import from 'crawl' spider template 2011-07-20 01:04:16 -03:00
Pablo Hoffman
e3f640c7bf added FAQ entry about scrapy deploy issue on Mac + Python 2.5 2011-07-19 19:53:32 -03:00
Pablo Hoffman
75e2c3eb33 moved spider queues to scrapyd
--HG--
rename : scrapy/spiderqueue.py => scrapyd/spiderqueue.py
rename : scrapy/tests/test_spiderqueue.py => scrapyd/tests/test_spiderqueue.py
2011-07-19 19:39:27 -03:00
Pablo Hoffman
d97d6d20c6 removed no longer used settings 2011-07-19 19:31:19 -03:00
Pablo Hoffman
442c0bdc18 removed SQSSpiderQueue from base scrapy code, it was moved to https://github.com/scrapinghub/scaws 2011-07-19 14:11:55 -03:00
Daniel Grana
bdd627fe1d allow overriding store_uri by extending ImagePipeline
--HG--
extra : rebase_source : 5c561b8282f733ab0f26607059dd96d858154426
2011-07-15 15:17:38 -03:00
Pablo Hoffman
84f518fc5e More core changes:
* removed execution queue (replaced by newer spider queues)
* added real support for returning iterators in Spider.start_requests()
* removed support for passing urls to 'scrapy crawl' command
2011-07-15 15:18:39 -03:00
Daniel Grana
4dadeb7ccb fix issue with responses preventing spiders from being idle in engine counts 2011-07-15 13:57:24 -03:00
Pablo Hoffman
d207c0afe4 fixed bug in engine.download() method 2011-07-15 12:55:07 -03:00
Pablo Hoffman
830255eea3 removed deprecated commands: queue, runserver 2011-07-14 01:41:24 -03:00
Pablo Hoffman
359129adf9 fixed python pass handling in cmdline/commands tests so that it works with new w3lib library 2011-07-14 01:40:31 -03:00
Pablo Hoffman
dbad1373f1 Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12 2011-07-13 18:44:54 -03:00
Pablo Hoffman
18cb4ff1d8 added natty to list of supported Ubuntu distros 2011-07-13 18:43:52 -03:00
Pablo Hoffman
39a2ea97c8 redirect mw: added REDIRECT_ENABLED setting and documented the other settings 2011-07-13 14:18:15 -03:00
Pablo Hoffman
0b6c7ce9b8 improved download errors propagation to the spiders, and removed no longer needed code to simplify 2011-07-13 14:10:05 -03:00
Pablo Hoffman
804c0279ec setup.py: only add lxml requirement if libxml2 is not available 2011-07-13 13:04:42 -03:00
Pablo Hoffman
541ed3913b retry middleware: added RETRY_ENABLED setting and documented the other settings more properly, also improved messages when no longer retrying requests 2011-07-13 11:55:05 -03:00
Pablo Hoffman
763f3dc628 minor update to doc 2011-07-12 19:56:39 -03:00
Pablo Hoffman
bfda9ec319 added clarification about scrapy versioning including the recently adopted odd/even versioning scheme
--HG--
rename : docs/api-stability.rst => docs/versioning.rst
2011-07-12 19:53:23 -03:00
Pablo Hoffman
4fde1ef94d added CloseSpider exception, to manually close spiders 2011-07-12 14:24:10 -03:00
Pablo Hoffman
4bb409923c improved encoding detection by adding support for HTML5 meta charset 2011-07-12 09:52:50 -03:00
Pablo Hoffman
67213ce673 logformatter: support non-ascii characters in custom implementations of Item.__str__() 2011-07-12 01:16:06 -03:00
Pablo Hoffman
31a375bde7 Close the scheduler after closing the scraper and downloader. This shouldn't have any real effect in practice, but it feels more appropriate to close the components in this order 2011-07-10 04:18:50 -03:00
Pablo Hoffman
90b1ae694c get_engine_status(): preserve test order defined in code 2011-07-10 04:10:20 -03:00
Pablo Hoffman
409aaade0b Refactored close spider behaviour so that the engine now waits for all
downloading (and enqueued for download) requests to finish and their responses
to be processed in the scraper/spiders, before closing the spider.

This will be required in the future to avoid losing requests when we add
scheduler persistence and it's also a more correct behaviour overall.

The closing process has also been refactored to remove unneeded closing state
from downloader and leave it only in the engine.

Finally, some unused methods have been removed too, like spider_is_open() for
engine and scheduler.
2011-07-08 11:40:19 -03:00
Pablo Hoffman
574b070bb4 fixed minor bug in sitemap parser 2011-07-08 09:33:56 -03:00