mirror of https://github.com/scrapy/scrapy.git synced 2025-02-25 22:04:05 +00:00

2901 Commits

Author SHA1 Message Date
Pablo Hoffman
5c938cc029 removed no longer working tests from get_engine_status() 2011-08-05 20:26:24 -03:00
Pablo Hoffman
38e193d480 MarshalDiskQueue bug fix 2011-08-05 17:06:31 -03:00
Pablo Hoffman
1ce84046d8 scheduler: bug fix to use in-memory queues when a request can't be serialized by the disk queues 2011-08-05 12:39:29 -03:00
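A rough, hypothetical sketch of the fallback this fix concerns (none of these names come from Scrapy's scheduler): if marshal cannot serialize a request for the disk queue, the request stays in an in-memory queue instead.

    import marshal

    def enqueue_request(disk_queue, memory_queue, request_dict):
        # Prefer the disk queue; marshal raises ValueError (or TypeError) for
        # objects it cannot serialize, e.g. a dict carrying an instance-method
        # callback, in which case the request is kept in the memory queue.
        try:
            disk_queue.append(marshal.dumps(request_dict))
        except (ValueError, TypeError):
            memory_queue.append(request_dict)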
Pablo Hoffman
76cbb6a2e6 removed wrong blocking api usage (socket.gethostbyname()) from downloader when using CONCURRENT_REQUESTS_PER_IP 2011-08-03 23:55:59 -03:00
Pablo Hoffman
ebb892e554 updated get_engine_status() after scheduler changes 2011-08-03 23:19:22 -03:00
Pablo Hoffman
cd8470b309 fixed crawlspider bug introduced after scheduler refactoring 2011-08-03 20:25:14 -03:00
Pablo Hoffman
cb95d7a5af added marshal to formats supported by feed exports 2011-08-03 16:16:48 -03:00
Pablo Hoffman
884dc93ab7 Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12 2011-08-02 22:46:02 -03:00
Pablo Hoffman
3191a08560 fix start_python_console() to work with IPython >= 0.11. closes #335 2011-08-02 22:45:51 -03:00
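As background, a minimal sketch of the kind of compatibility shim such a console helper needs after the IPython 0.11 API change; this is not Scrapy's actual scrapy.utils.console code, and the IPython import path shown is the one documented around that release.

    def start_python_console(namespace=None, banner=''):
        # Hypothetical shim, not the real Scrapy helper.
        namespace = namespace or {}
        try:
            # IPython >= 0.11 replaced IPython.Shell with an embeddable shell class.
            from IPython.frontend.terminal.embed import InteractiveShellEmbed
            InteractiveShellEmbed(banner1=banner)(local_ns=namespace)
        except ImportError:
            # IPython missing (or too old to have that module): stdlib fallback.
            import code
            code.interact(banner=banner, local=namespace)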
Pablo Hoffman
888dac867f fixed bug with scheduler.__len__() when disk queues are disabled 2011-08-02 15:56:32 -03:00
Pablo Hoffman
f8c9b40345 added __len__() method to scheduler 2011-08-02 15:46:35 -03:00
Pablo Hoffman
fbf0e9ef43 cleaned up the lxml-based link extractors, keeping only one of them
--HG--
rename : scrapy/contrib/linkextractors/lxmlparser.py => scrapy/contrib/linkextractors/lxmlhtml.py
2011-08-02 15:10:25 -03:00
Pablo Hoffman
c6f29f02c4 tie requests to downloader failures, so that they can be accessed from request errbacks 2011-08-02 12:02:08 -03:00
Pablo Hoffman
549725215e Initial support for a persistent scheduler, to allow pausing and resuming
crawls.

* requests are serialized (using marshal by default) and stored on disk, using
  one queue per priority
* request priorities must now be integers
* breadth-first and depth-first crawling orders can now be configured
  through a new DEPTH_PRIORITY setting (see doc, and the settings sketch below).
  backwards compatibility with SCHEDULER_ORDER was kept.
* requests that can't be serialized (for example, those with non-serializable
  callbacks) are always kept in memory queues
* adapted the crawl spider to work with the persistent scheduler
2011-08-02 11:57:55 -03:00
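A settings sketch of how the scheduler described above is typically configured; DEPTH_PRIORITY is the setting named in the commit, while JOBDIR and the queue-class settings use names from later Scrapy documentation and may not match this exact revision.

    # settings.py (illustrative values)
    JOBDIR = 'crawls/myspider-1'   # where serialized requests are persisted for pause/resume

    # Positive DEPTH_PRIORITY plus FIFO queues gives breadth-first order;
    # the default (LIFO queues) crawls depth-first.
    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'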
Pablo Hoffman
6d989e3fb0 imported patch scheduler_single_spider.patch 2011-07-31 03:32:25 -03:00
Pablo Hoffman
4d1e01a4d4 removed obsolete profiling/ dir 2011-07-31 02:47:14 -03:00
Pablo Hoffman
f354a49d0f added FAQ about preventing bots getting banned 2011-07-28 00:40:30 -03:00
Pablo Hoffman
ce38022665 restored support for download delays after the downloader refactoring; also restored support for the spider attributes max_concurrent_requests and download_delay 2011-07-27 15:14:27 -03:00
Pablo Hoffman
ce7a787970 Big downloader refactoring to support real concurrency limits per domain/ip,
instead of global limits per spider which were a bit useless.

This removes the CONCURRENT_REQUESTS_PER_SPIDER setting and adds three new
settings (sketched below):

* CONCURRENT_REQUESTS
* CONCURRENT_REQUESTS_PER_DOMAIN
* CONCURRENT_REQUESTS_PER_IP (overrides per domain)

The AutoThrottle extension had to be disabled, but will be ported and
re-enabled soon.
2011-07-27 13:38:09 -03:00
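The three new settings in a minimal settings.py sketch; the values are the defaults documented in later Scrapy releases and are illustrative only.

    # settings.py (illustrative values)
    CONCURRENT_REQUESTS = 16            # global cap across the whole downloader
    CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per target domain
    CONCURRENT_REQUESTS_PER_IP = 0      # 0 = disabled; a non-zero value overrides the per-domain cap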
Pablo Hoffman
a45dca32f5 removed deprecation warning (scheduled to be removed on Scrapy 0.11) 2011-07-27 13:21:58 -03:00
Pablo Hoffman
b47f5330fd added fragment attribute to Link object 2011-07-27 12:28:36 -03:00
Pablo Hoffman
e9ca309b6d minor comments adjustments 2011-07-27 03:48:38 -03:00
Pablo Hoffman
c59340150f Added cached DNS resolver based on the old caching resolver extension from scrapy.contrib.resolver. The new one is *not* an extension; it comes built in and is always enabled. 2011-07-27 03:45:15 -03:00
Pablo Hoffman
90b716f7e4 downloader: removed unneeded code, and some minor refactoring 2011-07-26 19:06:52 -03:00
Pablo Hoffman
549298e38d spidermanager: more detailed error message now that scrapy crawl command will raise the exception directly 2011-07-26 19:05:29 -03:00
Pablo Hoffman
cb9c937f50 minor code rearrangement for consistency 2011-07-26 18:49:01 -03:00
Pablo Hoffman
dd020e184f removed rather useless (and some deprecated) docstrings 2011-07-26 18:45:51 -03:00
Pablo Hoffman
70493c754d retry middleware: added TCPTimedOutError to exceptions to retry 2011-07-25 14:52:24 -03:00
Pablo Hoffman
6f0e492390 fixed bug with scraper KeyErrors on some ConnectionLost errors. closes #334 2011-07-25 12:24:26 -03:00
Pablo Hoffman
6e50f94406 engine: make it more explicit that we don't need to return the value of nextcall.schedule() 2011-07-25 10:47:54 -03:00
Pablo Hoffman
ea3bf6d95d more core refactoring, including moving the engine's next-request call logic to a separate class 2011-07-25 10:46:00 -03:00
Pablo Hoffman
209ecdf471 updated settings.py for djangoitem tests to new django multi-db format 2011-07-25 00:49:54 -03:00
Pablo Hoffman
2ac08a713d downloader: renamed SpiderInfo to Slot, for consistency with engine and scraper names 2011-07-22 02:06:10 -03:00
Pablo Hoffman
d6b83fee3e scraper: renamed SpiderInfo to Slot, for consistency with engine names 2011-07-22 02:01:05 -03:00
Pablo Hoffman
f19442425a forked UnicodeDammit from BeautifulSoup to explicitly disable usage of chardet library 2011-07-20 17:41:53 -03:00
Pablo Hoffman
7d18fe18e2 added missing import 2011-07-20 17:05:21 -03:00
Pablo Hoffman
0e008268e1 removed SimpledbStatsCollector from scrapy code, it was moved to https://github.com/scrapinghub/scaws 2011-07-20 10:38:16 -03:00
Pablo Hoffman
b6b0a54d9f removed FAQ entry 2011-07-20 01:31:36 -03:00
Pablo Hoffman
cc6ef3beb2 engine: renamed slot.requests to slot.start_requests 2011-07-20 01:18:34 -03:00
Pablo Hoffman
de0bf22010 speed up consumption of spider start requests, while the engine has capacity to process them 2011-07-20 01:18:00 -03:00
Pablo Hoffman
9f742fc97c removed unused import from 'crawl' spider template 2011-07-20 01:04:16 -03:00
Pablo Hoffman
e3f640c7bf added FAQ entry about scrapy deploy issue on Mac + Python 2.5 2011-07-19 19:53:32 -03:00
Pablo Hoffman
75e2c3eb33 moved spider queues to scrapyd
--HG--
rename : scrapy/spiderqueue.py => scrapyd/spiderqueue.py
rename : scrapy/tests/test_spiderqueue.py => scrapyd/tests/test_spiderqueue.py
2011-07-19 19:39:27 -03:00
Pablo Hoffman
d97d6d20c6 removed no longer used settings 2011-07-19 19:31:19 -03:00
Pablo Hoffman
442c0bdc18 removed SQSSpiderQueue from base scrapy code, it was moved to https://github.com/scrapinghub/scaws 2011-07-19 14:11:55 -03:00
Daniel Grana
bdd627fe1d allow overriding store_uri by extending ImagePipeline
--HG--
extra : rebase_source : 5c561b8282f733ab0f26607059dd96d858154426
2011-07-15 15:17:38 -03:00
Pablo Hoffman
84f518fc5e More core changes:
* removed execution queue (replaced by newer spider queues)
* added real support for returning iterators in Spider.start_requests() (see the sketch below)
* removed support for passing urls to 'scrapy crawl' command
2011-07-15 15:18:39 -03:00
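A hypothetical spider (not part of this changeset) showing what returning an iterator from Spider.start_requests() enables: requests are yielded lazily, so the engine can start downloading before the whole sequence has been built. The BaseSpider import path is the one used by Scrapy around this time.

    from scrapy.http import Request
    from scrapy.spider import BaseSpider  # spider base class in this Scrapy era


    class ExampleSpider(BaseSpider):
        name = 'example'

        def start_requests(self):
            # A generator: each request is produced on demand instead of
            # materializing one big list up front.
            for n in range(1, 1001):
                yield Request('http://www.example.com/page/%d' % n,
                              callback=self.parse)

        def parse(self, response):
            pass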
Daniel Grana
4dadeb7ccb fix issue with responses preventing spiders from going idle in engine counts 2011-07-15 13:57:24 -03:00
Pablo Hoffman
d207c0afe4 fixed bug in engine.download() method 2011-07-15 12:55:07 -03:00
Pablo Hoffman
830255eea3 removed deprecated commands: queue, runserver 2011-07-14 01:41:24 -03:00