Pablo Hoffman
a3697421c0
some minor updates to documentation
2011-08-11 09:19:59 -03:00
Pablo Hoffman
5da6ffb57b
Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12
2011-08-11 09:11:19 -03:00
Pablo Hoffman
bc2d2183e9
fixed import in doc
2011-08-11 09:11:08 -03:00
Pablo Hoffman
19e6da59d8
added new downloader middleware: ChunkedTransferMiddleware
2011-08-09 03:03:25 -03:00
Pablo Hoffman
4db2a592e5
Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12
2011-08-09 01:42:11 -03:00
Pablo Hoffman
09af0866c7
scrapy.utils.python: fixed bug introduced when adding support for new IPython 0.11. refs #335
2011-08-09 01:41:48 -03:00
Pablo Hoffman
415933a1f6
scrapy.utils.python: fixed bug introduced when adding support for new IPython 0.11. refs #335
2011-08-09 01:38:31 -03:00
Pablo Hoffman
061132ce88
fixed backwards compatibility with images/media pipeline, after crawler singleton removal in r2758
2011-08-08 17:16:32 -03:00
Pablo Hoffman
2108517ce0
removed support for passing more than a single spider on 'scrapy crawl' command
2011-08-08 15:51:22 -03:00
Daniel Grana
436ad63930
support s3 signing on pre and post boto v2.0
--HG--
extra : rebase_source : 1d8cd5dfceeaf63975c46014b100d70f6ed36147
2011-08-08 14:29:28 -03:00
Pablo Hoffman
c64123cc63
proper fix to what r2760 is supposed to fix
2011-08-08 15:09:43 -03:00
Pablo Hoffman
984be35461
Some telnet console changes:
* renamed manager alias to crawler
* added aliases: spider, slot
* fixed est() function
2011-08-08 15:01:08 -03:00
Pablo Hoffman
f03af7874d
fixed bug in scheduler 'has_pending_requests' method which prevented spiders from closing properly in some cases
2011-08-08 14:52:54 -03:00
Daniel Grana
c35a7519c0
Correctly handle query parameters on s3:// urls
2011-08-08 13:23:45 -03:00
Pablo Hoffman
5c63b2307f
Another step towards singleton removal: deprecated crawler singleton import (from scrapy.project import crawler) in favor of a new class method that extensions can implement to receive the crawler
2011-08-08 11:42:44 -03:00
Pablo Hoffman
0eaa1d95f6
replaced DeprecationWarning by a new ScrapyDeprecationWarning category, since the default DeprecationWarning is silenced on Python 2.7+
2011-08-08 10:39:53 -03:00
Pablo Hoffman
f7c0aeccc6
added note about engine_started signal
2011-08-07 03:57:09 -03:00
Pablo Hoffman
a2b0737a1d
scrapy.utils.sitemap: added one more case of parsing invalid sitemaps
2011-08-07 03:24:32 -03:00
Pablo Hoffman
cea0dae1b2
scrapy.utils.sitemap: added support for parsing sitemaps with wrong namespaces, found in some bogus websites
2011-08-07 03:13:55 -03:00
Pablo Hoffman
259dccaf58
moved module scrapy.core.downloader.responsetypes to scrapy.responsetypes
--HG--
rename : scrapy/core/downloader/responsetypes/mime.types => scrapy/mime.types
rename : scrapy/core/downloader/responsetypes/__init__.py => scrapy/responsetypes.py
2011-08-07 02:49:57 -03:00
Pablo Hoffman
9f60c27612
added setting to support disabling DNS cache: DNSCACHE_ENABLED
2011-08-05 20:41:59 -03:00
Pablo Hoffman
bb67cfd955
added MarshalDiskQueue unittests
2011-08-05 20:32:22 -03:00
Pablo Hoffman
5c938cc029
removed no longer working tests from get_engine_status()
2011-08-05 20:26:24 -03:00
Pablo Hoffman
38e193d480
MarshalDiskQueue bug fix
2011-08-05 17:06:31 -03:00
Pablo Hoffman
1ce84046d8
scheduler: bug fix to use in-memory queues when request can't be serialized by the disk-queues
2011-08-05 12:39:29 -03:00
Pablo Hoffman
76cbb6a2e6
removed wrong blocking api usage (socket.gethostbyname()) from downloader when using CONCURRENT_REQUESTS_PER_IP
2011-08-03 23:55:59 -03:00
Pablo Hoffman
ebb892e554
updated get_engine_status() after scheduler changes
2011-08-03 23:19:22 -03:00
Pablo Hoffman
cd8470b309
fixed crawlspider bug introduced after scheduler refactoring
2011-08-03 20:25:14 -03:00
Pablo Hoffman
cb95d7a5af
added marshal to formats supported by feed exports
2011-08-03 16:16:48 -03:00
Pablo Hoffman
884dc93ab7
Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12
2011-08-02 22:46:02 -03:00
Pablo Hoffman
3191a08560
fix start_python_console() to work with IPython >= 0.11. closes #335
2011-08-02 22:45:51 -03:00
Pablo Hoffman
888dac867f
fixed bug with scheduler.__len__() when disk queues are disabled
2011-08-02 15:56:32 -03:00
Pablo Hoffman
f8c9b40345
added __len__() method to scheduler
2011-08-02 15:46:35 -03:00
Pablo Hoffman
fbf0e9ef43
cleaned up lxml-based link extractors, and left one of them
--HG--
rename : scrapy/contrib/linkextractors/lxmlparser.py => scrapy/contrib/linkextractors/lxmlhtml.py
2011-08-02 15:10:25 -03:00
Pablo Hoffman
c6f29f02c4
tie request to downloader failures, so that they can be accessed from request errbacks
2011-08-02 12:02:08 -03:00
Pablo Hoffman
549725215e
Initial support for a persistent scheduler, to support pausing and resuming
crawls.
* requests are serialized (using marshal by default) and stored on disk, using
one queue per priority
* request priorities must be integers now
* breadth-first and depth-first crawling orders can now be configured
through a new DEPTH_PRIORITY setting (see doc). backwards compatibility with
SCHEDULER_ORDER was kept.
* requests that can't be serialized (for example, non-serializable callbacks)
are always kept in memory queues
* adapted crawl spider to work with persistent scheduler
2011-08-02 11:57:55 -03:00
Pablo Hoffman
6d989e3fb0
imported patch scheduler_single_spider.patch
2011-07-31 03:32:25 -03:00
Pablo Hoffman
4d1e01a4d4
removed obsolete profiling/ dir
2011-07-31 02:47:14 -03:00
Pablo Hoffman
f354a49d0f
added FAQ about preventing bots getting banned
2011-07-28 00:40:30 -03:00
Pablo Hoffman
ce38022665
restored support for download delays after downloader refactoring, also restored support for spider attributes: max_concurrent_requests and download_delay
2011-07-27 15:14:27 -03:00
Pablo Hoffman
ce7a787970
Big downloader refactoring to support real concurrency limits per domain/ip,
instead of global limits per spider which were a bit useless.
This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new
settings:
* CONCURRENT_REQUESTS
* CONCURRENT_REQUESTS_PER_DOMAIN
* CONCURRENT_REQUESTS_PER_IP (overrides per domain)
The AutoThrottle extension had to be disabled, but will be ported and
re-enabled soon.
2011-07-27 13:38:09 -03:00
Pablo Hoffman
a45dca32f5
removed deprecation warning (scheduled to be removed on Scrapy 0.11)
2011-07-27 13:21:58 -03:00
Pablo Hoffman
b47f5330fd
added fragment attribute to Link object
2011-07-27 12:28:36 -03:00
Pablo Hoffman
e9ca309b6d
minor comments adjustments
2011-07-27 03:48:38 -03:00
Pablo Hoffman
c59340150f
Added cached DNS resolver based on old caching resolver extension from scrapy.contrib.resolver. This new one is *not* an extension; it is built in and always enabled.
2011-07-27 03:45:15 -03:00
Pablo Hoffman
90b716f7e4
downloader: removed unneeded code, and some minor refactoring
2011-07-26 19:06:52 -03:00
Pablo Hoffman
549298e38d
spidermanager: more detailed error message now that scrapy crawl command will raise the exception directly
2011-07-26 19:05:29 -03:00
Pablo Hoffman
cb9c937f50
minor code rearrangement for consistency
2011-07-26 18:49:01 -03:00
Pablo Hoffman
dd020e184f
removed rather useless (and some deprecated) docstrings
2011-07-26 18:45:51 -03:00
Pablo Hoffman
70493c754d
retry middleware: added TCPTimedOutError to exceptions to retry
2011-07-25 14:52:24 -03:00