scrapy

mirror of https://github.com/scrapy/scrapy.git synced 2025-02-27 23:24:01 +00:00

Author	SHA1	Message	Date
Pablo Hoffman	a3697421c0	some minor updates to documentation	2011-08-11 09:19:59 -03:00
Pablo Hoffman	5da6ffb57b	Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12	2011-08-11 09:11:19 -03:00
Pablo Hoffman	bc2d2183e9	fixed import in doc	2011-08-11 09:11:08 -03:00
Pablo Hoffman	19e6da59d8	added new downloader middleware: ChunkedTransferMiddleware	2011-08-09 03:03:25 -03:00
Pablo Hoffman	4db2a592e5	Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12	2011-08-09 01:42:11 -03:00
Pablo Hoffman	09af0866c7	scrapy.utils.python: fixed bug introduced when adding support for new IPython 0.11. refs #335	2011-08-09 01:41:48 -03:00
Pablo Hoffman	415933a1f6	scrapy.utils.python: fixed bug introduced when adding support for new IPython 0.11. refs #335	2011-08-09 01:38:31 -03:00
Pablo Hoffman	061132ce88	fixed backwards compatibility with images/media pipeline, after crawler singleton removal in r2758	2011-08-08 17:16:32 -03:00
Pablo Hoffman	2108517ce0	removed support for passing more than a single spider on 'scrapy crawl' command	2011-08-08 15:51:22 -03:00
Daniel Grana	436ad63930	support s3 signing on pre and post boto v2.0 --HG-- extra : rebase_source : 1d8cd5dfceeaf63975c46014b100d70f6ed36147	2011-08-08 14:29:28 -03:00
Pablo Hoffman	c64123cc63	proper fix to what r2760 is supposed to fix	2011-08-08 15:09:43 -03:00
Pablo Hoffman	984be35461	Some telnet console changes: * renamed manager alias to crawler * added aliases: spider, slot * fixed est() function	2011-08-08 15:01:08 -03:00
Pablo Hoffman	f03af7874d	fixed bug in scheduler 'has_pending_requests' method which prevented spiders from closing properly in some cases	2011-08-08 14:52:54 -03:00
Daniel Grana	c35a7519c0	Correctly handle query parameters on s3:// urls	2011-08-08 13:23:45 -03:00
Pablo Hoffman	5c63b2307f	Another step towards singleton removal: deprecated crawler singleton import (from scrapy.project import crawler) by a new class method that extensions can implement to receive the crawler	2011-08-08 11:42:44 -03:00
Pablo Hoffman	0eaa1d95f6	replaced DeprecationWarning by a new ScrapyDeprecationWarning category, since the default DeprecationWarning is silenced on Python 2.7+	2011-08-08 10:39:53 -03:00
Pablo Hoffman	f7c0aeccc6	added note about engine_started signal	2011-08-07 03:57:09 -03:00
Pablo Hoffman	a2b0737a1d	scrapy.utils.sitemap: added one more case of parsing invalid sitemaps	2011-08-07 03:24:32 -03:00
Pablo Hoffman	cea0dae1b2	scrapy.utils.sitemap: added support for parsing sitemaps with wrong namespaces, found in some bogus websites	2011-08-07 03:13:55 -03:00
Pablo Hoffman	259dccaf58	moved module scrapy.core.downloader.responsetypes to scrapy.responsetypes --HG-- rename : scrapy/core/downloader/responsetypes/mime.types => scrapy/mime.types rename : scrapy/core/downloader/responsetypes/__init__.py => scrapy/responsetypes.py	2011-08-07 02:49:57 -03:00
Pablo Hoffman	9f60c27612	added setting to support disabling DNS cache: DNSCACHE_ENABLED	2011-08-05 20:41:59 -03:00
Pablo Hoffman	bb67cfd955	added MarshalDiskQueue unittests	2011-08-05 20:32:22 -03:00
Pablo Hoffman	5c938cc029	removed no longer working tests from get_engine_status()	2011-08-05 20:26:24 -03:00
Pablo Hoffman	38e193d480	MarshalDiskQueue bug fix	2011-08-05 17:06:31 -03:00
Pablo Hoffman	1ce84046d8	scheduler: bug fix to use in-memory queues when request can't be serialized by the disk-queues	2011-08-05 12:39:29 -03:00
Pablo Hoffman	76cbb6a2e6	removed wrong blocking api usage (socket.gethostbyname()) from downloader when using CONCURRENT_REQUESTS_PER_IP	2011-08-03 23:55:59 -03:00
Pablo Hoffman	ebb892e554	updated get_engine_status() after scheduler changes	2011-08-03 23:19:22 -03:00
Pablo Hoffman	cd8470b309	fixed crawlspider bug introduced after scheduler refactoring	2011-08-03 20:25:14 -03:00
Pablo Hoffman	cb95d7a5af	added marshal to formats supported by feed exports	2011-08-03 16:16:48 -03:00
Pablo Hoffman	884dc93ab7	Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12	2011-08-02 22:46:02 -03:00
Pablo Hoffman	3191a08560	fix start_python_console() to work with IPython >= 0.11. closes #335	2011-08-02 22:45:51 -03:00
Pablo Hoffman	888dac867f	fixed bug with scheduler.__len__() when disk queues are disabled	2011-08-02 15:56:32 -03:00
Pablo Hoffman	f8c9b40345	added __len__() method to scheduler	2011-08-02 15:46:35 -03:00
Pablo Hoffman	fbf0e9ef43	cleaned up lxml-based link extractors, and left one of them --HG-- rename : scrapy/contrib/linkextractors/lxmlparser.py => scrapy/contrib/linkextractors/lxmlhtml.py	2011-08-02 15:10:25 -03:00
Pablo Hoffman	c6f29f02c4	tie request to downloader failures, so that they can be accessed from request errbacks	2011-08-02 12:02:08 -03:00
Pablo Hoffman	549725215e	Initial support for a persistent scheduler, to support pausing and resuming crawls. * requests are serialized (using marshal by default) and stored on disk, using one queue per priority * request priorities must be integers now * breadth-first and depth-first crawling orders can now be configured through a new DEPTH_PRIORITY setting (see doc). backwards compatibility with SCHEDULER_ORDER was kept. * requests that can't be serialized (for example, non serializable callbacks) are always kept in memory queues * adapted crawl spider to work with persistent scheduler	2011-08-02 11:57:55 -03:00
Pablo Hoffman	6d989e3fb0	imported patch scheduler_single_spider.patch	2011-07-31 03:32:25 -03:00
Pablo Hoffman	4d1e01a4d4	removed obsolete profiling/ dir	2011-07-31 02:47:14 -03:00
Pablo Hoffman	f354a49d0f	added FAQ about preventing bots getting banned	2011-07-28 00:40:30 -03:00
Pablo Hoffman	ce38022665	restored support for download delays after downloader refactoring, also restored support for spider attributes: max_concurrent_requests and download_delay	2011-07-27 15:14:27 -03:00
Pablo Hoffman	ce7a787970	Big downloader refactoring to support real concurrency limits per domain/ip, instead of global limits per spider which were a bit useless. This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new settings: * CONCURRENT_REQUESTS * CONCURRENT_REQUESTS_PER_DOMAIN * CONCURRENT_REQUESTS_PER_IP (overrides per domain) The AutoThrottle extension had to be disabled, but will be ported and re-enabled soon.	2011-07-27 13:38:09 -03:00
Pablo Hoffman	a45dca32f5	removed deprecation warning (scheduled to be removed on Scrapy 0.11)	2011-07-27 13:21:58 -03:00
Pablo Hoffman	b47f5330fd	added fragment attribute to Link object	2011-07-27 12:28:36 -03:00
Pablo Hoffman	e9ca309b6d	minor comments adjustments	2011-07-27 03:48:38 -03:00
Pablo Hoffman	c59340150f	Added cached DNS resolver based on old caching resolver extension from scrapy.contrib.resolver. This new one is not an extension, it comes builtin and always enabled.	2011-07-27 03:45:15 -03:00
Pablo Hoffman	90b716f7e4	downloader: removed unneeded code, and some minor refactoring	2011-07-26 19:06:52 -03:00
Pablo Hoffman	549298e38d	spidermanager: more detailed error message now that scrapy crawl command will raise the exception directly	2011-07-26 19:05:29 -03:00
Pablo Hoffman	cb9c937f50	minor code rearrangement for consistency	2011-07-26 18:49:01 -03:00
Pablo Hoffman	dd020e184f	removed rather useless (and some deprecated) docstrings	2011-07-26 18:45:51 -03:00
Pablo Hoffman	70493c754d	retry middleware: added TCPTimedOutError to exceptions to retry	2011-07-25 14:52:24 -03:00

2823 Commits