mirror of https://github.com/scrapy/scrapy.git synced 2025-02-26 23:04:29 +00:00

3297 Commits

Author SHA1 Message Date
olveyra
b39cb22d83 don't discard slot when empty; save it in another dict so it can be recycled if needed again.
This fix avoids continuously creating new slots in certain cases, a bug that prevents download_delay and max_concurrent_requests from working properly.

The problem arises when the slot for a given domain becomes empty before further requests for that domain have been created by the spider. This is typical when the spider creates requests one by one, or when it makes requests to multiple domains and one or more of them are created at a rate slow enough that the slot is empty each time a response is fetched.

The effect is that a new slot is created for each request under such conditions, so download_delay and max_concurrent_requests do not take effect (applying them depends on an already existing slot for that domain).
2012-04-02 20:34:57 +00:00
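The fix described in this commit can be sketched as follows. This is a minimal, hypothetical model (the class and method names are illustrative, not Scrapy's actual downloader code): a drained slot is parked in a second dict and reused if the domain comes back, so its delay/concurrency state is not lost.

```python
class Slot:
    """Per-domain download state (delay and concurrency limits)."""
    def __init__(self, delay, concurrency):
        self.delay = delay              # download_delay for this domain
        self.concurrency = concurrency  # max_concurrent_requests for this domain
        self.active = set()             # requests currently in flight

class Downloader:
    def __init__(self, delay=1.0, concurrency=8):
        self.slots = {}     # domain -> Slot, for domains with in-flight requests
        self.inactive = {}  # drained slots parked here for recycling
        self.delay = delay
        self.concurrency = concurrency

    def get_slot(self, domain):
        if domain not in self.slots:
            # recycle a previously drained slot instead of creating a new
            # one, so its state survives idle periods between requests
            self.slots[domain] = (self.inactive.pop(domain, None)
                                  or Slot(self.delay, self.concurrency))
        return self.slots[domain]

    def release(self, domain, request):
        slot = self.slots[domain]
        slot.active.discard(request)
        if not slot.active:
            # slot drained: park it instead of discarding it
            self.inactive[domain] = self.slots.pop(domain)
```

With the old behavior (discarding the slot), the second `get_slot` call for the same domain would build a fresh `Slot`; here it returns the recycled one.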
Pablo Hoffman
e9184def35 make selector re() method use re.UNICODE flag to compile regexes 2012-04-01 00:41:03 -03:00
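As an illustration of the flag this commit adds: on Python 2, `\w` in a `str` pattern only matches ASCII word characters, so `re.UNICODE` is needed for non-ASCII text; on Python 3 this behavior is already the default for `str` patterns, so the flag is a no-op there.

```python
import re

# With re.UNICODE, \w matches full Unicode word characters,
# not just [a-zA-Z0-9_]
pattern = re.compile(r'\w+', re.UNICODE)
print(pattern.findall('café niño'))  # ['café', 'niño']
```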
Pablo Hoffman
27018fced7 changed default user agent to Scrapy/0.15 (+http://scrapy.org) and removed no longer needed BOT_VERSION setting 2012-03-23 13:45:21 -03:00
Pablo Hoffman
731c569b5c fixed test-scrapyd.sh script after changes on insophia website 2012-03-22 16:38:28 -03:00
Pablo Hoffman
8933e2f2be added REFERER_ENABLED setting, to control referer middleware 2012-03-22 16:35:14 -03:00
Pablo Hoffman
eed34e88cd Merge pull request #103 from jsyeo/patch-1
fixed minor mistake in Request objects documentation
2012-03-20 19:49:31 -07:00
Jason Yeo
da826aa13d fixed minor mistake in Request objects documentation 2012-03-21 10:25:41 +08:00
Pablo Hoffman
175c70ad44 fixed minor defect in link extractors documentation 2012-03-20 22:56:45 -03:00
Pablo Hoffman
056a7c53d0 added artwork files properly now 2012-03-20 10:46:45 -03:00
Pablo Hoffman
aef70e8394 removed wrongly added artwork files 2012-03-20 10:45:48 -03:00
Pablo Hoffman
bcd8520f8d added sep directory with Scrapy Enhancement Proposal imported from old Trac site 2012-03-20 10:15:00 -03:00
Pablo Hoffman
c0141d154e added artwork directory (data taken from old Trac) 2012-03-20 10:14:11 -03:00
Pablo Hoffman
35fb01156e removed some obsolete remaining code related to sqlite support in scrapy 2012-03-16 11:55:55 -03:00
Pablo Hoffman
838e1dcce9 updated FormRequest tests to use HtmlResponse instead of Response, as it makes more sense 2012-03-15 11:47:02 -03:00
Pablo Hoffman
b6ae266546 Removed (very old and possibly broken) backwards compatibility support for Twisted 2.5 2012-03-15 00:28:24 -03:00
Pablo Hoffman
9fddc73ed8 removed backwards compatibility code for old scrapy versions 2012-03-06 05:42:09 -02:00
Pablo Hoffman
9a508d4638 Removed deprecated setting: CLOSESPIDER_ITEMPASSED 2012-03-06 05:26:57 -02:00
Pablo Hoffman
8b83177655 Added CLOSESPIDER_ERRORCOUNT to scrapy/default_settings.py 2012-03-06 05:26:57 -02:00
Pablo Hoffman
9006227358 bumped required python-w3lib version in debian/control 2012-03-05 20:25:38 -02:00
Daniel Graña
2909a60e95 test that default start_request return value type is a generator. refs #98 2012-03-05 17:53:20 -02:00
Pablo Hoffman
45685ea6cd Restored scrapy.utils.py26 module for backwards compatibility, with a deprecation message. This is needed because the module was used a lot by users and the change causes too much trouble 2012-03-05 17:15:49 -02:00
Daniel Graña
cc6e297062 Merge pull request #98 from kalessin/start_requests
This will break any spider that extends `start_requests` and expects a `list` as the return value.

On the other side:

* [Docs](http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.BaseSpider.start_requests) say that the return value is an **iterable**, not a list
* Scrapy core already supports consuming the start_requests generator on demand, so we can avoid problems like #47
* it allows extensions to change the starting requests on the `spider_opened` signal
2012-03-05 08:51:22 -08:00
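The change merged above can be sketched like this. `Request` here is a minimal stand-in for `scrapy.http.Request`, so the snippet is self-contained:

```python
import types

class Request:
    """Stand-in for scrapy.http.Request."""
    def __init__(self, url):
        self.url = url

class MySpider:
    start_urls = ['http://example.com/1', 'http://example.com/2']

    def start_requests(self):
        # a generator: each Request is built only when the engine pulls
        # it, so code hooked on the spider_opened signal can still
        # modify start_urls before any request exists
        for url in self.start_urls:
            yield Request(url)

spider = MySpider()
reqs = spider.start_requests()
print(isinstance(reqs, types.GeneratorType))  # True
```

Because the generator is consumed on demand, a spider with a huge `start_urls` also no longer materializes every request up front.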
Martin Olveyra
f6179a927e replace list with generator also in the start_requests method of Sitemap spider
2012-03-05 14:25:12 -02:00
Martin Olveyra
cc7fc33833 change start_requests to return a generator instead of a list, in order
to allow modifying start_urls when triggered by the spider_opened signal
2012-03-05 12:49:17 -02:00
Pablo Hoffman
e521da2e2f Dropped support for Python 2.5. See: http://blog.scrapy.org/scrapy-dropping-support-for-python-25 2012-03-01 08:18:12 -02:00
Pablo Hoffman
8eb0b11f8a removed unused import 2012-02-29 17:40:30 -02:00
Pablo Hoffman
5c329b6514 Merge pull request #97 from scrapy/w3lib_encoding
Ported scrapy to use w3lib.encoding
2012-02-29 01:45:59 -08:00
Pablo Hoffman
de3a3b68dc bumped required w3lib version to 1.1, after refactoring encoding detection to use the new w3lib.encoding module 2012-02-29 07:44:22 -02:00
Pablo Hoffman
2b16ebdc11 added minor clarification on cookiejar request meta key usage 2012-02-29 07:19:01 -02:00
Pablo Hoffman
61df6b4691 Merge pull request #51 from lostsnow/master
scrapyd: support binding to a specific IP address
2012-02-28 23:49:56 -08:00
lostsnow
5afe4f50c1 scrapyd: support binding to a specific IP address 2012-02-29 13:47:40 +08:00
Daniel Graña
798169805a Adapt response encoding detection to pass test cases 2012-02-28 14:32:55 -02:00
Pablo Hoffman
81abb45000 fixed bug in new cookiejar documentation 2012-02-28 11:08:25 -02:00
Pablo Hoffman
26c8004125 added documentation for the new cookiejar Request.meta key 2012-02-27 19:58:58 -02:00
Pablo Hoffman
44d6da82fd Merge pull request #96 from kalessin/cookiesmultijar
allow working with multiple cookie jars on the same spider
2012-02-27 13:48:43 -08:00
olveyra
c093ac5ec6 allow working with multiple cookie jars on the same spider 2012-02-27 18:03:48 +00:00
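The multi-jar feature added here works by tagging each request with a `cookiejar` key in `Request.meta`; requests sharing the same value share a jar, so a spider can run several independent "sessions" against the same site. A hedged sketch (`Request` below is a stand-in for `scrapy.http.Request`, and `requests_with_sessions` is a hypothetical helper):

```python
class Request:
    """Stand-in for scrapy.http.Request."""
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

def requests_with_sessions(url, n_sessions):
    # one request per session; the cookies middleware keeps a separate
    # cookie jar per distinct 'cookiejar' meta value
    return [Request(url, meta={'cookiejar': i}) for i in range(n_sessions)]

reqs = requests_with_sessions('http://example.com/login', 3)
print([r.meta['cookiejar'] for r in reqs])  # [0, 1, 2]
```

Follow-up requests must carry the same `cookiejar` value forward in their own `meta`, since the key is not inherited automatically.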
Pablo Hoffman
4ed1a03521 Merge pull request #95 from scrapy/openmobilealliance-mimetype
Handle the standard mimetype defined by the Open Mobile Alliance as HTML
2012-02-24 10:28:32 -08:00
Daniel Graña
049f315ff4 Handle the standard mimetype defined by the Open Mobile Alliance as HTML 2012-02-24 16:16:35 -02:00
Pablo Hoffman
b1f011d740 use netloc instead of hostname in url_is_from_any_domain(). closes #50 2012-02-24 02:09:02 -02:00
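The distinction behind this commit: `urlparse`'s `hostname` lowercases the host and strips the port, while `netloc` keeps the `host:port` form, so matching on netloc lets domain patterns that include a port work too. A simplified stdlib-only sketch (an assumption, not Scrapy's actual `url_is_from_any_domain` implementation):

```python
from urllib.parse import urlparse

def url_is_from_any_domain(url, domains):
    # netloc keeps "host:port", unlike urlparse(...).hostname
    host = urlparse(url).netloc.lower()
    return any(host == d.lower() or host.endswith('.' + d.lower())
               for d in domains)

print(url_is_from_any_domain('http://www.example.com/page', ['example.com']))      # True
print(url_is_from_any_domain('http://example.com:8080/', ['example.com:8080']))    # True
```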
Daniel Graña
08d2c2b9ee Merge branch 'GH92-image-buf-threading' 2012-02-23 19:25:28 -02:00
Daniel Graña
2dbf2a38a2 move buffer pointing to start of file before computing checksum. refs #92 2012-02-23 19:23:37 -02:00
Pablo Hoffman
e0de5f3eab Merge pull request #93 from dangra/GH92-image-buf-threading
compute image checksum before persisting images
2012-02-23 13:21:10 -08:00
Daniel Graña
3286ce4f42 Compute image checksum before persisting images. closes #92
Avoids a threading issue when accessing the buffer
2012-02-23 19:17:21 -02:00
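The buffer rewind from the related commit above matters because after an image is written into an in-memory buffer, the file pointer sits at the end; hashing without seeking back would checksum an empty read. A minimal stdlib illustration of the pattern (function name is illustrative):

```python
import hashlib
from io import BytesIO

def image_checksum(buf):
    buf.seek(0)  # move buffer pointer to start of file before reading
    return hashlib.md5(buf.read()).hexdigest()

buf = BytesIO()
buf.write(b'fake image bytes')  # pointer is now at the end of the buffer
print(image_checksum(buf))      # hashes the full content, not an empty read
```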
Pablo Hoffman
52483c55cd Merge pull request #94 from dangra/mediapipeline-cache-failures
remove as much information as possible from cached failure
2012-02-23 13:11:29 -08:00
Daniel Graña
5c73a0b1c1 remove leaking references in cached failures 2012-02-23 19:08:36 -02:00
Pablo Hoffman
e312a88582 MemoryUsage: use resident memory size (instead of virtual) for tracking memory usage 2012-02-23 17:42:02 -02:00
Pablo Hoffman
7fe7c3f3b1 MemoryUsage extension: close the spiders (instead of stopping the engine) when the limit is exceeded, providing a descriptive reason for the close. Also fixed default value of MEMUSAGE_ENABLED setting to match the documentation. 2012-02-23 17:05:06 -02:00
Pablo Hoffman
c476681c06 ported code to use w3lib.encoding (work in progress, many tests still failing) 2012-02-21 21:31:19 -02:00
Pablo Hoffman
6769b92493 Merge branch 'master' of github.com:scrapy/scrapy 2012-02-19 05:59:30 -02:00
Pablo Hoffman
0939106872 fixed bug in MemoryUsage extension: get_engine_status() takes exactly 1 argument (0 given) 2012-02-19 05:59:21 -02:00