Pablo Hoffman
ce03ccd4ec
updated documentation about DEPTH_PRIORITY and DFO/BFO crawls
2011-09-23 13:22:25 -03:00
Pablo Hoffman
f850a44784
Some changes to persistent scheduler after some initial usage feedback:
* added LIFO queues, in addition to the original FIFO queues
* use LIFO queues (instead of FIFO queues) by default, since they resemble DFO
better which is a more convenient crawling order for most cases
* do not adjust the priority based on depth by default (DEPTH_PRIORITY = 0)
If someone does need to use strict BFO order, it can be done by setting:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
2011-09-23 13:03:07 -03:00
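The relationship between queue type and crawl order described in the commit above can be illustrated with a minimal sketch. It is not Scrapy's scheduler: the link graph `LINKS` and the function `crawl_order` are hypothetical, and the sketch ignores `DEPTH_PRIORITY` entirely. It only shows that a LIFO pending queue yields a depth-first-like order, while a FIFO queue yields a breadth-first order.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}


def crawl_order(start, lifo):
    """Return the visit order using a LIFO (lifo=True) or FIFO pending queue."""
    pending = deque([start])
    seen = {start}
    order = []
    while pending:
        # LIFO takes the most recently queued page, FIFO the oldest one.
        page = pending.pop() if lifo else pending.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return order


print(crawl_order("A", lifo=True))   # ['A', 'C', 'F', 'B', 'E', 'D']
print(crawl_order("A", lifo=False))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

With the LIFO queue the crawl follows one branch to its end before returning to siblings; with the FIFO queue it finishes each depth level before moving to the next.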
Pablo Hoffman
cfddc314ce
make SpiderState extension always available, regardless of whether there is a job dir, and make sure it always sets the spider.state attribute, for consistency between in-memory and on-disk runs
2011-09-23 12:56:44 -03:00
Pablo Hoffman
2559f211a8
Merge pull request #42 from noplay/deploy-git-version
scrapy deploy support git version
2011-09-22 14:42:41 -07:00
Julien Duponchelle
b7c436343a
scrapy deploy support git version
2011-09-21 22:17:08 +02:00
Pablo Hoffman
0f0783e525
Merge pull request #39 from kalessin/master
autothrottle fix
2011-09-21 13:11:28 -07:00
Martin Olveyra
dcc50201e3
fix autothrottle extension for working with new downloader
2011-09-21 17:07:21 -03:00
Pablo Hoffman
b003687389
moved rpm-install.sh to extras/
2011-09-18 06:08:28 -03:00
Daniel Graña
a5c15004f9
Merge branch '0.12'
2011-09-18 01:08:04 -03:00
Daniel Graña
fac5e5ea21
migrate hgignore to gitignore
2011-09-18 01:05:31 -03:00
Pablo Hoffman
d788ba8a44
new mechanism to override settings in scrapy commands before the Crawler object is available
2011-09-15 13:27:01 -03:00
Pablo Hoffman
77ffaa50e5
Merge pull request #38 from kalessin/master
_extract_links require extra parameter base_url
2011-09-14 07:58:43 -07:00
Martin Olveyra
509b05db57
_extract_links requires extra parameter base_url in order to avoid exception when called from superclass method
2011-09-14 10:58:09 -03:00
Pablo Hoffman
2fcb7097bd
removed documentation header notifying about other documentation versions, as that's provided by readthedocs already
2011-09-14 02:41:01 -03:00
Pablo Hoffman
ab1c9cfc56
removed documentation header notifying about other documentation versions, as that's provided by readthedocs already
2011-09-14 02:39:32 -03:00
Pablo Hoffman
3b00b9cb12
added support for generating version from git revision
2011-09-11 11:24:12 -03:00
Pablo Hoffman
43ae7bdd89
added tests for SpiderState extension
2011-09-11 08:27:05 -03:00
Pablo Hoffman
1e43afeaea
added support for generating version from git revision, and use it in extras/makedeb.py
2011-09-09 03:03:46 -03:00
Daniel Grana
5f1b1c05f8
Do not filter requests with dont_filter attribute set in OffsiteMiddleware
2011-09-08 15:18:10 -03:00
Pablo Hoffman
bff3d31469
scrapyd: updated schedule.json response format
2011-09-04 09:29:24 -03:00
Pablo Hoffman
17cc90e3fe
added unittest for SpiderState extension
2011-09-04 08:58:23 -03:00
Pablo Hoffman
e0ec239930
restored support for spider.DOWNLOAD_DELAY attribute, with deprecation warning
2011-09-04 08:39:57 -03:00
Pablo Hoffman
c8d30c6ffa
replaced use of deprecated w3lib.url.urljoin_rfc by stdlib urlparse.urljoin
2011-09-02 19:09:21 -03:00
Pablo Hoffman
a1dbc62b45
removed CONCURRENT_SPIDERS setting (use scrapyd maxproc instead)
2011-09-02 18:27:39 -03:00
Pablo Hoffman
40f7075f11
added initial documentation about suspend and resume crawls
2011-09-02 13:12:27 -03:00
Pablo Hoffman
27dd68a690
added SpiderState extension
2011-09-02 13:06:59 -03:00
Pablo Hoffman
c382f2fc8a
fixed subtle bug in disk-based priority queues caused by serialization errors, and added tests
2011-09-02 09:40:52 -03:00
Pablo Hoffman
cca0b91000
add setting to enable logging when unserializable requests are found
2011-09-01 19:40:44 -03:00
Pablo Hoffman
789e1493e9
PickleDiskQueue: use pickle protocol 2
2011-09-01 15:12:13 -03:00
Pablo Hoffman
6a31ab667d
minor fix to doc
2011-09-01 15:08:23 -03:00
Pablo Hoffman
d98b058c21
no longer recommend using lambdas in the doc, as they're not friendly with scheduler persistence
2011-09-01 15:06:49 -03:00
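One likely reason lambdas clash with scheduler persistence is that the standard `pickle` module cannot serialize them. The sketch below uses only the standard library and does not reproduce Scrapy's own request serialization; the helper `is_picklable` and the sample values are hypothetical.

```python
import pickle


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj, protocol=2)
        return True
    except Exception:
        return False


# A lambda has no importable name, so pickle cannot store a reference to it.
callback_as_lambda = lambda response: response

# Plain data, such as a callback referred to by its name, serializes fine.
request_as_data = {"url": "http://example.com", "callback": "parse_item"}

print(is_picklable(callback_as_lambda))
print(is_picklable(request_as_data))
```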
Pablo Hoffman
725362fdeb
remove redundant code
2011-09-01 14:58:50 -03:00
Pablo Hoffman
76af0cdd44
updated documentation and code to use -s instead of --set option
2011-09-01 14:35:37 -03:00
Pablo Hoffman
46edfd4a9d
remove unneeded code to simplify
2011-09-01 14:29:11 -03:00
Pablo Hoffman
edefb8ac69
scrapy tool: added -s alias for --set option
2011-09-01 14:27:47 -03:00
Pablo Hoffman
75284015b5
persistent scheduler: use pickle (instead of marshal) as the default serialization format, to support serializing more objects out of the box. also removed __slots__ from Request/Response objects to make them serializable by default.
2011-09-01 14:27:29 -03:00
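The difference in coverage between `marshal` and `pickle` mentioned in the commit above can be seen with the standard library alone. This is a minimal sketch, not the persistent scheduler's code; the helper `marshal_ok` and the sample `value` are hypothetical.

```python
import datetime
import marshal
import pickle

# A value containing a type that marshal does not support.
value = {"seen_at": datetime.date(2011, 9, 1)}


def marshal_ok(obj):
    """Return True if marshal can serialize obj."""
    try:
        marshal.dumps(obj)
        return True
    except ValueError:
        return False


# pickle handles the same value; protocol 2 is used here for illustration.
restored = pickle.loads(pickle.dumps(value, protocol=2))

print(marshal_ok(value))
print(restored == value)
```

`marshal` is limited to a fixed set of built-in types, whereas `pickle` covers most Python objects, at the cost of a less compact format.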
Daniel Grana
f1210aed0b
ignore *egg-info added by pip install -e
2011-08-29 15:01:18 -03:00
Pablo Hoffman
accac332e3
adapted test-scrapyd.sh to be compatible with older versions of mktemp, and to not hang forever if the spider doesn't run for some reason
2011-08-27 01:43:32 -03:00
Pablo Hoffman
98b68ca89d
scrapyd: documented support for passing setting to spiders in schedule.json
2011-08-27 01:31:12 -03:00
Pablo Hoffman
6d6cff33ca
added scrapyd system test script to extras/test-scrapyd.sh
2011-08-27 01:23:36 -03:00
Pablo Hoffman
91b9d89ffd
moved scrapy.utils.sqlite to scrapyd.sqlite
--HG--
rename : scrapy/utils/sqlite.py => scrapyd/sqlite.py
rename : scrapy/tests/test_utils_sqlite.py => scrapyd/tests/test_sqlite.py
2011-08-27 01:20:57 -03:00
Pablo Hoffman
e1aff779da
removed (barely used) spider context extension, to drop dependencies with sqlite
2011-08-27 01:03:56 -03:00
Pablo Hoffman
075a2d62d3
scrapyd: added support for passing custom settings to schedule.json
2011-08-27 01:02:14 -03:00
Pablo Hoffman
ce08504853
removed class method from_settings from ISpiderManager interface
2011-08-26 09:24:01 -03:00
Pablo Hoffman
47cae5fa35
fixed unittest broken by previous commit
2011-08-24 11:31:52 -03:00
Pablo Hoffman
669b98c4fc
pass close reason to close() method of new DupeFilter
2011-08-24 11:26:35 -03:00
Pablo Hoffman
5c6b0631e2
minor doc fix
2011-08-19 11:42:03 -03:00
Pablo Hoffman
9d97e73a24
fixed priority handling on the new scheduler so that it's backwards compatible (ie. bigger priorities are higher). also fixed a few documentation bugs related to requests priority
2011-08-19 08:26:41 -03:00
Pablo Hoffman
ee40aa1223
added from_crawler class method to SpiderManager
2011-08-16 11:16:35 -03:00
Pablo Hoffman
a3697421c0
some minor updates to documentation
2011-08-11 09:19:59 -03:00