Pablo Hoffman
9c3b9f2968
fixed bug in the JSON-RPC webservice reported in https://groups.google.com/d/topic/scrapy-users/qgVBmFybNAQ/discussion . Also removed the no-longer-supported 'run' command from extras/scrapy-ws.py
2012-05-03 12:05:40 -03:00
Pablo Hoffman
abcac4fcbd
updated maintainer to scrapinghub
2012-05-02 03:25:35 -03:00
stav
86dba76d1f
fixed documentation indentation
2012-04-30 13:09:34 -05:00
Pablo Hoffman
d567d8efbe
added note to docs/topics/firebug.rst about google directory being shut down
2012-04-19 01:34:20 -03:00
stav
f1802289cd
small doc typo change to get the fork rolling
2012-04-11 12:05:39 -05:00
Pablo Hoffman
27018fced7
changed default user agent to Scrapy/0.15 (+ http://scrapy.org ) and removed no longer needed BOT_VERSION setting
2012-03-23 13:45:21 -03:00
Pablo Hoffman
8933e2f2be
added REFERER_ENABLED setting, to control referer middleware
2012-03-22 16:35:14 -03:00
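A minimal settings sketch for the new switch (REFERER_ENABLED defaults to True; the project settings module is assumed):

    # settings.py -- sketch only
    REFERER_ENABLED = False   # disable the referer middleware entirely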
Jason Yeo
da826aa13d
fixed minor mistake in Request objects documentation
2012-03-21 10:25:41 +08:00
Pablo Hoffman
175c70ad44
fixed minor defect in link extractors documentation
2012-03-20 22:56:45 -03:00
Pablo Hoffman
35fb01156e
removed some obsolete remaining code related to sqlite support in scrapy
2012-03-16 11:55:55 -03:00
Pablo Hoffman
2b16ebdc11
added minor clarification on cookiejar request meta key usage
2012-02-29 07:19:01 -02:00
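A hedged sketch of the 'cookiejar' Request.meta key in use (spider name and URLs are hypothetical): tagging requests with different jar ids keeps separate cookie sessions in one spider, and the id must be propagated to follow-up requests.

    import scrapy

    class SessionsSpider(scrapy.Spider):
        name = "sessions"

        def start_requests(self):
            for i in range(3):
                # each distinct 'cookiejar' value keeps its own cookie session
                yield scrapy.Request("http://example.com/login",
                                     meta={"cookiejar": i},
                                     dont_filter=True)

        def parse(self, response):
            # pass the same jar along so follow-up requests reuse its cookies
            yield scrapy.Request("http://example.com/account",
                                 meta={"cookiejar": response.meta["cookiejar"]},
                                 callback=self.parse_account)

        def parse_account(self, response):
            self.logger.info("got %s with jar %s",
                             response.url, response.meta["cookiejar"])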
lostsnow
5afe4f50c1
scrapyd: support binding to a specific IP address
2012-02-29 13:47:40 +08:00
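A hedged scrapyd.conf sketch using the option this adds (option name as in the scrapyd docs; the address is a placeholder):

    [scrapyd]
    # listen only on this interface instead of all interfaces
    bind_address = 192.168.1.10
    http_port    = 6800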
Pablo Hoffman
81abb45000
fixed bug in new cookiejar documentation
2012-02-28 11:08:25 -02:00
Pablo Hoffman
26c8004125
added documentation for the new cookiejar Request.meta key
2012-02-27 19:58:58 -02:00
Pablo Hoffman
7fe7c3f3b1
MemoryUsage extension: close the spiders (instead of stopping the engine) when the limit is exceeded, providing a descriptive reason for the close. Also fixed default value of MEMUSAGE_ENABLED setting to match the documentation.
2012-02-23 17:05:06 -02:00
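Illustrative settings for the MemoryUsage extension described above (the mail address is hypothetical; the extension only works on POSIX platforms):

    # settings.py sketch
    MEMUSAGE_ENABLED = True
    MEMUSAGE_LIMIT_MB = 2048                    # spiders are closed past this limit
    MEMUSAGE_NOTIFY_MAIL = ["dev@example.com"]  # optional warning notifications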
Pablo Hoffman
7b8942a648
updated StackTraceDump extension doc
2012-02-16 15:14:17 -02:00
Pablo Hoffman
0b0bce7f3c
scrapyd: added cancel.json and listjobs.json api methods to documentation
2012-01-05 11:23:25 -02:00
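A hedged Python sketch of the two API methods (host, project name and the use of the requests library are assumptions):

    import requests

    base = "http://localhost:6800"
    jobs = requests.get(base + "/listjobs.json",
                        params={"project": "myproject"}).json()
    # cancel the first running job, if any
    if jobs.get("running"):
        requests.post(base + "/cancel.json",
                      data={"project": "myproject",
                            "job": jobs["running"][0]["id"]})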
Pablo Hoffman
8f42633a94
scrapyd: added clarification about how to disable items feeds generation
2012-01-05 11:20:50 -02:00
Pablo Hoffman
dbda33efa6
scrapyd: added support for storing items by default
...
Items are stored the same way as logs, in jsonlines format.
Also renamed logs_to_keep setting to jobs_to_keep.
2012-01-03 23:08:54 -02:00
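A hedged scrapyd.conf sketch: jobs_to_keep comes from the commit above, items_dir is the related option in the scrapyd docs (leaving it empty disables item feed storage, as the previous clarification describes):

    [scrapyd]
    # where item feeds are written, in jsonlines format, one file per job
    items_dir    = /var/lib/scrapyd/items
    jobs_to_keep = 5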
Pablo Hoffman
41fd3c4f6c
doc: removed duplicated callback argument from Request.replace()
2011-12-23 15:55:46 -02:00
Pablo Hoffman
0eeff76227
fixed formatting of scrapyd doc
2011-12-20 03:18:37 -02:00
Pablo Hoffman
992af8d38f
ubuntu repos: added support for oneiric release
2011-10-25 14:26:38 -02:00
Pablo Hoffman
c38c49d56a
fixed PickleItemExporter bug, added unit test, and added pickle to supported feed export formats
2011-10-25 02:36:51 -02:00
Pablo Hoffman
8bdf288428
made scrapyd doc more version agnostic
2011-10-23 05:29:54 -02:00
Pablo Hoffman
431441cb52
updated documentation to remove references to old issue tracker and mercurial repos
2011-09-25 13:06:24 -03:00
Pablo Hoffman
ce03ccd4ec
updated documentation about DEPTH_PRIORITY and DFO/BFO crawls
2011-09-23 13:22:25 -03:00
Julien Duponchelle
b7c436343a
scrapy deploy: added support for git-based versions
2011-09-21 22:17:08 +02:00
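A hedged scrapy.cfg sketch of the feature (deploy target, URL and project are placeholders):

    [deploy:myserver]
    url = http://localhost:6800/
    project = myproject
    # build the egg version from the current git revision
    version = GIT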
Daniel Grana
5f1b1c05f8
Do not filter requests with dont_filter attribute set in OffsiteMiddleware
2011-09-08 15:18:10 -03:00
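Illustrative only: with this change, a request outside allowed_domains is no longer dropped by OffsiteMiddleware when dont_filter is set (the URL is a placeholder).

    from scrapy import Request

    req = Request("http://other-domain.example/page", dont_filter=True)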
Pablo Hoffman
bff3d31469
scrapyd: updated schedule.json response format
2011-09-04 09:29:24 -03:00
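A hedged sketch of scheduling a run and reading the response; the status/jobid shape shown is the one documented by current scrapyd, and the host, project and spider names are placeholders.

    import requests

    resp = requests.post("http://localhost:6800/schedule.json",
                         data={"project": "myproject", "spider": "somespider"})
    print(resp.json())   # e.g. {"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}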
Pablo Hoffman
a1dbc62b45
removed CONCURRENT_SPIDERS setting (use scrapyd maxproc instead)
2011-09-02 18:27:39 -03:00
Pablo Hoffman
40f7075f11
added initial documentation about suspend and resume crawls
2011-09-02 13:12:27 -03:00
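A minimal sketch, assuming the JOBDIR setting used by the resulting docs: pointing a crawl at a job directory lets it be stopped and later resumed from the persisted queues (the path is a placeholder).

    # settings.py
    JOBDIR = "crawls/somespider-1"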
Pablo Hoffman
27dd68a690
added SpiderState extension
2011-09-02 13:06:59 -03:00
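A hedged sketch of the spider.state dict that the SpiderState extension persists between runs when a job directory is configured (spider name is hypothetical):

    import scrapy

    class StatefulSpider(scrapy.Spider):
        name = "stateful"

        def parse(self, response):
            # self.state survives pause/resume cycles of the same job
            self.state["pages_seen"] = self.state.get("pages_seen", 0) + 1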
Pablo Hoffman
6a31ab667d
minor fix to doc
2011-09-01 15:08:23 -03:00
Pablo Hoffman
d98b058c21
no longer recommend using lambdas in the docs, as they're not friendly with scheduler persistence
2011-09-01 15:06:49 -03:00
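The reasoning: requests whose callbacks are lambdas cannot be serialized to the disk queues, so they lose scheduler persistence. A hedged sketch of the recommended alternative (names are hypothetical):

    import scrapy

    class NoLambdaSpider(scrapy.Spider):
        name = "no_lambda"

        def parse(self, response):
            # avoid: callback=lambda r: self.parse_item(r, "books")
            # prefer a named method, passing extra data through meta
            yield scrapy.Request(response.url, callback=self.parse_item,
                                 meta={"category": "books"}, dont_filter=True)

        def parse_item(self, response):
            self.logger.info("category: %s", response.meta["category"])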
Pablo Hoffman
76af0cdd44
updated documentation and code to use -s instead of --set option
2011-09-01 14:35:37 -03:00
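Example usage after the change (spider name and setting are placeholders):

    scrapy crawl somespider -s DOWNLOAD_DELAY=2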
Pablo Hoffman
98b68ca89d
scrapyd: documented support for passing settings to spiders in schedule.json
2011-08-27 01:31:12 -03:00
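A hedged sketch of passing a Scrapy setting when scheduling through scrapyd (the 'setting' parameter name follows the scrapyd docs; host and names are placeholders):

    import requests

    requests.post("http://localhost:6800/schedule.json",
                  data={"project": "myproject", "spider": "somespider",
                        "setting": "DOWNLOAD_DELAY=2"})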
Pablo Hoffman
5c6b0631e2
minor doc fix
2011-08-19 11:42:03 -03:00
Pablo Hoffman
9d97e73a24
fixed priority handling on the new scheduler so that it's backwards compatible (i.e. bigger priorities are higher). Also fixed a few documentation bugs related to request priorities
2011-08-19 08:26:41 -03:00
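Illustrative only: after this fix, requests with larger integer priorities are dispatched first (URLs are placeholders).

    from scrapy import Request

    urgent = Request("http://example.com/sitemap.xml", priority=10)
    normal = Request("http://example.com/page")   # default priority is 0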
Pablo Hoffman
a3697421c0
some minor updates to documentation
2011-08-11 09:19:59 -03:00
Pablo Hoffman
19e6da59d8
added new downloader middleware: ChunkedTransferMiddleware
2011-08-09 03:03:25 -03:00
Pablo Hoffman
984be35461
Some telnet console changes:
...
* renamed manager alias to crawler
* added aliases: spider, slot
* fixed est() function
2011-08-08 15:01:08 -03:00
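A hedged console sketch using the aliases above (the telnet console listens on port 6023 by default):

    $ telnet localhost 6023
    >>> spider.name                              # currently running spider
    >>> crawler.settings.get("DOWNLOAD_DELAY")   # access the crawler object
    >>> est()                                    # print a report of engine status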
Pablo Hoffman
f7c0aeccc6
added note about engine_started signal
2011-08-07 03:57:09 -03:00
Pablo Hoffman
9f60c27612
added setting to support disabling DNS cache: DNSCACHE_ENABLED
2011-08-05 20:41:59 -03:00
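Settings sketch for the new switch (it defaults to True):

    # settings.py
    DNSCACHE_ENABLED = False   # disable the in-memory DNS cache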
Pablo Hoffman
cb95d7a5af
added marshal to formats supported by feed exports
2011-08-03 16:16:48 -03:00
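A hedged settings sketch, assuming the FEED_URI/FEED_FORMAT settings of that era (the output path is a placeholder):

    # settings.py
    FEED_URI = "items.marshal"
    FEED_FORMAT = "marshal"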
Pablo Hoffman
549725215e
Initial support for a persistent scheduler, to support pausing and resuming
...
crawls.
* requests are serialized (using marshal by default) and stored on disk, using
one queue per priority
* request priorities must be integers now
* breadth-first and depth-first crawling orders can now be configured
through a new DEPTH_PRIORITY setting (see doc). Backwards compatibility with
SCHEDULER_ORDER was kept.
* requests that can't be serialized (for example, those with non-serializable
callbacks) are always kept in memory queues
* adapted the crawl spider to work with the persistent scheduler
2011-08-02 11:57:55 -03:00
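A hedged settings sketch of switching to breadth-first order via the new DEPTH_PRIORITY setting; the FIFO queue class paths follow current Scrapy docs and may differ in this release.

    # settings.py
    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
    SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"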
Pablo Hoffman
ce7a787970
Big downloader refactoring to support real concurrency limits per domain/ip,
...
instead of global limits per spider which were a bit useless.
This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new
settings:
* CONCURRENT_REQUESTS
* CONCURRENT_REQUESTS_PER_DOMAIN
* CONCURRENT_REQUESTS_PER_IP (overrides per domain)
The AutoThrottle extension had to be disabled, but will be ported and
re-enabled soon.
2011-07-27 13:38:09 -03:00
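Settings sketch for the three new limits (values are illustrative):

    # settings.py
    CONCURRENT_REQUESTS = 32              # global cap across all domains
    CONCURRENT_REQUESTS_PER_DOMAIN = 8
    CONCURRENT_REQUESTS_PER_IP = 0        # non-zero overrides the per-domain cap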
Pablo Hoffman
2ac08a713d
downloader: renamed SpiderInfo to Slot, for consistency with engine and scraper names
2011-07-22 02:06:10 -03:00
Pablo Hoffman
0e008268e1
removed SimpledbStatsCollector from scrapy code, it was moved to https://github.com/scrapinghub/scaws
2011-07-20 10:38:16 -03:00
Pablo Hoffman
84f518fc5e
More core changes:
...
* removed execution queue (replaced by newer spider queues)
* added real support for returning iterators in Spider.start_requests()
* removed support for passing urls to 'scrapy crawl' command
2011-07-15 15:18:39 -03:00
Pablo Hoffman
dbad1373f1
Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12
2011-07-13 18:44:54 -03:00