Pablo Hoffman
3d823d6f45
simplified CoreStats extension
2011-06-05 19:57:38 -03:00
Pablo Hoffman
61cc95df7c
removed crawlspider v2 tests
2011-06-03 18:26:17 -03:00
Pablo Hoffman
03ae481cad
removed experimental crawlspider v2
2011-06-03 18:23:23 -03:00
Pablo Hoffman
5bf733b6f6
Changed default representation of items to pretty-printed dicts. This improves
default logging by making the log more readable in the default case, for both Scraped and Dropped lines.
Projects can still customize how items are represented by overriding the item's __str__ method, as usual.
2011-06-03 01:13:01 -03:00
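A minimal sketch of the idea behind commit 5bf733b6f6 above: the default string form of an item is a pretty-printed dict, and a project can override `__str__` to change it. `PrettyItem` and `CompactItem` are hypothetical stand-ins built on a plain `dict`. They are not Scrapy's actual `Item` class, and the real implementation may differ.

```python
from pprint import pformat


class PrettyItem(dict):
    """Hypothetical item type (not scrapy.item.Item) with a pretty-printed default."""

    def __str__(self):
        # Default representation: the fields as a pretty-printed dict.
        return pformat(dict(self))


class CompactItem(PrettyItem):
    """Hypothetical project-specific item that customizes its log representation."""

    def __str__(self):
        # Overriding __str__ replaces the pretty-printed default.
        return "<CompactItem name=%r>" % self.get("name")


if __name__ == "__main__":
    print(PrettyItem(name="a"))
    print(CompactItem(name="a"))
```

Anything that formats the item with `str()`, such as a log line, would pick up the overridden form.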
Pablo Hoffman
1bc2339bb8
Merged item passed and item scraped concepts, as they have often proved
confusing in the past.
This means:
* original item_scraped signal was removed
* original item_passed signal was renamed to item_scraped
* old log lines "Scraped Item..." removed
* old log lines "Passed Item..." renamed to "Scraped Item..."
2011-06-03 01:13:00 -03:00
Pablo Hoffman
e6091df551
fixed doc typo
2011-05-30 09:04:31 -03:00
Pablo Hoffman
1d98fc8fb5
added spider_error signal
2011-05-29 22:38:17 -03:00
Pablo Hoffman
13d8066788
removed undocumented (and untested) extension: SpiderCloseDelay
2011-05-27 11:52:33 -03:00
Pablo Hoffman
6c369c50ca
removed support for spider.dont_throttle attribute
2011-05-27 09:09:28 -03:00
Pablo Hoffman
2fa0f75f2d
added COOKIES_ENABLED setting to support disabling the cookies middleware
2011-05-27 00:35:34 -03:00
Pablo Hoffman
756bf0cc06
register AutoThrottle extension by default, and made the AUTOTHROTTLE_ENABLED setting disabled by default
2011-05-27 00:22:13 -03:00
Pablo Hoffman
dcc28b7186
added setting: AUTOTHROTTLE_ENABLED
2011-05-22 18:31:36 -03:00
Pablo Hoffman
110cd05296
added Spider.dont_throttle attribute to disable AutoThrottle extension per spider
2011-05-22 18:26:38 -03:00
Shane Evans
88dbe2ae87
fix error messages due to fetching pages during shutdown process
This version keeps the faster approach of not processing request callbacks when engine is shutting down
2011-05-20 14:35:37 +01:00
Pablo Hoffman
3897e33612
fixed stupid bug in scheduler introduced in previous change
2011-05-20 03:52:41 -03:00
Pablo Hoffman
70b0e42ca6
removed unused imports
2011-05-20 03:26:07 -03:00
Pablo Hoffman
d72d3f4607
stack trace dump extension: also dump engine status, and support triggering it with SIGQUIT, besides SIGUSR2
2011-05-20 03:25:00 -03:00
Pablo Hoffman
6069b0e5b2
Fixed 100% CPU loop that occurred in some cases when Scrapy was shutting down
2011-05-20 03:21:36 -03:00
Pablo Hoffman
951ba507f9
Removed support for default values in Scrapy items, which have proven confusing in the past
2011-05-19 21:42:46 -03:00
Pablo Hoffman
503f302010
removed remaining references to scheduler middleware from doc, as it will be removed on next release
2011-05-18 19:48:48 -03:00
Pablo Hoffman
3fd17432cf
fixed outdated documentation
2011-05-18 14:46:20 -03:00
Pablo Hoffman
9016e7e993
added role to link to scrapy source code (not yet used)
2011-05-18 14:43:34 -03:00
Pablo Hoffman
a98e9e054b
minor fix to spider closed count stat
2011-05-18 12:45:19 -03:00
Pablo Hoffman
cd85c12c33
Some Link extractor improvements:
* added support for ignoring common file extensions that are not followed if
they occur in links
* fixed link extractor documentation issues
* slightly improved performance of applying filters
* added link to link extractors doc from documentation index
2011-05-18 12:32:34 -03:00
Pablo Hoffman
495152bd50
disabled verbose depth stats collection by default, added DEPTH_STATS_VERBOSE setting to enable it
2011-05-18 11:04:48 -03:00
Pablo Hoffman
accb6ed830
dump stats to log by default (i.e. change default value of STATS_DUMP to True)
2011-05-17 22:42:05 -03:00
Pablo Hoffman
315457c2ef
added support for -a option to runspider command (like it works with crawl command)
2011-05-17 22:07:49 -03:00
Pablo Hoffman
ab6a4d053f
minor code improvement
2011-05-16 09:56:32 -03:00
Pablo Hoffman
d29eccba56
AutoThrottle: added missing line to connect spider_closed handler
2011-05-16 09:42:44 -03:00
Pablo Hoffman
403dc536e2
improved documentation of AutoThrottle extension
2011-05-15 06:07:26 -03:00
Pablo Hoffman
2b933a4a8c
added AutoThrottle extension (still under testing, not yet enabled by default)
2011-05-15 05:39:58 -03:00
Pablo Hoffman
bd8d7f5cf4
collect download latencies in 'download_latency' request/response meta key
2011-05-15 05:24:01 -03:00
Pablo Hoffman
668dfcabf3
send the response_received signal from the engine, after tying it with the corresponding request
2011-05-15 05:20:14 -03:00
Pablo Hoffman
f9aa819b06
scraper: minor performance improvement by using collections.deque() as in downloader (see previous commit)
2011-05-14 21:50:14 -03:00
Pablo Hoffman
079de67719
downloader: minor performance improvement by using collections.deque() to avoid the list.pop(0) call which is O(n)
2011-05-14 21:47:25 -03:00
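The rationale in commit 079de67719 above can be sketched as follows, assuming a simple FIFO queue. `list.pop(0)` shifts every remaining element on each call, while `collections.deque.popleft()` removes from the left end in constant time. The function names are hypothetical and are not taken from the Scrapy downloader.

```python
from collections import deque


def drain_with_list(items):
    """Drain a FIFO queue backed by a list; each pop(0) is O(n)."""
    queue = list(items)
    out = []
    while queue:
        out.append(queue.pop(0))  # shifts all remaining elements left
    return out


def drain_with_deque(items):
    """Drain a FIFO queue backed by a deque; each popleft() is O(1)."""
    queue = deque(items)
    out = []
    while queue:
        out.append(queue.popleft())  # no element shifting
    return out


if __name__ == "__main__":
    # Both approaches yield the same order; only the cost per pop differs.
    print(drain_with_list(range(5)) == drain_with_deque(range(5)))
```

Both functions return the items in insertion order, so swapping one for the other changes performance, not behavior.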
Pablo Hoffman
7e62a0a1a1
Downloader: Added support for dynamically adjusting download delay and maximum concurrent requests
2011-05-14 21:35:46 -03:00
Pablo Hoffman
bac46ba438
make sure Request.method is always str
2011-05-02 01:11:19 -03:00
Pablo Hoffman
afa23688c6
fixed bug in scrapy.http.Headers: values weren't being encoded to str when passed as lists
2011-05-01 19:39:13 -03:00
Pablo Hoffman
7f97259ba7
added w3lib to requirements, in installation guide
2011-05-01 11:14:57 -03:00
Pablo Hoffman
718428c0ab
debian/control: added python-setuptools to Recommends, because it's needed by 'scrapy deploy' command
2011-05-01 11:00:02 -03:00
Pablo Hoffman
d08281a44f
Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12
2011-04-30 01:35:43 -03:00
Pablo Hoffman
4a83167698
fixed small doc typo
2011-04-30 01:35:30 -03:00
Pablo Hoffman
cf572bb642
removed experimental examples
2011-04-28 18:07:23 -03:00
Pablo Hoffman
bb2b67c862
updated tutorial to use 'dmoz' as the name of the spider instead of 'dmoz.org', so that it's more similar to the dirbot example project
2011-04-28 09:31:57 -03:00
Pablo Hoffman
bf73002428
removed googledir example, replaced by dirbot project on github. updated docs accordingly
2011-04-28 02:28:39 -03:00
Pablo Hoffman
b12dd76bb8
Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12
2011-04-25 09:31:18 -03:00
Pablo Hoffman
678f08bc1b
added warning about using 'parse' as callback in crawl spider rules
2011-04-25 09:30:42 -03:00
Pablo Hoffman
18d303b5f1
ported internal scrapy.utils imports to w3lib
2011-04-19 01:33:52 -03:00
Pablo Hoffman
fcc8d73840
Removed scrapy.contrib.ibl module (and submodules). They have been moved to a new library "scrapely". See https://github.com/scrapy/scrapely
2011-04-19 01:04:22 -03:00
Pablo Hoffman
ebcbb9f453
debian: added python-w3lib package to dependencies
2011-04-19 00:55:08 -03:00