mirror of https://github.com/scrapy/scrapy.git synced 2025-02-25 17:24:29 +00:00

2798 Commits

Author SHA1 Message Date
Pablo Hoffman
37830da1f6 fixed wrong code in test 2011-06-10 18:27:39 -03:00
Pablo Hoffman
c4a607fc78 Raise ValueError if url has no scheme in Request constructor 2011-06-10 18:22:36 -03:00
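The check described in the commit above could look roughly like the sketch below. This is an illustration only, not Scrapy's actual code: the class name `SimpleRequest` and the error message are hypothetical, and the scheme test relies on the standard library's `urllib.parse.urlparse`.

```python
from urllib.parse import urlparse


class SimpleRequest:
    """Hypothetical stand-in for a request object (not Scrapy's Request)."""

    def __init__(self, url):
        # A URL such as "example.com/page" parses with an empty scheme,
        # so it is rejected before the object is fully constructed.
        if not urlparse(url).scheme:
            raise ValueError(f"Missing scheme in request url: {url}")
        self.url = url
```

Failing in the constructor surfaces the mistake where the request is created, rather than later when a download is attempted.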
Pablo Hoffman
88e33ad0ad Simplified Request/Response __repr__ to be the same as __str__. This improves legibility and shouldn't affect any functionality, since we never use __repr__ for reconstructing a response AFAIK. Also fixes #318 2011-06-09 00:15:53 -03:00
Pablo Hoffman
07df0edf74 scrapyd.webservice: use twisted.web multipart data parsing, to simplify code. closes #324 2011-06-08 14:17:04 -03:00
Pablo Hoffman
7643f14c88 fixed bug handling truncated gzipped responses. closes #319 2011-06-06 18:25:14 -03:00
Pablo Hoffman
48509b036a fixed some tests accidentally broken in previous commit 2011-06-06 16:11:43 -03:00
Pablo Hoffman
f793515565 make --headers output of fetch command resemble curl format, and also show request headers 2011-06-06 15:21:50 -03:00
Pablo Hoffman
03751749a8 Scheduler refactoring which introduces the following changes:
* dropped deferred stored along with requests in scheduler queues, which will
  add the ability to support persistent schedulers in the future
* moved duplicates filter into the scheduler itself, using the same
  dupe filtering class as before (DUPEFILTER_CLASS setting)
* removed scheduler middleware component to simplify, as it was only used for
  duplicates filtering and that is now done in the scheduler itself
* adapted media pipeline to work with new scheduler
* cleanup old docstrings
2011-06-06 03:16:56 -03:00
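The idea of filtering duplicates inside the scheduler itself, as the refactoring above describes, can be sketched as follows. The classes `SeenFilter` and `TinyScheduler` are hypothetical and keyed on plain URL strings for simplicity; they are assumptions for illustration and do not reproduce Scrapy's scheduler or its dupe filter class.

```python
from collections import deque


class SeenFilter:
    """Hypothetical duplicates filter that remembers URLs already seen."""

    def __init__(self):
        self._seen = set()

    def request_seen(self, url):
        # Return True for a repeat; otherwise record the URL and return False.
        if url in self._seen:
            return True
        self._seen.add(url)
        return False


class TinyScheduler:
    """Hypothetical scheduler that consults the filter before queueing."""

    def __init__(self, dupefilter):
        self.dupefilter = dupefilter
        self.queue = deque()

    def enqueue_request(self, url):
        # Duplicates are dropped here, so no separate middleware is needed.
        if self.dupefilter.request_seen(url):
            return False
        self.queue.append(url)
        return True

    def next_request(self):
        # Hand out requests in FIFO order, or None when the queue is empty.
        return self.queue.popleft() if self.queue else None
```

Because the queue holds only plain data (no deferreds), its contents could in principle be written to disk, which is the persistence benefit the commit message mentions.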
Pablo Hoffman
474cba512c simplified MemoryDebugger extension to use stats for dumping memory debugging info 2011-06-06 03:13:28 -03:00
Pablo Hoffman
5fbc32c015 call stats collector engine_stopped() after the engine is closed (to make sure all data from extensions has been collected), and added that method to documented api 2011-06-06 03:12:40 -03:00
Pablo Hoffman
35b52fcdf0 removed deprecated stat 'envinfo/request_depth_limit'. we should instead support dumping settings, for these cases 2011-06-06 01:02:58 -03:00
Pablo Hoffman
9d9c8877da added 'scrapy edit' command 2011-06-05 22:02:56 -03:00
Pablo Hoffman
ffbc9295f6 simplified DownloaderStats middleware 2011-06-05 20:03:09 -03:00
Pablo Hoffman
3d823d6f45 simplified CoreStats extension 2011-06-05 19:57:38 -03:00
Pablo Hoffman
61cc95df7c removed crawlspider v2 tests 2011-06-03 18:26:17 -03:00
Pablo Hoffman
03ae481cad removed experimental crawlspider v2 2011-06-03 18:23:23 -03:00
Pablo Hoffman
5bf733b6f6 Changed default representation of items to pretty-printed dicts. This improves
default logging by making log more readable in the default case, for both Scraped and Dropped lines.

Projects can still customize how items are represented by overriding the item's __str__ method, as usual.
2011-06-03 01:13:01 -03:00
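A pretty-printed dict representation of the kind described above can be approximated with the standard library's `pprint.pformat`. The `DemoItem` class below is a hypothetical dict subclass used only for illustration; it is not Scrapy's `Item` class.

```python
from pprint import pformat


class DemoItem(dict):
    """Hypothetical item type; stands in for a scraped item."""

    def __str__(self):
        # Format the fields as a pretty-printed dict for readable log lines.
        return pformat(dict(self))


class CompactItem(DemoItem):
    """A project-specific override that replaces the default representation."""

    def __str__(self):
        return f"<CompactItem name={self.get('name')!r}>"
```

Overriding `__str__` in a subclass, as `CompactItem` does, mirrors the customization route the commit message mentions.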
Pablo Hoffman
1bc2339bb8 Merged item passed and item scraped concepts, as they have often proved
confusing in the past.

This means:

* original item_scraped signal was removed
* original item_passed signal was renamed to item_scraped
* old log lines "Scraped Item..." removed
* old log lines "Passed Item..." renamed to "Scraped Item..."
2011-06-03 01:13:00 -03:00
Pablo Hoffman
e6091df551 fixed doc typo 2011-05-30 09:04:31 -03:00
Pablo Hoffman
1d98fc8fb5 added spider_error signal 2011-05-29 22:38:17 -03:00
Pablo Hoffman
13d8066788 removed undocumented (and untested) extension: SpiderCloseDelay 2011-05-27 11:52:33 -03:00
Pablo Hoffman
6c369c50ca removed support for spider.dont_throttle attribute 2011-05-27 09:09:28 -03:00
Pablo Hoffman
2fa0f75f2d added COOKIES_ENABLED setting to support disabling the cookies middleware 2011-05-27 00:35:34 -03:00
Pablo Hoffman
756bf0cc06 register AutoThrottle extension by default, and made AUTOTHROTTLE_ENABLED disabled by default 2011-05-27 00:22:13 -03:00
Pablo Hoffman
dcc28b7186 added setting: AUTOTHROTTLE_ENABLED 2011-05-22 18:31:36 -03:00
Pablo Hoffman
110cd05296 added Spider.dont_throttle attribute to disable AutoThrottle extension per spider 2011-05-22 18:26:38 -03:00
Shane Evans
88dbe2ae87 fix error messages due to fetching pages during shutdown process
This version keeps the faster approach of not processing request callbacks when engine is shutting down
2011-05-20 14:35:37 +01:00
Pablo Hoffman
3897e33612 fixed stupid bug in scheduler introduced in previous change 2011-05-20 03:52:41 -03:00
Pablo Hoffman
70b0e42ca6 removed unused imports 2011-05-20 03:26:07 -03:00
Pablo Hoffman
d72d3f4607 stack trace dump extension: also dump engine status, and support triggering it with SIGQUIT, besides SIGUSR2 2011-05-20 03:25:00 -03:00
Pablo Hoffman
6069b0e5b2 Fixed 100% cpu loop that occurred in some cases where Scrapy was shutting down 2011-05-20 03:21:36 -03:00
Pablo Hoffman
951ba507f9 Removed support for default values in Scrapy items, which have proven confusing in the past 2011-05-19 21:42:46 -03:00
Pablo Hoffman
503f302010 removed remaining references to scheduler middleware from doc, as it will be removed on next release 2011-05-18 19:48:48 -03:00
Pablo Hoffman
3fd17432cf fixed outdated documentation 2011-05-18 14:46:20 -03:00
Pablo Hoffman
9016e7e993 added role to link to scrapy source code (not yet used) 2011-05-18 14:43:34 -03:00
Pablo Hoffman
a98e9e054b minor fix to spider closed count stat 2011-05-18 12:45:19 -03:00
Pablo Hoffman
cd85c12c33 Some Link extractor improvements:
* added support for ignoring common file extensions that are not followed if
  they occur in links
* fixed link extractor documentation issues
* slightly improved performance of applying filters
* added link to link extractors doc from documentation index
2011-05-18 12:32:34 -03:00
Pablo Hoffman
495152bd50 disabled verbose depth stats collection by default, added DEPTH_STATS_VERBOSE setting to enable it 2011-05-18 11:04:48 -03:00
Pablo Hoffman
accb6ed830 dump stats to log by default (ie. change default value of STATS_DUMP to True) 2011-05-17 22:42:05 -03:00
Pablo Hoffman
315457c2ef added support for -a option to runspider command (like it works with crawl command) 2011-05-17 22:07:49 -03:00
Pablo Hoffman
ab6a4d053f minor code improvement 2011-05-16 09:56:32 -03:00
Pablo Hoffman
d29eccba56 AutoThrottle: added missing line to connect spider_closed handler 2011-05-16 09:42:44 -03:00
Pablo Hoffman
403dc536e2 improved documentation of AutoThrottle extension 2011-05-15 06:07:26 -03:00
Pablo Hoffman
2b933a4a8c added AutoThrottle extension (still under testing, not yet enabled by default) 2011-05-15 05:39:58 -03:00
Pablo Hoffman
bd8d7f5cf4 collect download latencies in 'download_latency' request/response meta key 2011-05-15 05:24:01 -03:00
Pablo Hoffman
668dfcabf3 send the response_received signal from the engine, after tying it with the corresponding request 2011-05-15 05:20:14 -03:00
Pablo Hoffman
f9aa819b06 scraper: minor performance improvement by using collections.deque() as in downloader (see previous commit) 2011-05-14 21:50:14 -03:00
Pablo Hoffman
079de67719 downloader: minor performance improvement by using collections.deque() to avoid the list.pop(0) call which is O(n) 2011-05-14 21:47:25 -03:00
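The performance point in the commit above is easy to illustrate: removing the first element of a list shifts every remaining element, while `collections.deque` removes from the left in constant time. The helper names `drain_list` and `drain_deque` below are hypothetical and exist only for this comparison.

```python
from collections import deque


def drain_list(items):
    """Consume a FIFO queue backed by a list; each pop(0) is O(n)."""
    queue = list(items)
    out = []
    while queue:
        out.append(queue.pop(0))  # shifts all remaining elements left
    return out


def drain_deque(items):
    """Consume a FIFO queue backed by a deque; each popleft() is O(1)."""
    queue = deque(items)
    out = []
    while queue:
        out.append(queue.popleft())  # no element shifting required
    return out
```

Both functions return the items in the same order; only the cost per removal differs, which matters once queues grow large.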
Pablo Hoffman
7e62a0a1a1 Downloader: Added support for dynamically adjusting download delay and maximum concurrent requests 2011-05-14 21:35:46 -03:00
Pablo Hoffman
bac46ba438 make sure Request.method is always str 2011-05-02 01:11:19 -03:00