mirror of https://github.com/scrapy/scrapy.git synced 2025-02-25 07:03:56 +00:00

329 Commits

Author SHA1 Message Date
Pablo Hoffman
0eeff76227 fixed formatting of scrapyd doc 2011-12-20 03:18:37 -02:00
Pablo Hoffman
992af8d38f ubuntu repos: added support for oneiric release 2011-10-25 14:26:38 -02:00
Pablo Hoffman
c38c49d56a fixed PickleItemExporter bug, added unittest, and added pickle to supported feed export formats 2011-10-25 02:36:51 -02:00
Pablo Hoffman
8bdf288428 made scrapyd doc more version agnostic 2011-10-23 05:29:54 -02:00
Pablo Hoffman
431441cb52 updated documentation to remove references to old issue tracker and mercurial repos 2011-09-25 13:06:24 -03:00
Pablo Hoffman
ce03ccd4ec updated documentation about DEPTH_PRIORITY and DFO/BFO crawls 2011-09-23 13:22:25 -03:00
Julien Duponchelle
b7c436343a scrapy deploy support git version 2011-09-21 22:17:08 +02:00
Daniel Grana
5f1b1c05f8 Do not filter requests with dont_filter attribute set in OffsiteMiddleware 2011-09-08 15:18:10 -03:00
Pablo Hoffman
bff3d31469 scrapyd: updated schedule.json response format 2011-09-04 09:29:24 -03:00
Pablo Hoffman
a1dbc62b45 removed CONCURRENT_SPIDERS setting (use scrapyd maxproc instead) 2011-09-02 18:27:39 -03:00
Pablo Hoffman
40f7075f11 added initial documentation about suspend and resume crawls 2011-09-02 13:12:27 -03:00
Pablo Hoffman
27dd68a690 added SpiderState extension 2011-09-02 13:06:59 -03:00
Pablo Hoffman
6a31ab667d minor fix to doc 2011-09-01 15:08:23 -03:00
Pablo Hoffman
d98b058c21 no longer recommend using lambdas in the doc, as they're not friendly with scheduler persistence 2011-09-01 15:06:49 -03:00
Pablo Hoffman
76af0cdd44 updated documentation and code to use -s instead of --set option 2011-09-01 14:35:37 -03:00
Pablo Hoffman
98b68ca89d scrapyd: documented support for passing setting to spiders in schedule.json 2011-08-27 01:31:12 -03:00
Pablo Hoffman
5c6b0631e2 minor doc fix 2011-08-19 11:42:03 -03:00
Pablo Hoffman
9d97e73a24 fixed priority handling on the new scheduler so that it's backwards compatible (i.e. bigger priorities are higher). also fixed a few documentation bugs related to request priorities 2011-08-19 08:26:41 -03:00
Pablo Hoffman
a3697421c0 some minor updates to documentation 2011-08-11 09:19:59 -03:00
Pablo Hoffman
19e6da59d8 added new downloader middleware: ChunkedTransferMiddleware 2011-08-09 03:03:25 -03:00
Pablo Hoffman
984be35461 Some telnet console changes:
* renamed manager alias to crawler
* added aliases: spider, slot
* fixed est() function
2011-08-08 15:01:08 -03:00
Pablo Hoffman
f7c0aeccc6 added note about engine_started signal 2011-08-07 03:57:09 -03:00
Pablo Hoffman
9f60c27612 added setting to support disabling DNS cache: DNSCACHE_ENABLED 2011-08-05 20:41:59 -03:00
Pablo Hoffman
cb95d7a5af added marshal to formats supported by feed exports 2011-08-03 16:16:48 -03:00
Pablo Hoffman
549725215e Initial support for a persistent scheduler, to support pausing and resuming
crawls.

* requests are serialized (using marshal by default) and stored on disk, using
  one queue per priority
* request priorities must be integers now
* breadth-first and depth-first crawling orders can now be configured
  through a new DEPTH_PRIORITY setting (see doc). backwards compatibility with
  SCHEDULER_ORDER was kept.
* requests that can't be serialized (for example, non serializable callbacks)
  are always kept in memory queues
* adapted crawl spider to work with persistent scheduler
2011-08-02 11:57:55 -03:00
Pablo Hoffman
ce7a787970 Big downloader refactoring to support real concurrency limits per domain/ip,
instead of global limits per spider which were a bit useless.

This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new
settings:

* CONCURRENT_REQUESTS
* CONCURRENT_REQUESTS_PER_DOMAIN
* CONCURRENT_REQUESTS_PER_IP (overrides per domain)

The AutoThrottle extension had to be disabled, but will be ported and
re-enabled soon.
2011-07-27 13:38:09 -03:00
Pablo Hoffman
2ac08a713d downloader: renamed SpiderInfo to Slot, for consistency with engine and scraper names 2011-07-22 02:06:10 -03:00
Pablo Hoffman
0e008268e1 removed SimpledbStatsCollector from scrapy code, it was moved to https://github.com/scrapinghub/scaws 2011-07-20 10:38:16 -03:00
Pablo Hoffman
84f518fc5e More core changes:
* removed execution queue (replaced by newer spider queues)
* added real support for returning iterators in Spider.start_requests()
* removed support for passing urls to 'scrapy crawl' command
2011-07-15 15:18:39 -03:00
Pablo Hoffman
dbad1373f1 Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12 2011-07-13 18:44:54 -03:00
Pablo Hoffman
18cb4ff1d8 added natty to list of supported ubuntu distros 2011-07-13 18:43:52 -03:00
Pablo Hoffman
39a2ea97c8 redirect mw: added REDIRECT_ENABLED setting and documented the other settings 2011-07-13 14:18:15 -03:00
Pablo Hoffman
541ed3913b retry middleware: added RETRY_ENABLED setting and documented the other settings more properly, also improved messages when no longer retrying requests 2011-07-13 11:55:05 -03:00
Pablo Hoffman
4fde1ef94d added CloseSpider exception, to manually close spiders 2011-07-12 14:24:10 -03:00
Pablo Hoffman
db5cae7c03 SitemapSpider: added support for filtering which sitemaps to follow (patch contributed by Rolando Espinoza). closes #330 2011-06-23 18:18:29 -03:00
Pablo Hoffman
57c43fdce6 added SitemapSpider, with tests and doc 2011-06-15 11:54:34 -03:00
Pablo Hoffman
91dc46539f added LogStats extension for periodically logging basic stats (like crawled pages and scraped items) 2011-06-14 00:50:05 -03:00
Pablo Hoffman
841e9913db renamed CLOSESPIDER_ITEMPASSED setting to CLOSESPIDER_ITEMCOUNT, to follow the refactoring done in r2630 2011-06-13 16:58:51 -03:00
Pablo Hoffman
474cba512c simplified MemoryDebugger extension to use stats for dumping memory debugging info 2011-06-06 03:13:28 -03:00
Pablo Hoffman
5fbc32c015 call stats collector engine_stopped() after the engine is closed (to make sure all data from extensions has been collected), and added that method to documented api 2011-06-06 03:12:40 -03:00
Pablo Hoffman
9d9c8877da added 'scrapy edit' command 2011-06-05 22:02:56 -03:00
Pablo Hoffman
1bc2339bb8 Merged item passed and item scraped concepts, as they have often proved
confusing in the past.

This means:

* original item_scraped signal was removed
* original item_passed signal was renamed to item_scraped
* old log lines "Scraped Item..." removed
* old log lines "Passed Item..." renamed to "Scraped Item..."
2011-06-03 01:13:00 -03:00
Pablo Hoffman
e6091df551 fixed doc typo 2011-05-30 09:04:31 -03:00
Pablo Hoffman
1d98fc8fb5 added spider_error signal 2011-05-29 22:38:17 -03:00
Pablo Hoffman
2fa0f75f2d added COOKIES_ENABLED setting to support disabling the cookies middleware 2011-05-27 00:35:34 -03:00
Pablo Hoffman
d72d3f4607 stack trace dump extension: also dump engine status, and support triggering it with SIGQUIT, besides SIGUSR2 2011-05-20 03:25:00 -03:00
Pablo Hoffman
951ba507f9 Removed support for default values in Scrapy items, which have proven confusing in the past 2011-05-19 21:42:46 -03:00
Pablo Hoffman
503f302010 removed remaining references to scheduler middleware from doc, as it will be removed on next release 2011-05-18 19:48:48 -03:00
Pablo Hoffman
3fd17432cf fixed outdated documentation 2011-05-18 14:46:20 -03:00
Pablo Hoffman
cd85c12c33 Some Link extractor improvements:
* added support for ignoring common file extensions that are not followed if
  they occur in links
* fixed link extractor documentation issues
* slightly improved performance of applying filters
* added link to link extractors doc from documentation index
2011-05-18 12:32:34 -03:00