mirror of https://github.com/scrapy/scrapy.git synced 2025-02-26 00:03:57 +00:00

962 Commits

Author SHA1 Message Date
Pablo Hoffman
bff3d31469 scrapyd: updated schedule.json response format 2011-09-04 09:29:24 -03:00
Pablo Hoffman
a1dbc62b45 removed CONCURRENT_SPIDERS setting (use scrapyd maxproc instead) 2011-09-02 18:27:39 -03:00
Pablo Hoffman
40f7075f11 added initial documentation about suspend and resume crawls 2011-09-02 13:12:27 -03:00
Pablo Hoffman
27dd68a690 added SpiderState extension 2011-09-02 13:06:59 -03:00
Pablo Hoffman
6a31ab667d minor fix to doc 2011-09-01 15:08:23 -03:00
Pablo Hoffman
d98b058c21 no longer recommend using lambdas in the doc, as they're not friendly with scheduler persistence 2011-09-01 15:06:49 -03:00
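A minimal sketch of why lambdas clash with a persistent scheduler, assuming (this is not taken from the Scrapy source) that a request is stored by replacing its callback with the name of a spider method and looking it up again on load. `DemoSpider`, `request_to_dict` and `request_from_dict` are hypothetical names used only for illustration.

```python
import marshal


class DemoSpider:
    """Hypothetical spider with one named callback method."""

    def parse_item(self, response):
        return response


def request_to_dict(spider, url, callback):
    """Swap the callback for its method name so the record can be marshalled."""
    name = getattr(callback, "__name__", None)
    # A lambda is named "<lambda>", which is not an attribute of the spider,
    # so there is nothing to look up again after a restart.
    if name is None or getattr(spider, name, None) != callback:
        raise ValueError("callback is not a method of the spider: %r" % (callback,))
    return {"url": url, "callback": name}


def request_from_dict(spider, data):
    """Resolve the stored method name back to a bound method."""
    return {"url": data["url"], "callback": getattr(spider, data["callback"])}


spider = DemoSpider()

# A named method survives a round trip through marshal.
stored = marshal.dumps(request_to_dict(spider, "http://example.com/", spider.parse_item))
restored = request_from_dict(spider, marshal.loads(stored))

# A lambda cannot be stored by name, so the conversion is rejected.
try:
    request_to_dict(spider, "http://example.com/", lambda response: response)
    lambda_ok = True
except ValueError:
    lambda_ok = False
```

Under these assumptions, named spider methods are the safe choice for callbacks whenever requests may be written to disk.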
Pablo Hoffman
76af0cdd44 updated documentation and code to use -s instead of --set option 2011-09-01 14:35:37 -03:00
Pablo Hoffman
98b68ca89d scrapyd: documented support for passing setting to spiders in schedule.json 2011-08-27 01:31:12 -03:00
Pablo Hoffman
5c6b0631e2 minor doc fix 2011-08-19 11:42:03 -03:00
Pablo Hoffman
9d97e73a24 fixed priority handling on the new scheduler so that it's backwards compatible (ie. bigger priorities are higher). also fixed a few documentation bugs related to requests priority 2011-08-19 08:26:41 -03:00
Pablo Hoffman
a3697421c0 some minor updates to documentation 2011-08-11 09:19:59 -03:00
Pablo Hoffman
5da6ffb57b Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12 2011-08-11 09:11:19 -03:00
Pablo Hoffman
bc2d2183e9 fixed import in doc 2011-08-11 09:11:08 -03:00
Pablo Hoffman
19e6da59d8 added new downloader middleware: ChunkedTransferMiddleware 2011-08-09 03:03:25 -03:00
Pablo Hoffman
984be35461 Some telnet console changes:
* renamed manager alias to crawler
* added aliases: spider, slot
* fixed est() function
2011-08-08 15:01:08 -03:00
Pablo Hoffman
f7c0aeccc6 added note about engine_started signal 2011-08-07 03:57:09 -03:00
Pablo Hoffman
9f60c27612 added setting to support disabling DNS cache: DNSCACHE_ENABLED 2011-08-05 20:41:59 -03:00
Pablo Hoffman
cb95d7a5af added marshal to formats supported by feed exports 2011-08-03 16:16:48 -03:00
Pablo Hoffman
549725215e Initial support for a persistent scheduler, to support pausing and resuming
crawls.

* requests are serialized (using marshal by default) and stored on disk, using
  one queue per priority
* request priorities must be integers now
* breadth-first and depth-first crawling orders can now be configured
  through a new DEPTH_PRIORITY setting (see doc). backwards compatibility with
  SCHEDULER_ORDER was kept.
* requests that can't be serialized (for example, non serializable callbacks)
  are always kept in memory queues
* adapted crawl spider to work with persistent scheduler
2011-08-02 11:57:55 -03:00
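The idea described in the commit above (one on-disk queue per integer priority, serialized with marshal) can be sketched roughly as follows. This is an illustrative toy, not Scrapy's scheduler: the class `DiskPriorityQueue`, the `p<N>.queue` file naming and the rewrite-the-whole-file strategy are all assumptions made for brevity. Bigger priorities are popped first, matching the behaviour described in commit 9d97e73a24.

```python
import marshal
import os
import tempfile


class DiskPriorityQueue:
    """Hypothetical sketch: one marshal file per integer priority."""

    SUFFIX = ".queue"

    def __init__(self, directory):
        self.directory = directory

    def _path(self, priority):
        return os.path.join(self.directory, "p%d%s" % (priority, self.SUFFIX))

    def _load(self, priority):
        path = self._path(priority)
        if not os.path.exists(path):
            return []
        with open(path, "rb") as f:
            return marshal.load(f)

    def _save(self, priority, items):
        path = self._path(priority)
        if items:
            with open(path, "wb") as f:
                marshal.dump(items, f)
        elif os.path.exists(path):
            os.remove(path)  # drop empty queues so pop() can skip them

    def _priorities(self):
        # Priorities are rediscovered from file names, so state survives restarts.
        names = [n for n in os.listdir(self.directory) if n.endswith(self.SUFFIX)]
        return sorted((int(n[1:-len(self.SUFFIX)]) for n in names), reverse=True)

    def push(self, request, priority=0):
        if not isinstance(priority, int):
            raise TypeError("request priorities must be integers")
        items = self._load(priority)
        items.append(request)
        self._save(priority, items)

    def pop(self):
        # Highest priority first; FIFO within one priority.
        for priority in self._priorities():
            items = self._load(priority)
            if items:
                request = items.pop(0)
                self._save(priority, items)
                return request
        return None


workdir = tempfile.mkdtemp()
queue = DiskPriorityQueue(workdir)
queue.push({"url": "http://example.com/a"}, priority=0)
queue.push({"url": "http://example.com/b"}, priority=10)

# A new object over the same directory stands in for a resumed crawl.
resumed = DiskPriorityQueue(workdir)
first = resumed.pop()
second = resumed.pop()
```

Requests here are plain dicts of marshal-friendly values; anything marshal cannot handle would need a separate in-memory queue, as the commit message notes.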
Pablo Hoffman
f354a49d0f added FAQ about preventing bots getting banned 2011-07-28 00:40:30 -03:00
Pablo Hoffman
ce7a787970 Big downloader refactoring to support real concurrency limits per domain/ip,
instead of global limits per spider which were a bit useless.

This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new
settings:

* CONCURRENT_REQUESTS
* CONCURRENT_REQUESTS_PER_DOMAIN
* CONCURRENT_REQUESTS_PER_IP (overrides per domain)

The AutoThrottle extension had to be disabled, but will be ported and
re-enabled soon.
2011-07-27 13:38:09 -03:00
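As a rough illustration of how the three settings from the commit above relate, the sketch below picks the limit that applies to a download slot. The numeric values are made up for the example, the helper `slot_limit` is hypothetical, and treating a per-IP value of 0 as "disabled" is an assumption rather than something stated in the commit message.

```python
# Illustrative values only, as they might appear in a project's settings.py.
CONCURRENT_REQUESTS = 16            # global cap across the whole downloader
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per domain
CONCURRENT_REQUESTS_PER_IP = 0      # assumed: 0 means "not used"


def slot_limit(per_domain, per_ip):
    """Return (slot key type, limit); a non-zero per-IP limit overrides per-domain."""
    if per_ip:
        return ("ip", per_ip)
    return ("domain", per_domain)


default_choice = slot_limit(CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS_PER_IP)
ip_choice = slot_limit(CONCURRENT_REQUESTS_PER_DOMAIN, 2)
```

In this reading, the global CONCURRENT_REQUESTS cap still applies on top of whichever per-slot limit is selected.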
Pablo Hoffman
c59340150f Added cached DNS resolver based on old caching resolver extension from scrapy.contrib.resolver. This new one is *not* an extension, it comes builtin and always enabled. 2011-07-27 03:45:15 -03:00
Pablo Hoffman
2ac08a713d downloader: renamed SpiderInfo to Slot, for consistency with engine and scraper names 2011-07-22 02:06:10 -03:00
Pablo Hoffman
0e008268e1 removed SimpledbStatsCollector from scrapy code, it was moved to https://github.com/scrapinghub/scaws 2011-07-20 10:38:16 -03:00
Pablo Hoffman
b6b0a54d9f removed FAQ entry 2011-07-20 01:31:36 -03:00
Pablo Hoffman
e3f640c7bf added FAQ entry about scrapy deploy issue on Mac + Python 2.5 2011-07-19 19:53:32 -03:00
Pablo Hoffman
84f518fc5e More core changes:
* removed execution queue (replaced by newer spider queues)
* added real support for returning iterators in Spider.start_requests()
* removed support for passing urls to 'scrapy crawl' command
2011-07-15 15:18:39 -03:00
Pablo Hoffman
dbad1373f1 Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12 2011-07-13 18:44:54 -03:00
Pablo Hoffman
18cb4ff1d8 added natty to list of supported ubuntu distros 2011-07-13 18:43:52 -03:00
Pablo Hoffman
39a2ea97c8 redirect mw: added REDIRECT_ENABLED setting and documented the other settings 2011-07-13 14:18:15 -03:00
Pablo Hoffman
541ed3913b retry middleware: added RETRY_ENABLED setting and documented the other settings more properly, also improved messages when no longer retrying requests 2011-07-13 11:55:05 -03:00
Pablo Hoffman
763f3dc628 minor update to doc 2011-07-12 19:56:39 -03:00
Pablo Hoffman
bfda9ec319 added clarification about scrapy versioning including the recently adopted odd/even versioning scheme
--HG--
rename : docs/api-stability.rst => docs/versioning.rst
2011-07-12 19:53:23 -03:00
Pablo Hoffman
4fde1ef94d added CloseSpider exception, to manually close spiders 2011-07-12 14:24:10 -03:00
Pablo Hoffman
db5cae7c03 SitemapSpider: added support for filtering which sitemaps to follow (patch contributed by Rolando Espinoza). closes #330 2011-06-23 18:18:29 -03:00
Pablo Hoffman
57c43fdce6 added SitemapSpider, with tests and doc 2011-06-15 11:54:34 -03:00
Pablo Hoffman
91dc46539f added LogStats extension for periodically logging basic stats (like crawled pages and scraped items) 2011-06-14 00:50:05 -03:00
Pablo Hoffman
841e9913db renamed CLOSESPIDER_ITEMPASSED setting to CLOSESPIDER_ITEMCOUNT, to follow the refactoring done in r2630 2011-06-13 16:58:51 -03:00
Pablo Hoffman
474cba512c simplified MemoryDebugger extension to use stats for dumping memory debugging info 2011-06-06 03:13:28 -03:00
Pablo Hoffman
5fbc32c015 call stats collector engine_stopped() after the engine is closed (to make sure all data from extensions has been collected), and added that method to documented api 2011-06-06 03:12:40 -03:00
Pablo Hoffman
9d9c8877da added 'scrapy edit' command 2011-06-05 22:02:56 -03:00
Pablo Hoffman
03ae481cad removed experimental crawlspider v2 2011-06-03 18:23:23 -03:00
Pablo Hoffman
5bf733b6f6 Changed default representation of items to pretty-printed dicts. This improves
default logging by making log more readable in the default case, for both Scraped and Dropped lines.

Projects can still customize how items are represented by overriding the item's __str__ method, as usual.
2011-06-03 01:13:01 -03:00
Pablo Hoffman
1bc2339bb8 Merged item passed and item scraped concepts, as they have often proved
confusing in the past.

This means:

* original item_scraped signal was removed
* original item_passed signal was renamed to item_scraped
* old log lines "Scraped Item..." removed
* old log lines "Passed Item..." renamed to "Scraped Item..."
2011-06-03 01:13:00 -03:00
Pablo Hoffman
e6091df551 fixed doc typo 2011-05-30 09:04:31 -03:00
Pablo Hoffman
1d98fc8fb5 added spider_error signal 2011-05-29 22:38:17 -03:00
Pablo Hoffman
2fa0f75f2d added COOKIES_ENABLED setting to support disabling the cookies middleware 2011-05-27 00:35:34 -03:00
Pablo Hoffman
d72d3f4607 stack trace dump extension: also dump engine status, and support triggering it with SIGQUIT, besides SIGUSR2 2011-05-20 03:25:00 -03:00
Pablo Hoffman
951ba507f9 Removed support for default values in Scrapy items, which have proven confusing in the past 2011-05-19 21:42:46 -03:00
Pablo Hoffman
503f302010 removed remaining references to scheduler middleware from doc, as it will be removed on next release 2011-05-18 19:48:48 -03:00