scrapy

mirror of https://github.com/scrapy/scrapy.git synced 2025-02-25 07:03:56 +00:00

Author	SHA1	Message	Date
Pablo Hoffman	0eeff76227	fixed formatting of scrapyd doc	2011-12-20 03:18:37 -02:00
Pablo Hoffman	992af8d38f	ubuntu repos: added support for oneiric release	2011-10-25 14:26:38 -02:00
Pablo Hoffman	c38c49d56a	fixed PickleItemExporter bug, added unittest, and added pickle to supported feed exports formats	2011-10-25 02:36:51 -02:00
Pablo Hoffman	8bdf288428	made scrapyd doc more version agnostic	2011-10-23 05:29:54 -02:00
Pablo Hoffman	431441cb52	updated documentation to remove references to old issue tracker and mercurial repos	2011-09-25 13:06:24 -03:00
Pablo Hoffman	ce03ccd4ec	updated documentation about DEPTH_PRIORITY and DFO/BFO crawls	2011-09-23 13:22:25 -03:00
Julien Duponchelle	b7c436343a	scrapy deploy support git version	2011-09-21 22:17:08 +02:00
Daniel Grana	5f1b1c05f8	Do not filter requests with dont_filter attribute set in OffsiteMiddleware	2011-09-08 15:18:10 -03:00
Pablo Hoffman	bff3d31469	scrapyd: updated schedule.json response format	2011-09-04 09:29:24 -03:00
Pablo Hoffman	a1dbc62b45	removed CONCURRENT_SPIDERS setting (use scrapyd maxproc instead)	2011-09-02 18:27:39 -03:00
Pablo Hoffman	40f7075f11	added initial documentation about suspend and resume crawls	2011-09-02 13:12:27 -03:00
Pablo Hoffman	27dd68a690	added SpiderState extension	2011-09-02 13:06:59 -03:00
Pablo Hoffman	6a31ab667d	minor fix to doc	2011-09-01 15:08:23 -03:00
Pablo Hoffman	d98b058c21	no longer recommend using lambdas in the doc, as they're not friendly with scheduler persistence	2011-09-01 15:06:49 -03:00
Pablo Hoffman	76af0cdd44	updated documentation and code to use -s instead of --set option	2011-09-01 14:35:37 -03:00
Pablo Hoffman	98b68ca89d	scrapyd: documented support for passing setting to spiders in schedule.json	2011-08-27 01:31:12 -03:00
Pablo Hoffman	5c6b0631e2	minor doc fix	2011-08-19 11:42:03 -03:00
Pablo Hoffman	9d97e73a24	fixed priority handling on the new scheduler so that it's backwards compatible (ie. bigger priorities are higher). also fixed a few documentation bugs related to requests priority	2011-08-19 08:26:41 -03:00
Pablo Hoffman	a3697421c0	some minor updates to documentation	2011-08-11 09:19:59 -03:00
Pablo Hoffman	19e6da59d8	added new downloader middleware: ChunkedTransferMiddleware	2011-08-09 03:03:25 -03:00
Pablo Hoffman	984be35461	Some telnet console changes: * renamed manager alias to crawler * added aliases: spider, slot * fixed est() function	2011-08-08 15:01:08 -03:00
Pablo Hoffman	f7c0aeccc6	added note about engine_started signal	2011-08-07 03:57:09 -03:00
Pablo Hoffman	9f60c27612	added setting to support disabling DNS cache: DNSCACHE_ENABLED	2011-08-05 20:41:59 -03:00
Pablo Hoffman	cb95d7a5af	added marshal to formats supported by feed exports	2011-08-03 16:16:48 -03:00
Pablo Hoffman	549725215e	Initial support for a persistent scheduler, to support pausing and resuming crawls. * requests are serialized (using marshal by default) and stored on disk, using one queue per priority * request priorities must be integers now * breadth-first and depth-first crawling orders can now be configured through a new DEPTH_PRIORITY setting (see doc). backwards compatibility with SCHEDULER_ORDER was kept. * requests that can't be serialized (for example, non serializable callbacks) are always kept in memory queues * adapted crawl spider to work with persistent scheduler	2011-08-02 11:57:55 -03:00
Pablo Hoffman	ce7a787970	Big downloader refactoring to support real concurrency limits per domain/ip, instead of global limits per spider which were a bit useless. This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new settings: * CONCURRENT_REQUESTS * CONCURRENT_REQUESTS_PER_DOMAIN * CONCURRENT_REQUESTS_PER_IP (overrides per domain) The AutoThrottle extension had to be disabled, but will be ported and re-enabled soon.	2011-07-27 13:38:09 -03:00
Pablo Hoffman	2ac08a713d	downloader: renamed SpiderInfo to Slot, for consistency with engine and scraper names	2011-07-22 02:06:10 -03:00
Pablo Hoffman	0e008268e1	removed SimpledbStatsCollector from scrapy code, it was moved to https://github.com/scrapinghub/scaws	2011-07-20 10:38:16 -03:00
Pablo Hoffman	84f518fc5e	More core changes: * removed execution queue (replaced by newer spider queues) * added real support for returning iterators in Spider.start_requests() * removed support for passing urls to 'scrapy crawl' command	2011-07-15 15:18:39 -03:00
Pablo Hoffman	dbad1373f1	Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12	2011-07-13 18:44:54 -03:00
Pablo Hoffman	18cb4ff1d8	added natty to list of supported ubuntu distros	2011-07-13 18:43:52 -03:00
Pablo Hoffman	39a2ea97c8	redirect mw: added REDIRECT_ENABLED setting and documented the other settings	2011-07-13 14:18:15 -03:00
Pablo Hoffman	541ed3913b	retry middleware: added RETRY_ENABLED setting and documented the other settings more properly, also improved messages when no longer retrying requests	2011-07-13 11:55:05 -03:00
Pablo Hoffman	4fde1ef94d	added CloseSpider exception, to manually close spiders	2011-07-12 14:24:10 -03:00
Pablo Hoffman	db5cae7c03	SitemapSpider: added support for filtering which sitemaps to follow (patch contributed by Rolando Espinoza). closes #330	2011-06-23 18:18:29 -03:00
Pablo Hoffman	57c43fdce6	added SitemapSpider, with tests and doc	2011-06-15 11:54:34 -03:00
Pablo Hoffman	91dc46539f	added LogStats extension for periodically logging basic stats (like crawled pages and scraped items)	2011-06-14 00:50:05 -03:00
Pablo Hoffman	841e9913db	renamed CLOSESPIDER_ITEMPASSED setting to CLOSESPIDER_ITEMCOUNT, to follow the refactoring done in r2630	2011-06-13 16:58:51 -03:00
Pablo Hoffman	474cba512c	simplified MemoryDebugger extension to use stats for dumping memory debugging info	2011-06-06 03:13:28 -03:00
Pablo Hoffman	5fbc32c015	call stats collector engine_stopped() after the engine is closed (to make sure all data from extensions has been collected), and added that method to documented api	2011-06-06 03:12:40 -03:00
Pablo Hoffman	9d9c8877da	added 'scrapy edit' command	2011-06-05 22:02:56 -03:00
Pablo Hoffman	1bc2339bb8	Merged item passed and item scraped concepts, as they have often proved confusing in the past. This means: * original item_scraped signal was removed * original item_passed signal was renamed to item_scraped * old log lines "Scraped Item..." removed * old log lines "Passed Item..." renamed to "Scraped Item..."	2011-06-03 01:13:00 -03:00
Pablo Hoffman	e6091df551	fixed doc typo	2011-05-30 09:04:31 -03:00
Pablo Hoffman	1d98fc8fb5	added spider_error signal	2011-05-29 22:38:17 -03:00
Pablo Hoffman	2fa0f75f2d	added COOKIES_ENABLED setting to support disabling the cookies middleware	2011-05-27 00:35:34 -03:00
Pablo Hoffman	d72d3f4607	stack trace dump extension: also dump engine status, and support triggering it with SIGQUIT, besides SIGUSR2	2011-05-20 03:25:00 -03:00
Pablo Hoffman	951ba507f9	Removed support for default values in Scrapy items, which have proven confusing in the past	2011-05-19 21:42:46 -03:00
Pablo Hoffman	503f302010	removed remaining references to scheduler middleware from doc, as it will be removed on next release	2011-05-18 19:48:48 -03:00
Pablo Hoffman	3fd17432cf	fixed outdated documentation	2011-05-18 14:46:20 -03:00
Pablo Hoffman	cd85c12c33	Some Link extractor improvements: * added support for ignoring common file extensions that are not followed if they occur in links * fixed link extractor documentation issues * slightly improved performance of applying filters * added link to link extractors doc from documentation index	2011-05-18 12:32:34 -03:00

329 Commits