Pablo Hoffman
9c3b9f2968
fixed bug in the JSON-RPC webservice reported in https://groups.google.com/d/topic/scrapy-users/qgVBmFybNAQ/discussion . Also removed the no-longer-supported 'run' command from extras/scrapy-ws.py
2012-05-03 12:05:40 -03:00
Pablo Hoffman
abcac4fcbd
updated maintainer to scrapinghub
2012-05-02 03:25:35 -03:00
stav
86dba76d1f
fixed documentation indentation
2012-04-30 13:09:34 -05:00
Pablo Hoffman
d567d8efbe
added note to docs/topics/firebug.rst about google directory being shut down
2012-04-19 01:34:20 -03:00
stav
f1802289cd
small doc typo change to get the fork rolling
2012-04-11 12:05:39 -05:00
Pablo Hoffman
27018fced7
changed default user agent to Scrapy/0.15 (+ http://scrapy.org ) and removed no longer needed BOT_VERSION setting
2012-03-23 13:45:21 -03:00
Pablo Hoffman
8933e2f2be
added REFERER_ENABLED setting, to control referer middleware
2012-03-22 16:35:14 -03:00
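A minimal settings sketch for the new switch (REFERER_ENABLED defaults to True; the project settings module is assumed):

    # settings.py -- sketch only
    REFERER_ENABLED = False   # disable the referer middleware entirely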
Jason Yeo
da826aa13d
fixed minor mistake in Request objects documentation
2012-03-21 10:25:41 +08:00
Pablo Hoffman
175c70ad44
fixed minor defect in link extractors documentation
2012-03-20 22:56:45 -03:00
Pablo Hoffman
35fb01156e
removed some obsolete remaining code related to sqlite support in scrapy
2012-03-16 11:55:55 -03:00
Pablo Hoffman
2b16ebdc11
added minor clarification on cookiejar request meta key usage
2012-02-29 07:19:01 -02:00
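A hedged sketch of the 'cookiejar' Request.meta key in use (spider name and URLs are hypothetical): tagging requests with different jar ids keeps separate cookie sessions in one spider, and the id must be propagated to follow-up requests.

    import scrapy

    class SessionsSpider(scrapy.Spider):
        name = "sessions"

        def start_requests(self):
            for i in range(3):
                # each distinct 'cookiejar' value keeps its own cookie session
                yield scrapy.Request("http://example.com/login",
                                     meta={"cookiejar": i},
                                     dont_filter=True)

        def parse(self, response):
            # pass the same jar along so follow-up requests reuse its cookies
            yield scrapy.Request("http://example.com/account",
                                 meta={"cookiejar": response.meta["cookiejar"]},
                                 callback=self.parse_account)

        def parse_account(self, response):
            self.logger.info("got %s with jar %s",
                             response.url, response.meta["cookiejar"])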
lostsnow
5afe4f50c1
scrapyd: support binding to a specific IP address
2012-02-29 13:47:40 +08:00
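A hedged scrapyd.conf sketch using the option this adds (option name as in the scrapyd docs; the address is a placeholder):

    [scrapyd]
    # listen only on this interface instead of all interfaces
    bind_address = 192.168.1.10
    http_port    = 6800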
Pablo Hoffman
81abb45000
fixed bug in new cookiejar documentation
2012-02-28 11:08:25 -02:00
Pablo Hoffman
26c8004125
added documentation for the new cookiejar Request.meta key
2012-02-27 19:58:58 -02:00
Pablo Hoffman
7fe7c3f3b1
MemoryUsage extension: close the spiders (instead of stopping the engine) when the limit is exceeded, providing a descriptive reason for the close. Also fixed default value of MEMUSAGE_ENABLED setting to match the documentation.
2012-02-23 17:05:06 -02:00
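Illustrative settings for the MemoryUsage extension described above (the mail address is hypothetical; the extension only works on POSIX platforms):

    # settings.py sketch
    MEMUSAGE_ENABLED = True
    MEMUSAGE_LIMIT_MB = 2048                    # spiders are closed past this limit
    MEMUSAGE_NOTIFY_MAIL = ["dev@example.com"]  # optional warning notifications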
Pablo Hoffman
7b8942a648
updated StackTraceDump extension doc
2012-02-16 15:14:17 -02:00
Pablo Hoffman
0b0bce7f3c
scrapyd: added cancel.json and listjobs.json api methods to documentation
2012-01-05 11:23:25 -02:00
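A hedged Python sketch of the two API methods (host, project name and the use of the requests library are assumptions):

    import requests

    base = "http://localhost:6800"
    jobs = requests.get(base + "/listjobs.json",
                        params={"project": "myproject"}).json()
    # cancel the first running job, if any
    if jobs.get("running"):
        requests.post(base + "/cancel.json",
                      data={"project": "myproject",
                            "job": jobs["running"][0]["id"]})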
Pablo Hoffman
8f42633a94
scrapyd: added clarification about how to disable items feeds generation
2012-01-05 11:20:50 -02:00
Pablo Hoffman
dbda33efa6
scrapyd: added support for storing items by default
...
Items are stored the same way as logs, in jsonlines format.
Also renamed logs_to_keep setting to jobs_to_keep.
2012-01-03 23:08:54 -02:00
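A hedged scrapyd.conf sketch: jobs_to_keep comes from the commit above, items_dir is the related option in the scrapyd docs (leaving it empty disables item feed storage, as the previous clarification describes):

    [scrapyd]
    # where item feeds are written, in jsonlines format, one file per job
    items_dir    = /var/lib/scrapyd/items
    jobs_to_keep = 5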
Pablo Hoffman
41fd3c4f6c
doc: removed duplicated callback argument from Request.replace()
2011-12-23 15:55:46 -02:00
Pablo Hoffman
0eeff76227
fixed formatting of scrapyd doc
2011-12-20 03:18:37 -02:00
Pablo Hoffman
992af8d38f
ubuntu repos: added support for oneiric release
2011-10-25 14:26:38 -02:00
Pablo Hoffman
c38c49d56a
fixed PickleItemExporter bug, added unit test, and added pickle to supported feed export formats
2011-10-25 02:36:51 -02:00
Pablo Hoffman
8bdf288428
made scrapyd doc more version agnostic
2011-10-23 05:29:54 -02:00
Pablo Hoffman
431441cb52
updated documentation to remove references to old issue tracker and mercurial repos
2011-09-25 13:06:24 -03:00
Pablo Hoffman
ce03ccd4ec
updated documentation about DEPTH_PRIORITY and DFO/BFO crawls
2011-09-23 13:22:25 -03:00
Julien Duponchelle
b7c436343a
scrapy deploy: added support for git-based versions
2011-09-21 22:17:08 +02:00
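A hedged scrapy.cfg sketch of the feature (deploy target, URL and project are placeholders):

    [deploy:myserver]
    url = http://localhost:6800/
    project = myproject
    # build the egg version from the current git revision
    version = GIT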
Daniel Grana
5f1b1c05f8
Do not filter requests with dont_filter attribute set in OffsiteMiddleware
2011-09-08 15:18:10 -03:00
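Illustrative only: with this change, a request outside allowed_domains is no longer dropped by OffsiteMiddleware when dont_filter is set (the URL is a placeholder).

    from scrapy import Request

    req = Request("http://other-domain.example/page", dont_filter=True)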
Pablo Hoffman
bff3d31469
scrapyd: updated schedule.json response format
2011-09-04 09:29:24 -03:00
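A hedged sketch of scheduling a run and reading the response; the status/jobid shape shown is the one documented by current scrapyd, and the host, project and spider names are placeholders.

    import requests

    resp = requests.post("http://localhost:6800/schedule.json",
                         data={"project": "myproject", "spider": "somespider"})
    print(resp.json())   # e.g. {"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}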
Pablo Hoffman
a1dbc62b45
removed CONCURRENT_SPIDERS setting (use scrapyd maxproc instead)
2011-09-02 18:27:39 -03:00
Pablo Hoffman
40f7075f11
added initial documentation about suspend and resume crawls
2011-09-02 13:12:27 -03:00
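A minimal sketch, assuming the JOBDIR setting used by the resulting docs: pointing a crawl at a job directory lets it be stopped and later resumed from the persisted queues (the path is a placeholder).

    # settings.py
    JOBDIR = "crawls/somespider-1"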
Pablo Hoffman
27dd68a690
added SpiderState extension
2011-09-02 13:06:59 -03:00
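A hedged sketch of the spider.state dict that the SpiderState extension persists between runs when a job directory is configured (spider name is hypothetical):

    import scrapy

    class StatefulSpider(scrapy.Spider):
        name = "stateful"

        def parse(self, response):
            # self.state survives pause/resume cycles of the same job
            self.state["pages_seen"] = self.state.get("pages_seen", 0) + 1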
Pablo Hoffman
6a31ab667d
minor fix to doc
2011-09-01 15:08:23 -03:00
Pablo Hoffman
d98b058c21
no longer recommend using lambdas in the docs, as they're not friendly with scheduler persistence
2011-09-01 15:06:49 -03:00
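The reasoning: requests whose callbacks are lambdas cannot be serialized to the disk queues, so they lose scheduler persistence. A hedged sketch of the recommended alternative (names are hypothetical):

    import scrapy

    class NoLambdaSpider(scrapy.Spider):
        name = "no_lambda"

        def parse(self, response):
            # avoid: callback=lambda r: self.parse_item(r, "books")
            # prefer a named method, passing extra data through meta
            yield scrapy.Request(response.url, callback=self.parse_item,
                                 meta={"category": "books"}, dont_filter=True)

        def parse_item(self, response):
            self.logger.info("category: %s", response.meta["category"])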
Pablo Hoffman
76af0cdd44
updated documentation and code to use -s instead of --set option
2011-09-01 14:35:37 -03:00
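Example usage after the change (spider name and setting are placeholders):

    scrapy crawl somespider -s DOWNLOAD_DELAY=2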
Pablo Hoffman
98b68ca89d
scrapyd: documented support for passing settings to spiders in schedule.json
2011-08-27 01:31:12 -03:00
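A hedged sketch of passing a Scrapy setting when scheduling through scrapyd (the 'setting' parameter name follows the scrapyd docs; host and names are placeholders):

    import requests

    requests.post("http://localhost:6800/schedule.json",
                  data={"project": "myproject", "spider": "somespider",
                        "setting": "DOWNLOAD_DELAY=2"})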
Pablo Hoffman
5c6b0631e2
minor doc fix
2011-08-19 11:42:03 -03:00
Pablo Hoffman
9d97e73a24
fixed priority handling on the new scheduler so that it's backwards compatible (i.e. bigger priorities are higher). Also fixed a few documentation bugs related to request priorities
2011-08-19 08:26:41 -03:00
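Illustrative only: after this fix, requests with larger integer priorities are dispatched first (URLs are placeholders).

    from scrapy import Request

    urgent = Request("http://example.com/sitemap.xml", priority=10)
    normal = Request("http://example.com/page")   # default priority is 0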
Pablo Hoffman
a3697421c0
some minor updates to documentation
2011-08-11 09:19:59 -03:00
Pablo Hoffman
19e6da59d8
added new downloader middleware: ChunkedTransferMiddleware
2011-08-09 03:03:25 -03:00
Pablo Hoffman
984be35461
Some telnet console changes:
...
* renamed manager alias to crawler
* added aliases: spider, slot
* fixed est() function
2011-08-08 15:01:08 -03:00
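A hedged console sketch using the aliases above (the telnet console listens on port 6023 by default):

    $ telnet localhost 6023
    >>> spider.name                              # currently running spider
    >>> crawler.settings.get("DOWNLOAD_DELAY")   # access the crawler object
    >>> est()                                    # print a report of engine status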
Pablo Hoffman
f7c0aeccc6
added note about engine_started signal
2011-08-07 03:57:09 -03:00
Pablo Hoffman
9f60c27612
added setting to support disabling DNS cache: DNSCACHE_ENABLED
2011-08-05 20:41:59 -03:00
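Settings sketch for the new switch (it defaults to True):

    # settings.py
    DNSCACHE_ENABLED = False   # disable the in-memory DNS cache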
Pablo Hoffman
cb95d7a5af
added marshal to formats supported by feed exports
2011-08-03 16:16:48 -03:00
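A hedged settings sketch, assuming the FEED_URI/FEED_FORMAT settings of that era (the output path is a placeholder):

    # settings.py
    FEED_URI = "items.marshal"
    FEED_FORMAT = "marshal"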
Pablo Hoffman
549725215e
Initial support for a persistent scheduler, to support pausing and resuming
...
crawls.
* requests are serialized (using marshal by default) and stored on disk, using
one queue per priority
* request priorities must be integers now
* breadth-first and depth-first crawling orders can now be configured
through a new DEPTH_PRIORITY setting (see doc). Backwards compatibility with
SCHEDULER_ORDER was kept.
* requests that can't be serialized (for example, those with non-serializable
callbacks) are always kept in memory queues
* adapted the crawl spider to work with the persistent scheduler
2011-08-02 11:57:55 -03:00
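A hedged settings sketch of switching to breadth-first order via the new DEPTH_PRIORITY setting; the FIFO queue class paths follow current Scrapy docs and may differ in this release.

    # settings.py
    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
    SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"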
Pablo Hoffman
ce7a787970
Big downloader refactoring to support real concurrency limits per domain/ip,
...
instead of global limits per spider which were a bit useless.
This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new
settings:
* CONCURRENT_REQUESTS
* CONCURRENT_REQUESTS_PER_DOMAIN
* CONCURRENT_REQUESTS_PER_IP (overrides per domain)
The AutoThrottle extension had to be disabled, but will be ported and
re-enabled soon.
2011-07-27 13:38:09 -03:00
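Settings sketch for the three new limits (values are illustrative):

    # settings.py
    CONCURRENT_REQUESTS = 32              # global cap across all domains
    CONCURRENT_REQUESTS_PER_DOMAIN = 8
    CONCURRENT_REQUESTS_PER_IP = 0        # non-zero overrides the per-domain cap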
Pablo Hoffman
2ac08a713d
downloader: renamed SpiderInfo to Slot, for consistency with engine and scraper names
2011-07-22 02:06:10 -03:00
Pablo Hoffman
0e008268e1
removed SimpledbStatsCollector from scrapy code, it was moved to https://github.com/scrapinghub/scaws
2011-07-20 10:38:16 -03:00
Pablo Hoffman
84f518fc5e
More core changes:
...
* removed execution queue (replaced by newer spider queues)
* added real support for returning iterators in Spider.start_requests()
* removed support for passing urls to 'scrapy crawl' command
2011-07-15 15:18:39 -03:00
Pablo Hoffman
dbad1373f1
Automated merge with ssh://hg.scrapy.org:2222/scrapy-0.12
2011-07-13 18:44:54 -03:00