scrapy

mirror of https://github.com/scrapy/scrapy.git synced 2025-02-28 10:04:14 +00:00

Author	SHA1	Message	Date
Pablo Hoffman	9f60c27612	added setting to support disabling DNS cache: DNSCACHE_ENABLED	2011-08-05 20:41:59 -03:00
Pablo Hoffman	549725215e	Initial support for a persistent scheduler, to support pausing and resuming crawls. * requests are serialized (using marshal by default) and stored on disk, using one queue per priority * request priorities must be integers now * breadth-first and depth-first crawling orders can now be configured through a new DEPTH_PRIORITY setting (see doc). backwards compatibility with SCHEDULER_ORDER was kept. * requests that can't be serialized (for example, non-serializable callbacks) are always kept in memory queues * adapted crawl spider to work with persistent scheduler	2011-08-02 11:57:55 -03:00
Pablo Hoffman	ce7a787970	Big downloader refactoring to support real concurrency limits per domain/ip, instead of global limits per spider which were a bit useless. This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new settings: * CONCURRENT_REQUESTS * CONCURRENT_REQUESTS_PER_DOMAIN * CONCURRENT_REQUESTS_PER_IP (overrides per domain) The AutoThrottle extension had to be disabled, but will be ported and re-enabled soon.	2011-07-27 13:38:09 -03:00
Pablo Hoffman	91dc46539f	added LogStats extension for periodically logging basic stats (like crawled pages and scraped items)	2011-06-14 00:50:05 -03:00
Pablo Hoffman	9d9c8877da	added 'scrapy edit' command	2011-06-05 22:02:56 -03:00
Pablo Hoffman	2fa0f75f2d	added COOKIES_ENABLED setting to support disabling the cookies middleware	2011-05-27 00:35:34 -03:00
Pablo Hoffman	503f302010	removed remaining references to scheduler middleware from doc, as it will be removed on next release	2011-05-18 19:48:48 -03:00
Pablo Hoffman	3fd17432cf	fixed outdated documentation	2011-05-18 14:46:20 -03:00
Pablo Hoffman	495152bd50	disabled verbose depth stats collection by default, added DEPTH_STATS_VERBOSE setting to enable it	2011-05-18 11:04:48 -03:00
Pablo Hoffman	accb6ed830	dump stats to log by default (ie. change default value of STATS_DUMP to True)	2011-05-17 22:42:05 -03:00
Pablo Hoffman	b76c5c597f	* Added support for project data storage (closes #276 ) * Documented project file structure * Moved default location of SQLite database to project data storage dir (closes #277)	2010-10-31 03:25:37 -02:00
Pablo Hoffman	9599bde3e9	Removed RequestLimitMiddleware	2010-09-22 16:09:13 -03:00
Pablo Hoffman	ed4aec187f	Ported code to use new unified access to spider settings, keeping backwards compatibility for old spider attributes. Refs #245	2010-09-22 16:09:13 -03:00
Pablo Hoffman	b6c2b55e5b	Split settings classes from settings singleton. Closes #244 --HG-- rename : scrapy/conf/__init__.py => scrapy/conf.py rename : scrapy/conf/default_settings.py => scrapy/settings/default_settings.py rename : scrapy/tests/test_conf.py => scrapy/tests/test_settings.py	2010-09-22 15:47:33 -03:00
Pablo Hoffman	766f2d910d	Renamed Request Handlers to Download Handlers	2010-09-05 19:35:53 -03:00
Pablo Hoffman	6bf52fb50e	Make telnet console and web service try a range of ports for binding, instead of just one. Closes #226	2010-09-05 06:48:08 -03:00
Pablo Hoffman	14e985b076	Updated Command line tool documentation	2010-09-05 05:29:58 -03:00
Pablo Hoffman	1190f97944	Updated settings documentation	2010-09-05 04:58:14 -03:00
Pablo Hoffman	053d45e79f	Split stats collector classes from stats collection facility (#204 ) * moved scrapy.stats.collector.__init__ module to scrapy.statscol * moved scrapy.stats.collector.simpledb module to scrapy.contrib.statscol * moved signals from scrapy.stats.signals to scrapy.signals * moved scrapy/stats/__init__.py to scrapy/stats.py * updated documentation and tests accordingly --HG-- rename : scrapy/stats/collector/simpledb.py => scrapy/contrib/statscol.py rename : scrapy/stats/__init__.py => scrapy/stats.py rename : scrapy/stats/collector/__init__.py => scrapy/statscol.py	2010-08-22 01:24:07 -03:00
Pablo Hoffman	9aefa242d5	Applied documentation patch provided by Lucian Ursu (closes #207 )	2010-08-21 01:26:35 -03:00
Pablo Hoffman	94ead94bf6	Improved documentation of Scrapy command-line tool --HG-- rename : docs/topics/cmdline.rst => docs/topics/commands.rst	2010-08-19 00:04:52 -03:00
Pablo Hoffman	34554da201	Deprecated scrapy-ctl.py command in favour of simpler "scrapy" command. Closes #199 . Also updated documentation accordingly and added convenient scrapy.bat script for running from Windows. --HG-- rename : debian/scrapy-ctl.1 => debian/scrapy.1 rename : docs/topics/scrapy-ctl.rst => docs/topics/cmdline.rst	2010-08-18 19:48:32 -03:00
Pablo Hoffman	a71521bfba	Default per-command settings are now specified in the default_settings attribute of the command object. Closes #201	2010-08-17 18:30:13 -03:00
Pablo Hoffman	e741a807d2	Added new Feed exports extension with documentation and storage tests. Closes #197 . Also deprecated File export pipeline (to be removed in Scrapy 0.11). Still need to add tests for FeedExport main extension code.	2010-08-17 14:27:48 -03:00
Pablo Hoffman	1df2c17b78	updated old documentation references	2010-08-12 20:45:11 -03:00
Pablo Hoffman	bd16d1cd48	Added SMTP-AUTH support to scrapy.mail (closes #149 )	2010-06-13 17:14:46 -03:00
Pablo Hoffman	6a33d6c4d0	* Added Scrapy Web Service with documentation and tests. * Marked Web Console as deprecated. * Removed Web Console documentation to discourage its use.	2010-06-09 13:46:22 -03:00
Pablo Hoffman	031eb1e5ed	removed no longer used SpiderScheduler (obsoleted by ExecutionQueue)	2010-05-28 17:27:15 -03:00
Ismael Carnales	a71dc295af	Some mail improvements and tests. * Add mail_sent signal and use it in MailSender * Add MAIL_DEBUG setting to not send mails when testing * Add MailSender tests	2010-05-28 16:51:47 -03:00
Daniel Grana	68a875edb0	update ENCODING_ALIASES setting default value in settings documentation topic	2010-04-07 10:54:54 -03:00
Pablo Hoffman	2299deda66	updated wrong link in doc	2010-03-26 14:02:33 -03:00
Pablo Hoffman	1330697c3d	Some improvements to Response encoding support: * added encoding aliases, configurable through a new ENCODING_ALIASES setting * Response.encoding now returns the real encoding detected for the body * simplified TextResponse API by removing body_encoding() and headers_encoding() methods * Response.encoding now tries to infer the encoding from the body always (it was done before only on HtmlResponse and TextResponse) * removed scrapy.utils.encoding.add_encoding_alias() function * updated implementation of scrapy.utils.response function to reflect these API changes * updated documentation to reflect API changes	2010-03-25 15:47:10 -03:00
Pablo Hoffman	9ddcd1095d	sort setting alphabetically	2010-03-25 11:45:06 -03:00
Pablo Hoffman	4fa833c849	Added LOG_ENCODING setting	2010-03-24 12:13:38 -03:00
Pablo Hoffman	d12cd22d5e	switched default scheduler order to DFO, which consumes less memory by default	2010-03-04 10:15:58 -02:00
Pablo Hoffman	c1f8198639	Added RANDOMIZE_DOWNLOAD_DELAY setting	2010-02-19 21:53:18 -02:00
Pablo Hoffman	57d60eae39	sort settings doc alphabetically by setting name	2010-01-31 18:11:13 -02:00
Pablo Hoffman	08eeaf98a2	fixed description of LOG_STDOUT setting	2010-01-13 15:51:08 -02:00
Pablo Hoffman	07655d05ea	renamed REQUESTS_PER_SPIDER setting to CONCURRENT_REQUESTS_PER_SPIDER	2009-11-13 14:38:22 -02:00
Pablo Hoffman	564abd10ad	Refactored HttpCache middleware: * simplified code * performance improvements * removed awkward/unused domain sectorization * it can now receive Settings on constructor * added unittests * added documentation about filesystem storage structure Also made scrapy.conf.Settings objects instantiable with a dict which is used to override default settings.	2009-11-13 14:25:47 -02:00
Pablo Hoffman	919cd5b789	renamed setting CONCURRENT_DOMAINS to CONCURRENT_SPIDERS	2009-11-06 15:44:11 -02:00
Pablo Hoffman	d604dca96d	renamed setting REQUESTS_PER_DOMAIN to REQUESTS_PER_SPIDER	2009-11-06 15:42:11 -02:00
Pablo Hoffman	7296a7b889	added DEFAULT_RESPONSE_ENCODING setting	2009-10-21 16:13:41 -02:00
Pablo Hoffman	937acd91d1	improved documentation of http proxy middleware	2009-10-07 21:00:34 -02:00
Daniel Grana	8aa7d153ae	rewrite of downloader handlers * add REQUEST_HANDLERS setting with defaults for file, http and https schemes * add documentation of new setting * add unittests for all the builtin handlers * remove unused getPage function	2009-10-05 04:10:22 -02:00
Pablo Hoffman	921fc4f3bf	Big Scrapy core refactoring to pass around spider references instead of domains. This is to avoid accessing the scrapy.spider.spiders singleton for "resolving" spiders, which is considered an "evil" practice because it ties us to the singleton model for the spider resolver, which is a bad thing. This change will also work as the foundation for the API cleaning that we'll perform for 0.8. We decided to introduce this change now to have a more common basecode between 0.7 and 0.8, which will allow us to better support 0.7 until 0.8 is released. However, this change doesn't modify the stable/documented API, nor does it change the core logic. Those changes will land on the 0.8 branch, after 0.7 is released. --HG-- rename : scrapy/contrib/domainsch.py => scrapy/contrib/spiderscheduler.py	2009-09-12 14:34:18 -03:00
Pablo Hoffman	f1bb8dc2a3	first cleanup of spider manager api - removed asdict() and reload() methods - added list() method - removed default spider	2009-09-10 19:06:46 -03:00
Pablo Hoffman	269724a2b7	added Debugger extension, removed StackTraceDump from extensions available by default	2009-09-08 22:32:17 -03:00
Pablo Hoffman	827aa19c6e	removed obsolete scrapy.utils.db module	2009-09-04 17:38:14 -03:00
Pablo Hoffman	861a803cc3	removed obsolete RestrictMiddleware	2009-09-04 17:22:56 -03:00

117 Commits