scrapy

mirror of https://github.com/scrapy/scrapy.git synced 2025-02-27 23:24:01 +00:00

Author	SHA1	Message	Date
Pablo Hoffman	27dd68a690	added SpiderState extension	2011-09-02 13:06:59 -03:00
Pablo Hoffman	76af0cdd44	updated documentation and code to use -s instead of --set option	2011-09-01 14:35:37 -03:00
Pablo Hoffman	9d97e73a24	fixed priority handling on the new scheduler so that it's backwards compatible (ie. bigger priorities are higher). also fixed a few documentation bugs related to requests priority	2011-08-19 08:26:41 -03:00
Pablo Hoffman	a3697421c0	some minor updates to documentation	2011-08-11 09:19:59 -03:00
Pablo Hoffman	19e6da59d8	added new downloader middleware: ChunkedTransferMiddleware	2011-08-09 03:03:25 -03:00
Pablo Hoffman	9f60c27612	added setting to support disabling DNS cache: DNSCACHE_ENABLED	2011-08-05 20:41:59 -03:00
Pablo Hoffman	549725215e	Initial support for a persistent scheduler, to support pausing and resuming crawls. * requests are serialized (using marshal by default) and stored on disk, using one queue per priority * request priorities must be integers now * breadth-first and depth-first crawling orders can now be configured through a new DEPTH_PRIORITY setting (see doc). backwards compatibility with SCHEDULER_ORDER was kept. * requests that can't be serialized (for example, non serializable callbacks) are always kept in memory queues * adapted crawl spider to work with persistent scheduler	2011-08-02 11:57:55 -03:00
Pablo Hoffman	ce7a787970	Big downloader refactoring to support real concurrency limits per domain/ip, instead of global limits per spider which were a bit useless. This removes the setting CONCURRENT_REQUESTS_PER_SPIDER and adds three new settings: * CONCURRENT_REQUESTS * CONCURRENT_REQUESTS_PER_DOMAIN * CONCURRENT_REQUESTS_PER_IP (overrides per domain) The AutoThrottle extension had to be disabled, but will be ported and re-enabled soon.	2011-07-27 13:38:09 -03:00
Pablo Hoffman	91dc46539f	added LogStats extension for periodically logging basic stats (like crawled pages and scraped items)	2011-06-14 00:50:05 -03:00
Pablo Hoffman	9d9c8877da	added 'scrapy edit' command	2011-06-05 22:02:56 -03:00
Pablo Hoffman	2fa0f75f2d	added COOKIES_ENABLED setting to support disabling the cookies middleware	2011-05-27 00:35:34 -03:00
Pablo Hoffman	503f302010	removed remaining references to scheduler middleware from doc, as it will be removed on next release	2011-05-18 19:48:48 -03:00
Pablo Hoffman	3fd17432cf	fixed outdated documentation	2011-05-18 14:46:20 -03:00
Pablo Hoffman	495152bd50	disabled verbose depth stats collection by default, added DEPTH_STATS_VERBOSE setting to enable it	2011-05-18 11:04:48 -03:00
Pablo Hoffman	accb6ed830	dump stats to log by default (ie. change default value of STATS_DUMP to True)	2011-05-17 22:42:05 -03:00
Pablo Hoffman	b76c5c597f	* Added support for project data storage (closes #276 ) * Documented project file structure * Moved default location of SQLite database to project data storage dir (closes #277)	2010-10-31 03:25:37 -02:00
Pablo Hoffman	9599bde3e9	Removed RequestLimitMiddleware	2010-09-22 16:09:13 -03:00
Pablo Hoffman	ed4aec187f	Ported code to use new unified access to spider settings, keeping backwards compatibility for old spider attributes. Refs #245	2010-09-22 16:09:13 -03:00
Pablo Hoffman	b6c2b55e5b	Split settings classes from settings singleton. Closes #244 --HG-- rename : scrapy/conf/__init__.py => scrapy/conf.py rename : scrapy/conf/default_settings.py => scrapy/settings/default_settings.py rename : scrapy/tests/test_conf.py => scrapy/tests/test_settings.py	2010-09-22 15:47:33 -03:00
Pablo Hoffman	766f2d910d	Renamed Request Handlers to Download Handlers	2010-09-05 19:35:53 -03:00
Pablo Hoffman	6bf52fb50e	Make telnet console and web service try a range of ports for binding, instead of just one. Closes #226	2010-09-05 06:48:08 -03:00
Pablo Hoffman	14e985b076	Updated Command line tool documentation	2010-09-05 05:29:58 -03:00
Pablo Hoffman	1190f97944	Updated settings documentation	2010-09-05 04:58:14 -03:00
Pablo Hoffman	053d45e79f	Split stats collector classes from stats collection facility (#204 ) * moved scrapy.stats.collector.__init__ module to scrapy.statscol * moved scrapy.stats.collector.simpledb module to scrapy.contrib.statscol * moved signals from scrapy.stats.signals to scrapy.signals * moved scrapy/stats/__init__.py to scrapy/stats.py * updated documentation and tests accordingly --HG-- rename : scrapy/stats/collector/simpledb.py => scrapy/contrib/statscol.py rename : scrapy/stats/__init__.py => scrapy/stats.py rename : scrapy/stats/collector/__init__.py => scrapy/statscol.py	2010-08-22 01:24:07 -03:00
Pablo Hoffman	9aefa242d5	Applied documentation patch provided by Lucian Ursu (closes #207 )	2010-08-21 01:26:35 -03:00
Pablo Hoffman	94ead94bf6	Improved documentation of Scrapy command-line tool --HG-- rename : docs/topics/cmdline.rst => docs/topics/commands.rst	2010-08-19 00:04:52 -03:00
Pablo Hoffman	34554da201	Deprecated scrapy-ctl.py command in favour of simpler "scrapy" command. Closes #199 . Also updated documentation accordingly and added convenient scrapy.bat script for running from Windows. --HG-- rename : debian/scrapy-ctl.1 => debian/scrapy.1 rename : docs/topics/scrapy-ctl.rst => docs/topics/cmdline.rst	2010-08-18 19:48:32 -03:00
Pablo Hoffman	a71521bfba	Default per-command settings are now specified in the default_settings attribute of the command object. Closes #201	2010-08-17 18:30:13 -03:00
Pablo Hoffman	e741a807d2	Added new Feed exports extension with documentation and storage tests. Closes #197 . Also deprecated File export pipeline (to be removed in Scrapy 0.11). Still need to add tests for FeedExport main extension code.	2010-08-17 14:27:48 -03:00
Pablo Hoffman	1df2c17b78	updated old documentation references	2010-08-12 20:45:11 -03:00
Pablo Hoffman	bd16d1cd48	Added SMTP-AUTH support to scrapy.mail (closes #149 )	2010-06-13 17:14:46 -03:00
Pablo Hoffman	6a33d6c4d0	* Added Scrapy Web Service with documentation and tests. * Marked Web Console as deprecated. * Removed Web Console documentation to discourage its use.	2010-06-09 13:46:22 -03:00
Pablo Hoffman	031eb1e5ed	removed no longer used SpiderScheduler (obsoleted by ExecutionQueue)	2010-05-28 17:27:15 -03:00
Ismael Carnales	a71dc295af	Some mail improvements and tests. * Add mail_sent signal and use it in MailSender * Add MAIL_DEBUG setting to not send mails when testing * Add MailSender tests	2010-05-28 16:51:47 -03:00
Daniel Grana	68a875edb0	update ENCODING_ALIASES setting default value in settings documentation topic	2010-04-07 10:54:54 -03:00
Pablo Hoffman	2299deda66	updated wrong link in doc	2010-03-26 14:02:33 -03:00
Pablo Hoffman	1330697c3d	Some improvements to Response encoding support: * added encoding aliases, configurable through a new ENCODING_ALIASES setting * Response.encoding now returns the real encoding detected for the body * simplified TextResponse API by removing body_encoding() and headers_encoding() methods * Response.encoding now tries to infer the encoding from the body always (it was done before only on HtmlResponse and TextResponse) * removed scrapy.utils.encoding.add_encoding_alias() function * updated implementation of scrapy.utils.response function to reflect these API changes * updated documentation to reflect API changes	2010-03-25 15:47:10 -03:00
Pablo Hoffman	9ddcd1095d	sort setting alphabetically	2010-03-25 11:45:06 -03:00
Pablo Hoffman	4fa833c849	Added LOG_ENCODING setting	2010-03-24 12:13:38 -03:00
Pablo Hoffman	d12cd22d5e	switched default scheduler order to DFO, which consumes less memory by default	2010-03-04 10:15:58 -02:00
Pablo Hoffman	c1f8198639	Added RANDOMIZE_DOWNLOAD_DELAY setting	2010-02-19 21:53:18 -02:00
Pablo Hoffman	57d60eae39	sort settings doc alphabetically by setting name	2010-01-31 18:11:13 -02:00
Pablo Hoffman	08eeaf98a2	fixed description of LOG_STDOUT setting	2010-01-13 15:51:08 -02:00
Pablo Hoffman	07655d05ea	renamed REQUESTS_PER_SPIDER setting to CONCURRENT_REQUESTS_PER_SPIDER	2009-11-13 14:38:22 -02:00
Pablo Hoffman	564abd10ad	Refactored HttpCache middleware: * simplified code * performance improvements * removed awkward/unused domain sectorization * it can now receive Settings on constructor * added unittests * added documentation about filesystem storage structure Also made scrapy.conf.Settings objects instantiable with a dict which is used to override default settings.	2009-11-13 14:25:47 -02:00
Pablo Hoffman	919cd5b789	renamed setting CONCURRENT_DOMAINS to CONCURRENT_SPIDERS	2009-11-06 15:44:11 -02:00
Pablo Hoffman	d604dca96d	renamed setting REQUESTS_PER_DOMAIN to REQUESTS_PER_SPIDER	2009-11-06 15:42:11 -02:00
Pablo Hoffman	7296a7b889	added DEFAULT_RESPONSE_ENCODING setting	2009-10-21 16:13:41 -02:00
Pablo Hoffman	937acd91d1	improved documentation of http proxy middleware	2009-10-07 21:00:34 -02:00
Daniel Grana	8aa7d153ae	rewrite of downloader handlers * add REQUEST_HANDLERS setting with defaults for file, http and https schemes * add documentation of new setting * add unittests for all the builtin handlers * remove unused getPage function	2009-10-05 04:10:22 -02:00

122 Commits