scrapy

mirror of https://github.com/scrapy/scrapy.git synced 2025-02-24 23:04:14 +00:00

Author	SHA1	Message	Date
Daniel Grana	68a875edb0	update ENCODING_ALIASES setting default value in settings documentation topic	2010-04-07 10:54:54 -03:00
Daniel Grana	8b86e1d008	Minimize effect of http://bugs.python.org/issue8271 on TextResponses by replacing the str.decode errors policy with a custom `replace`-like error handler	2010-04-07 00:29:53 -03:00
Pablo Hoffman	3fcd69c347	added a couple additional TwistedPluginSpiderManager tests	2010-04-06 10:55:21 -03:00
daniel	2cd591e8a7	add missing dropin.cache file required by default spidermanager tests	2010-04-06 07:22:50 +01:00
Daniel Grana	0b07742adb	gb2312 and gbk encodings were superseded by gb18030	2010-04-05 15:07:43 -03:00
Pablo Hoffman	0dfec04439	made Spider name required again (do not default)	2010-04-05 12:34:29 -03:00
Daniel Grana	70ac6642d5	SEP-012: bugfix backward compatibility of Spider.domain_name and Spider.extra_domain_names --HG-- extra : rebase_source : 66f779cddc6854092951078d443dbf9113f7576a	2010-04-05 12:09:43 -03:00
Pablo Hoffman	77a4d9aba9	use a default name for spiders constructed without names	2010-04-05 11:53:22 -03:00
Pablo Hoffman	c99e1af766	Added support for passing generic arguments to spider constructors (refs #152), extended Spider tests, added unittests for TwistedPluginSpiderManager	2010-04-05 11:27:19 -03:00
Pablo Hoffman	de32612c99	Automated merge with http://hg.scrapy.org/scrapy-0.8	2010-04-02 02:49:51 -03:00
Pablo Hoffman	dfdac356af	added missing default values to file exporter doc	2010-04-02 02:49:18 -03:00
Rolando Espinoza La fuente	db5c3df679	SEP12 implementation * Rename BaseSpider.domain_name to BaseSpider.name This patch implements the domain_name to name change in BaseSpider class and changes all spider instantiations to use the new attribute. * Add allowed_domains to spider This patch implements the merging of spider.domain_name and spider.extra_domain_names in spider.allowed_domains for offsite checking purposes. Note that spider.domain_name is not touched by this patch, only not used. * Remove spider.domain_name references from scrapy.stats * Rename domain_stats to spider_stats in MemoryStatsCollector * Use ``spider`` instead of ``domain`` in SimpledbStatsCollector * Rename domain_stats_history table to spider_data_history and rename domain field to spider in MysqlStatsCollector * Refactor genspider command The new signature for genspider is: genspider [options] <domain_name>. Genspider uses domain_name for spider name and for the module name. * Remove spider.domain_name references * Update crawl command signature <spider|url> * docs: updated references to domain_name * examples/experimental: use spider.name * genspider: require <name> <domain> * spidermanager: renamed crawl_domain to crawl_spider_name * spiderctl: updated references of domain to spider * added backward compatibility with legacy spider's attributes 'domain_name' and 'extra_domain_names'	2010-04-01 18:27:22 -03:00
Rolando Espinoza La fuente	35a7059636	cleanup and refactor of parse & fetch commands * removed scrapy.utils.fetch * each command schedules requests and starts scrapy engine * fetch command instantiates BaseSpider if given url does not match any spider or matches more than one * parse command schedules url if one spider matches * parse and fetch don't support multiple urls as parameter * force spider behavior --spider moved from BaseCommand to only commands: fetch, parse, crawl	2010-04-01 17:16:38 -03:00
Rolando Espinoza La fuente	dd477914db	spidermanager refactoring * Implements find/create method in Spider Manager API, removed fromdomain and fromurl This method is now in charge of spider resolution, it must return spider object from its argument or raise KeyError if no spider is found. This method obsoletes from_domain and from_url methods. The default implementation of resolve only searches against spider.name, it won't use spider.allowed_domains like the old fromdomain. This is why you must supply a spider if you want to crawl a url. Find methods return only available spider names, not spider instances. If no spider is found, they return an empty list. Affected modules: * command.models (force_domain) * removed spiders.force_domain * each command passes spider to crawl_* commands * command.commands.* * crawl * set spider from opts.spider if arg is url * group urls by spider to instantiate spider just once * genspider * use spiders.create() to check spider id * parse * log error if more than one spider found * core.manager * on crawl_* log message if multiple spiders found for url or request * shell * prints "Multiple found" if more than one spider found for url or request * populate_vars(): added spider keyword parameter * contrib.spidermanager: * removed fromdomain() & fromurl() * new create(spider_id) -> Spider. Raises KeyError if spider not found * new find_by_request(request) -> list(spiders)	2010-04-01 17:16:38 -03:00
Rolando Espinoza La fuente	8db67b17a3	scrapy manager refactor * ExecutionManager * deprecated runonce(args) changed start() to start(keep_alive=Bool) * changed crawl(args) to crawl(requests, spider=None) if no spider given, tries to resolve spider for each request * added crawl_url(url, spider=None) * added crawl_request(request, spider=None) * added crawl_domain(domain) * added crawl_spider(spider) * updated commands: crawl, runspider, start * updated webconsole * updated crawler * updated tests.test_engine * updated utils.fetch	2010-04-01 17:16:38 -03:00
Pablo Hoffman	32f9c5fe68	removed old untested (and probably broken) code	2010-04-01 04:05:53 -03:00
Pablo Hoffman	4dc886e319	Improved comment	2010-03-31 18:26:35 -03:00
Pablo Hoffman	83d5eff0b7	More refactoring to encoding handling in TextResponse and subclasses	2010-03-31 18:21:41 -03:00
Pablo Hoffman	de896fa62d	Refactored implementation of Request.replace() and Response.replace()	2010-03-31 16:29:53 -03:00
Pablo Hoffman	2ed8a5bfb5	Automated merge with http://hg.scrapy.org/scrapy-0.8	2010-03-27 13:25:06 -03:00
Pablo Hoffman	2f75839e7a	Ignore noisy Twisted deprecation warnings	2010-03-27 13:23:13 -03:00
Pablo Hoffman	2299deda66	updated wrong link in doc	2010-03-26 14:02:33 -03:00
Pablo Hoffman	7cf2f87e27	Automated merge with http://hg.scrapy.org/scrapy-0.8	2010-03-26 08:29:34 -03:00
Pablo Hoffman	f19c939925	fixed doc typo	2010-03-26 08:28:32 -03:00
Daniel Grana	996a1b3574	fix handling of relative base urls in get_base_url util --HG-- extra : rebase_source : eb552219e6bf40bc0d2e35968c367105233b6ecc	2010-03-25 15:50:34 -03:00
Pablo Hoffman	1330697c3d	Some improvements to Response encoding support: * added encoding aliases, configurable through a new ENCODING_ALIASES setting * Response.encoding now returns the real encoding detected for the body * simplified TextResponse API by removing body_encoding() and headers_encoding() methods * Response.encoding now tries to infer the encoding from the body always (it was done before only on HtmlResponse and TextResponse) * removed scrapy.utils.encoding.add_encoding_alias() function * updated implementation of scrapy.utils.response function to reflect these API changes * updated documentation to reflect API changes	2010-03-25 15:47:10 -03:00
Daniel Grana	173e94386b	Support relative url used in base tag. closes #148 --HG-- extra : rebase_source : 1bff87c127a7e9d8d12c772b3068feb11eb5d97f	2010-03-25 12:38:37 -03:00
Pablo Hoffman	9ddcd1095d	sort setting alphabetically	2010-03-25 11:45:06 -03:00
Pablo Hoffman	cb49567ca6	Removed wrong line added in previous commit	2010-03-24 12:15:18 -03:00
Pablo Hoffman	45411926b5	Improved encoding support by explicitly passing encoding to all str_to_unicode() and unicode_to_str() calls	2010-03-24 12:14:07 -03:00
Pablo Hoffman	4fa833c849	Added LOG_ENCODING setting	2010-03-24 12:13:38 -03:00
Pablo Hoffman	87e68e7438	Made MailSender non IO-blocking, and improved MailSender documentation	2010-03-22 13:37:37 -03:00
Pablo Hoffman	1dfc79b5d0	Automated merge with http://hg.scrapy.org/scrapy-0.8	2010-03-20 20:48:11 -03:00
Pablo Hoffman	99a876754c	Improved "What else?" section of "Scrapy at a glance" overview	2010-03-20 20:24:18 -03:00
Pablo Hoffman	264cd2e035	Automated merge with http://hg.scrapy.org/scrapy-0.8	2010-03-19 10:32:42 -03:00
Pablo Hoffman	234fd709ad	fixed doc typo (thanks Victor)	2010-03-19 10:32:17 -03:00
Daniel Grana	184cf6684f	Remove HttpException references from docs. Since 0.7, scrapy returns non-200 as Response objects and does not raise HttpException anymore	2010-03-18 10:05:33 -03:00
Pablo Hoffman	403a21ec74	removed obsolete scrapy.crawler module	2010-03-12 17:28:33 -02:00
Daniel Grana	17091902f3	Explicitly say where to save item class in "Defining our item" section of tutorial	2010-03-12 14:12:49 -02:00
Pablo Hoffman	54ae2c36d0	better implementation of open_in_browser() tests	2010-03-12 10:19:50 -02:00
Pablo Hoffman	38a296aa2c	Added tests to open_in_browser() function	2010-03-12 09:52:39 -02:00
Pablo Hoffman	2ab94d75e2	Automated merge with http://hg.scrapy.org/scrapy-0.8	2010-03-12 09:32:35 -02:00
Pablo Hoffman	c5cd8b9d3d	Fixed bug in open_in_browser() function with Python 2.5 (closes #145).	2010-03-12 09:31:05 -02:00
Pablo Hoffman	39e4df0cff	removed unmaintained (and untested) contrib_exp ShoveItemPipeline	2010-03-10 00:10:36 -02:00
Pablo Hoffman	a505a9d490	minor code refactoring on scrapy.command.cmdline module	2010-03-04 11:09:16 -02:00
Pablo Hoffman	4c1ec0c97e	replaced hacky command_executed dict by standard signal	2010-03-04 10:58:18 -02:00
Pablo Hoffman	861f9691c7	removed partly-obsolete module scrapy.contrib.groupsettings	2010-03-04 10:40:41 -02:00
Pablo Hoffman	d12cd22d5e	switched default scheduler order to DFO, which consumes less memory by default	2010-03-04 10:15:58 -02:00
Daniel Grana	700be3202b	Automated merge with ssh://hg.scrapy.org/scrapy-0.8	2010-02-24 15:44:41 -02:00
Daniel Grana	2322322ee6	Add missing priority and errback arguments to Request.replace method signature	2010-02-24 15:43:09 -02:00

1987 Commits