* Rename BaseSpider.domain_name to BaseSpider.name
This patch implements the domain_name to name change in the BaseSpider class
and changes all spider instantiations to use the new attribute.
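A minimal before/after sketch of the rename (the import path is the one used
at the time)::

    from scrapy.spider import BaseSpider

    class ExampleSpider(BaseSpider):
        # was: domain_name = 'example.com'
        name = 'example.com'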
* Add allowed_domains to spider
This patch merges spider.domain_name and spider.extra_domain_names into
spider.allowed_domains for offsite checking purposes.
Note that spider.domain_name is not touched by this patch; it is simply no
longer used.
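Continuing the sketch above, a spider that previously set both attributes now
exposes the merged list (values illustrative)::

    from scrapy.spider import BaseSpider

    class ExampleSpider(BaseSpider):
        name = 'example.com'
        # was: domain_name = 'example.com'
        #      extra_domain_names = ['sub.example.com']
        allowed_domains = ['example.com', 'sub.example.com']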
* Remove spider.domain_name references from scrapy.stats
* Rename domain_stats to spider_stats in MemoryStatsCollector (see the sketch
after these entries)
* Use ``spider`` instead of ``domain`` in SimpledbStatsCollector
* Rename domain_stats_history table to spider_data_history and rename domain
field to spider in MysqlStatsCollector
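A hypothetical sketch of the MemoryStatsCollector rename (the class body is
illustrative, not the actual implementation)::

    class MemoryStatsCollector(object):
        """Keeps one stats dict per spider instead of per domain."""

        def __init__(self):
            # was: self.domain_stats = {}
            self.spider_stats = {}  # maps spider name -> stats dict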
* Refactor genspider command
The new signature for genspider is: genspider [options] <domain_name>.
Genspider uses domain_name both as the spider name and as the module name.
* Remove spider.domain_name references
* Update crawl command signature <spider|url>
* docs: updated references to domain_name
* examples/experimental: use spider.name
* genspider: require <name> <domain>
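Hedged usage example, assuming the scrapy-ctl.py runner of that era::

    $ scrapy-ctl.py genspider example example.com    # <name> <domain>
    $ scrapy-ctl.py crawl example                    # crawl by spider name
    $ scrapy-ctl.py crawl http://example.com/page --spider=example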
* spidermanager: renamed crawl_domain to crawl_spider_name
* spiderctl: updated references of *domain* to spider
* added backward compatibility with the legacy spider attributes
'domain_name' and 'extra_domain_names' (see the sketch below)
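A hypothetical sketch of how such a shim can map the legacy attributes onto
the new ones (not the actual implementation)::

    import warnings

    class BaseSpider(object):
        name = None
        allowed_domains = []

        def __init__(self, name=None, **kwargs):
            if getattr(self, 'domain_name', None):
                warnings.warn("domain_name is deprecated, use name instead",
                              DeprecationWarning)
                self.name = self.domain_name
                self.allowed_domains = ([self.domain_name] +
                    list(getattr(self, 'extra_domain_names', [])))
            if name is not None:
                self.name = name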
* removed scrapy.utils.fetch
* each command schedules its requests and starts the Scrapy engine
* the fetch command instantiates a BaseSpider if the given URL does not match
any spider, or matches more than one (see the sketch after these entries)
* the parse command schedules the URL only if exactly one spider matches
* parse and fetch no longer accept multiple URLs as parameters
* the --spider option (to force a spider) was moved from BaseCommand to only
the commands that need it: fetch, parse and crawl
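An illustrative helper for the fetch fallback described above
(get_spider_for_fetch is a hypothetical name, not part of the API)::

    from scrapy.http import Request
    from scrapy.spider import BaseSpider

    def get_spider_for_fetch(spidermanager, url, forced_name=None):
        if forced_name:  # --spider was given
            return spidermanager.create(forced_name)
        names = spidermanager.find_by_request(Request(url))
        if len(names) == 1:
            return spidermanager.create(names[0])
        return BaseSpider('default')  # zero or multiple matches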
* Implement find/create methods in the Spider Manager API; removed fromdomain and fromurl
These methods are now in charge of spider resolution: resolve must return a
spider object for its argument, or raise KeyError if no spider is found.
It obsoletes the from_domain and from_url methods.
The default implementation of resolve only searches against spider.name; it
won't use spider.allowed_domains like the old fromdomain did. This is why you
must supply a spider explicitly when you want to crawl a URL.
The find methods return only the available spider names, not spider
instances; if no spider is found, they return an empty list.
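A minimal sketch of that default resolution, assuming an internal
name-to-spider mapping::

    class SpiderManager(object):

        def __init__(self):
            self._spiders = {}  # spider name -> spider object

        def resolve(self, spider_name):
            # looks up by name only; allowed_domains is deliberately ignored
            try:
                return self._spiders[spider_name]
            except KeyError:
                raise KeyError("Spider not found: %s" % spider_name)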
Affected modules:
* command.models (force_domain)
* removed spiders.force_domain
* each command passes its spider to the crawl_* calls
* command.commands.*
* crawl
* set the spider from opts.spider if the argument is a URL
* group URLs by spider so each spider is instantiated only once
* genspider
* use spiders.create() to validate the spider id
* parse
* log an error if more than one spider is found
* core.manager
* on crawl_*, log a message if multiple spiders are found for a URL or request
* shell
* print "Multiple found" if more than one spider is found for a URL or request
* populate_vars(): added spider keyword parameter
* contrib.spidermanager:
* removed fromdomain() & fromurl()
* new create(spider_id) -> Spider. Raises KeyError if spider not found
* new find_by_request(request) -> list(spiders)
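Usage sketch of the new API (pick_spider is a hypothetical helper that
mirrors the single-match rule above)::

    def pick_spider(spidermanager, request):
        """Return the spider for *request*, or None on zero/multiple matches."""
        names = spidermanager.find_by_request(request)
        if len(names) != 1:
            return None  # caller logs "none found" / "Multiple found"
        return spidermanager.create(names[0])  # name is known, no KeyError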
* added encoding aliases, configurable through a new ENCODING_ALIASES setting
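For example, in settings.py (the mapping shown is illustrative; it assumes a
plain dict from alias to real encoding name)::

    ENCODING_ALIASES = {
        'leet': 'latin-1',  # resolve the bogus 'leet' charset to latin-1
    }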
* Response.encoding now returns the real encoding detected for the body
* simplified TextResponse API by removing body_encoding() and
headers_encoding() methods
* Response.encoding now always tries to infer the encoding from the body
(previously this was done only on HtmlResponse and TextResponse)
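Illustrative example of the unified behaviour (the value shown is what one
would expect given the latin-1 declaration)::

    from scrapy.http import TextResponse

    resp = TextResponse(url='http://example.com/',
                        headers={'Content-Type': 'text/html; charset=latin-1'},
                        body='<p>ol\xe1</p>')
    resp.encoding  # the real encoding detected for the body, e.g. 'latin-1'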
* removed scrapy.utils.encoding.add_encoding_alias() function
* updated the scrapy.utils.response functions to reflect these API changes
* updated documentation to reflect API changes