mirror of https://github.com/scrapy/scrapy.git synced 2025-02-24 00:23:52 +00:00
scrapy/sep/sep-019.rst
2013-03-07 12:34:17 -02:00


SEP-019: Per-spider settings
============================
This is a proposal to add support for overriding settings on a per-spider
basis in a consistent way.
In short, you will be able to override settings (on a per-spider basis) by
implementing a class method in your spider::

    class MySpider(BaseSpider):

        @classmethod
        def custom_settings(cls):
            return {
                "DOWNLOAD_DELAY": 5.0,
                "RETRY_ENABLED": False,
            }

Proposed changes
----------------
- a new ``custom_settings`` class method will be added to spiders, to give them
  a chance to override settings *before* they're used to instantiate the crawler
- a new ``from_crawler`` class method will be added to spiders, to give spiders a
  chance to access settings, stats, or the crawler core components themselves
- the spider manager will be stripped out of the ``Crawler`` class
- the ``SPIDER_MODULES`` setting will be removed and replaced by an entry in
  ``scrapy.cfg``
- the ``Crawler`` object constructor will receive a spider class as its
  (required) first argument
- new settings will be added to ``scrapy.cfg`` to define the spider manager
  class and spider modules
- the ``Settings`` class will be split into two classes: ``SettingsLoader`` and
  ``SettingsReader``, and a new concept of "setting priority" will be added

Settings
========
The ``Settings`` class will be split into two classes: ``SettingsLoader`` and
``SettingsReader``.
SettingsLoader
--------------
- used at startup (only) to populate settings, then converted to a SettingsReader and discarded
- will have a method ``set(name, value, priority)`` to register a setting with a given priority
SettingsReader
--------------
- used by core, extensions, et al. to configure themselves
- read-only
- this will be the one with methods: ``get``, ``getint``, ``getfloat``, etc.
- this will be the one accessible via ``crawler.settings``
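
As an illustration of the read-only behaviour described above, here is a
minimal sketch of what a ``SettingsReader`` could look like. The constructor
signature, the default values of the typed getters and the use of
``MappingProxyType`` are assumptions made for this example; the proposal only
names the ``get``, ``getint`` and ``getfloat`` methods.

.. code-block:: python

    from types import MappingProxyType


    class SettingsReader:
        """Read-only view over already-populated settings (hypothetical sketch)."""

        def __init__(self, values):
            # Copy the input so later changes to the caller's dict have no
            # effect, then wrap the copy in a read-only proxy.
            self._values = MappingProxyType(dict(values))

        def get(self, name, default=None):
            return self._values.get(name, default)

        def getint(self, name, default=0):
            return int(self.get(name, default))

        def getfloat(self, name, default=0.0):
            return float(self.get(name, default))


    reader = SettingsReader({"DOWNLOAD_DELAY": "5.0", "CONCURRENT_REQUESTS": "16"})
    delay = reader.getfloat("DOWNLOAD_DELAY")
    concurrency = reader.getint("CONCURRENT_REQUESTS")

Since the class exposes no setter, components holding a reader cannot be misled
into thinking that a late change would take effect.
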
Setting priorities
------------------
There will be 5 setting priorities used by default:
- 0: global defaults (those in ``scrapy.settings.default_settings``)
- 10: per-command defaults (for example, shell runs with ``KEEP_ALIVE=True``)
- 20: project settings (those in ``settings.py``)
- 30: per-spider settings (those returned by ``Spider.custom_settings`` class method)
- 40: command line arguments (those passed in the command line)
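
The priority rule can be sketched as follows. This is only an illustration
under stated assumptions: the ``as_dict`` helper is hypothetical (it stands in
for the conversion to a ``SettingsReader``), and letting a later value replace
an earlier one at the *same* priority is a choice made for the example, not
something the proposal specifies.

.. code-block:: python

    class SettingsLoader:
        """Collects settings at startup, keeping the highest-priority value."""

        def __init__(self):
            self._entries = {}  # name -> (priority, value)

        def set(self, name, value, priority):
            current = self._entries.get(name)
            # Assumption: on equal priority, the most recent value wins.
            if current is None or priority >= current[0]:
                self._entries[name] = (priority, value)

        def as_dict(self):
            # Hypothetical helper: drop the priorities, keep the winning values.
            return {name: value for name, (_, value) in self._entries.items()}


    loader = SettingsLoader()
    loader.set("DOWNLOAD_DELAY", 0, priority=0)     # global default
    loader.set("DOWNLOAD_DELAY", 5.0, priority=30)  # per-spider setting
    loader.set("DOWNLOAD_DELAY", 2.0, priority=20)  # project setting, set late
    loader.set("RETRY_ENABLED", True, priority=0)   # global default
    resolved = loader.as_dict()

The project value registered last does not displace the per-spider value,
because the order of the calls does not matter, only the priority does.
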
Spider manager
==============
Currently, the spider manager is part of the crawler, which creates a cyclic
dependency between settings and spiders; it doesn't belong there. The spiders
should be loaded outside and passed to the crawler object, which will require a
spider class in order to be instantiated. It needs to be a class (not an
instance) because the ``SettingsReader`` must already be available by the time
the spider is instantiated.

This new spider manager will not have access to the settings (they won't be
loaded yet), so it will use ``scrapy.cfg`` to configure itself.
The ``scrapy.cfg`` would look like this::

    [settings]
    default = myproject.settings

    [spiders]
    manager = scrapy.spidermanager.SpiderManager
    modules = myproject.spiders

- ``manager`` replaces the ``SPIDER_MANAGER_CLASS`` setting and, if omitted,
  will default to ``scrapy.spidermanager.SpiderManager``
- ``modules`` replaces the ``SPIDER_MODULES`` setting and will be required
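
A minimal sketch of how the ``[spiders]`` section could be read with the
standard library ``configparser`` module. The function name
``read_spiders_section`` is hypothetical, and so is the assumption that several
modules would be separated by commas; the proposal does not say how multiple
modules are written.

.. code-block:: python

    from configparser import ConfigParser

    SCRAPY_CFG = """
    [settings]
    default = myproject.settings

    [spiders]
    manager = scrapy.spidermanager.SpiderManager
    modules = myproject.spiders
    """

    DEFAULT_MANAGER = "scrapy.spidermanager.SpiderManager"


    def read_spiders_section(text):
        """Return (manager class path, list of spider modules) from cfg text."""
        parser = ConfigParser()
        parser.read_string(text)
        # ``modules`` is required, so fail early when it is missing.
        if not parser.has_option("spiders", "modules"):
            raise ValueError("scrapy.cfg must define 'modules' in [spiders]")
        # ``manager`` is optional and falls back to the default class path.
        manager = parser.get("spiders", "manager", fallback=DEFAULT_MANAGER)
        raw_modules = parser.get("spiders", "modules")
        # Assumption: multiple modules are comma-separated.
        modules = [m.strip() for m in raw_modules.split(",") if m.strip()]
        return manager, modules


    manager_path, module_names = read_spiders_section(SCRAPY_CFG)

Nothing here touches the project settings, which is the point: the spider
manager can be located and configured before ``settings.py`` is loaded.
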
Startup process
===============
This describes the current and the proposed mechanism for starting up a Scrapy
crawler, assuming one is running the following command::

    scrapy crawl myspider -a arg=value -s DOWNLOAD_DELAY=5

Most of the code here lives in the ``scrapy.cmdline`` and
``scrapy.commands.crawl`` modules. Imports are omitted for brevity.

Current (old) startup process
-----------------------------
::

    settings = get_project_settings()  # loads settings.py
    settings.overrides.update(DOWNLOAD_DELAY=5)
    crawler = CrawlerProcess(settings)
    crawler.configure()  # loads extensions, middlewares, pipelines
    spider = crawler.spiders.create('myspider', arg='value')
    crawler.crawl(spider)
    crawler.start()  # starts crawling spider

Proposed (new) startup process
------------------------------
::

    smcls = get_spider_manager_class_from_scrapycfg()
    sm = smcls()  # loads spiders from the modules defined in scrapy.cfg
    spidercls = sm.load('myspider')  # NOTE: returns spider class, not instance
    settings = get_project_settings()  # loads settings.py
    settings.set('DOWNLOAD_DELAY', 5, priority=40)
    crawler = Crawler(spidercls, settings=settings)
    # inside the Crawler constructor:
    #     settings.override(spidercls.custom_settings())
    #     load extensions, middlewares, pipelines
    crawler.crawl(arg='value')
    # inside crawler.crawl():
    #     spider = self.spidercls.from_crawler(self, arg='value')
    #     starts crawling spider

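
The proposed flow can be exercised end to end with toy stand-ins. The sketch
below is not Scrapy code: the ``Crawler`` and ``MySpider`` classes are
simplified placeholders, settings are represented as a plain dict of
``(priority, value)`` pairs, and the spider constructor arguments are
assumptions made for illustration.

.. code-block:: python

    class Crawler:
        """Stand-in crawler: takes a spider class and resolves settings."""

        def __init__(self, spidercls, settings):
            self.spidercls = spidercls
            # ``settings`` maps name -> (priority, value). Per-spider values
            # are applied at priority 30, so they lose to anything higher.
            merged = dict(settings)
            for name, value in spidercls.custom_settings().items():
                if name not in merged or merged[name][0] <= 30:
                    merged[name] = (30, value)
            self.settings = {name: value for name, (_, value) in merged.items()}

        def crawl(self, **kwargs):
            # The spider is only instantiated here, once settings are final.
            return self.spidercls.from_crawler(self, **kwargs)


    class MySpider:
        def __init__(self, delay, arg=None):
            self.delay = delay
            self.arg = arg

        @classmethod
        def custom_settings(cls):
            return {"DOWNLOAD_DELAY": 5.0, "RETRY_ENABLED": False}

        @classmethod
        def from_crawler(cls, crawler, **kwargs):
            # The spider can read the final settings through the crawler.
            return cls(delay=crawler.settings["DOWNLOAD_DELAY"], **kwargs)


    # DOWNLOAD_DELAY comes from the project (20), RETRY_ENABLED from the
    # command line (40).
    settings = {"DOWNLOAD_DELAY": (20, 1.0), "RETRY_ENABLED": (40, True)}
    crawler = Crawler(MySpider, settings)
    spider = crawler.crawl(arg="value")

Here the per-spider ``DOWNLOAD_DELAY`` replaces the project value, while the
command-line ``RETRY_ENABLED`` survives the spider's attempt to override it.
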
What this solves
================
1. true overridable per-spider settings
2. supports accessing settings from spiders (currently not supported without
   hacky code)
3. avoids mistakenly believing you can change settings after they have been
   populated (you can, but the changes won't have any effect)

TODO
====
- should ``custom_settings`` be a static method instead of a class method?