.. _topics-settings:

========
Settings
========

The Scrapy settings allow you to customize the behaviour of all Scrapy
components, including the core, extensions, pipelines and spiders themselves.

The settings infrastructure provides a global namespace of key-value mappings
that the code can use to pull configuration values from. The settings can be
populated through different mechanisms, which are described below.

The settings are also the mechanism for selecting the currently active Scrapy
project (in case you have many).

For a list of available built-in settings see: :ref:`topics-settings-ref`.

.. _topics-settings-module-envvar:

Designating the settings
========================

When you use Scrapy, you have to tell it which settings you're using. You can
do this by using an environment variable, ``SCRAPY_SETTINGS_MODULE``.

The value of ``SCRAPY_SETTINGS_MODULE`` should be in Python path syntax, e.g.
``myproject.settings``. Note that the settings module should be on the
Python `import search path`_.

.. _import search path: https://docs.python.org/2/tutorial/modules.html#the-module-search-path
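
.. highlight:: sh

For example, you can designate the settings for the current shell session
before invoking Scrapy (a minimal sketch, using Bash syntax and the
illustrative ``myproject.settings`` path from above)::

    export SCRAPY_SETTINGS_MODULE=myproject.settings
    scrapy crawl myspider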
|
2009-08-29 18:20:13 -03:00
|
|
|
|
|
|
|
Populating the settings
|
|
|
|
=======================
|
|
|
|
|
2008-12-30 13:28:36 +00:00
|
|
|
Settings can be populated using different mechanisms, each of which having a
|
|
|
|
different precedence. Here is the list of them in decreasing order of
|
|
|
|
precedence:
|
|
|
|
|
2014-06-10 10:59:48 -03:00
|
|
|
1. Command line options (most precedence)
|
2014-08-13 01:42:34 -03:00
|
|
|
2. Settings per-spider
|
|
|
|
3. Project settings module
|
|
|
|
4. Default settings per-command
|
|
|
|
5. Default global settings (less precedence)
|
2008-12-30 13:28:36 +00:00
|
|
|
|
2014-06-10 10:59:48 -03:00
|
|
|
The population of these settings sources is taken care of internally, but a
|
|
|
|
manual handling is possible using API calls. See the
|
|
|
|
:ref:`topics-api-settings` topic for reference.
|
2008-12-30 13:28:36 +00:00
|
|
|
|
2014-06-10 10:59:48 -03:00
|
|
|
These mechanisms are described in more detail below.

1. Command line options
-----------------------

Arguments provided by the command line take the most precedence,
overriding any other options. You can explicitly override one (or more)
settings using the ``-s`` (or ``--set``) command line option.

.. highlight:: sh

Example::

    scrapy crawl myspider -s LOG_FILE=scrapy.log

2. Settings per-spider
----------------------

Spiders (see the :ref:`topics-spiders` chapter for reference) can define their
own settings that will take precedence and override the project ones. They can
do so by setting their :attr:`~scrapy.spiders.Spider.custom_settings` attribute::

    class MySpider(scrapy.Spider):
        name = 'myspider'

        custom_settings = {
            'SOME_SETTING': 'some value',
        }

3. Project settings module
--------------------------

The project settings module is the standard configuration file for your Scrapy
project, and it's where most of your custom settings will be populated. For a
standard Scrapy project, this means you'll be adding or changing the settings
in the ``settings.py`` file created for your project.

4. Default settings per-command
-------------------------------

Each :doc:`Scrapy tool </topics/commands>` command can have its own default
settings, which override the global default settings. Those custom command
settings are specified in the ``default_settings`` attribute of the command
class.

5. Default global settings
--------------------------

The global defaults are located in the ``scrapy.settings.default_settings``
module and documented in the :ref:`topics-settings-ref` section.

How to access settings
======================

.. highlight:: python

In a spider, the settings are available through ``self.settings``::

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['http://example.com']

        def parse(self, response):
            print("Existing settings: %s" % self.settings.attributes.keys())

.. note::

    The ``settings`` attribute is set in the base Spider class after the spider
    is initialized. If you want to use the settings before the initialization
    (e.g., in your spider's ``__init__()`` method), you'll need to override the
    :meth:`~scrapy.spiders.Spider.from_crawler` method.
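
    For example (a minimal sketch; the ``user_agent`` argument is a
    hypothetical illustration)::

        import scrapy

        class MySpider(scrapy.Spider):
            name = 'myspider'

            def __init__(self, user_agent=None, *args, **kwargs):
                super(MySpider, self).__init__(*args, **kwargs)
                # the setting is already available at __init__ time
                self.user_agent = user_agent

            @classmethod
            def from_crawler(cls, crawler, *args, **kwargs):
                # pull the setting from the crawler and hand it to __init__
                kwargs['user_agent'] = crawler.settings.get('USER_AGENT')
                return super(MySpider, cls).from_crawler(crawler, *args, **kwargs)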

Settings can be accessed through the :attr:`scrapy.crawler.Crawler.settings`
attribute of the Crawler that is passed to the ``from_crawler`` method in
extensions, middlewares and item pipelines::

    class MyExtension(object):
        def __init__(self, log_is_enabled=False):
            if log_is_enabled:
                print("log is enabled!")

        @classmethod
        def from_crawler(cls, crawler):
            settings = crawler.settings
            return cls(settings.getbool('LOG_ENABLED'))

The settings object can be used like a dict (e.g.,
``settings['LOG_ENABLED']``), but it's usually preferred to extract the setting
in the format you need it in, to avoid type errors, using one of the methods
provided by the :class:`~scrapy.settings.Settings` API.
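
For example (a minimal sketch; the values shown in the comments are just the
built-in defaults documented below)::

    settings.getbool('LOG_ENABLED')         # True
    settings.getint('CONCURRENT_REQUESTS')  # 16
    settings.getfloat('DOWNLOAD_DELAY')     # 0.0
    settings.getlist('SPIDER_MODULES')      # []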

Rationale for setting names
===========================

Setting names are usually prefixed with the component that they configure. For
example, proper setting names for a fictional robots.txt extension would be
``ROBOTSTXT_ENABLED``, ``ROBOTSTXT_OBEY``, ``ROBOTSTXT_CACHEDIR``, etc.

.. _topics-settings-ref:

Built-in settings reference
===========================

Here's a list of all available Scrapy settings, in alphabetical order, along
with their default values and the scope where they apply.

The scope, where available, shows where the setting is being used, if it's tied
to any particular component. In that case the module of that component will be
shown, typically an extension, middleware or pipeline. It also means that the
component must be enabled in order for the setting to have any effect.

.. setting:: AWS_ACCESS_KEY_ID

AWS_ACCESS_KEY_ID
-----------------

Default: ``None``

The AWS access key used by code that requires access to `Amazon Web services`_,
such as the :ref:`S3 feed storage backend <topics-feed-storage-s3>`.

.. setting:: AWS_SECRET_ACCESS_KEY

AWS_SECRET_ACCESS_KEY
---------------------

Default: ``None``

The AWS secret key used by code that requires access to `Amazon Web services`_,
such as the :ref:`S3 feed storage backend <topics-feed-storage-s3>`.

.. setting:: BOT_NAME

BOT_NAME
--------

Default: ``'scrapybot'``

The name of the bot implemented by this Scrapy project (also known as the
project name). This will be used to construct the User-Agent by default, and
also for logging.

It's automatically populated with your project name when you create your
project with the :command:`startproject` command.

.. setting:: CONCURRENT_ITEMS

CONCURRENT_ITEMS
----------------

Default: ``100``

Maximum number of concurrent items (per response) to process in parallel in the
Item Processor (also known as the :ref:`Item Pipeline <topics-item-pipeline>`).

.. setting:: CONCURRENT_REQUESTS

CONCURRENT_REQUESTS
-------------------

Default: ``16``

The maximum number of concurrent (i.e. simultaneous) requests that will be
performed by the Scrapy downloader.

.. setting:: CONCURRENT_REQUESTS_PER_DOMAIN

CONCURRENT_REQUESTS_PER_DOMAIN
------------------------------

Default: ``8``

The maximum number of concurrent (i.e. simultaneous) requests that will be
performed to any single domain.

See also: :ref:`topics-autothrottle` and its
:setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` option.

.. setting:: CONCURRENT_REQUESTS_PER_IP

CONCURRENT_REQUESTS_PER_IP
--------------------------

Default: ``0``

The maximum number of concurrent (i.e. simultaneous) requests that will be
performed to any single IP. If non-zero, the
:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` setting is ignored, and this one is
used instead. In other words, concurrency limits will be applied per IP, not
per domain.

This setting also affects :setting:`DOWNLOAD_DELAY` and
:ref:`topics-autothrottle`: if :setting:`CONCURRENT_REQUESTS_PER_IP`
is non-zero, download delay is enforced per IP, not per domain.

.. setting:: DEFAULT_ITEM_CLASS

DEFAULT_ITEM_CLASS
------------------

Default: ``'scrapy.item.Item'``

The default class that will be used for instantiating items in :ref:`the
Scrapy shell <topics-shell>`.

.. setting:: DEFAULT_REQUEST_HEADERS

DEFAULT_REQUEST_HEADERS
-----------------------

Default::

    {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
    }

The default headers used for Scrapy HTTP Requests. They're populated in the
:class:`~scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware`.
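
To change them, assign a new dict in your project's ``settings.py`` (a minimal
sketch; the ``Accept-Language`` value shown is an arbitrary illustration)::

    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-gb',
    }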

.. setting:: DEPTH_LIMIT

DEPTH_LIMIT
-----------

Default: ``0``

Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

The maximum depth that will be allowed to crawl for any site. If zero, no limit
will be imposed.

.. setting:: DEPTH_PRIORITY

DEPTH_PRIORITY
--------------

Default: ``0``

Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

An integer that is used to adjust the request priority based on its depth:

- if zero (default), no priority adjustment is made from depth
- **a positive value will decrease the priority, i.e. higher-depth
  requests will be processed later**; this is commonly used when doing
  breadth-first crawls (BFO)
- a negative value will increase the priority, i.e. higher-depth requests
  will be processed sooner (DFO)

See also: :ref:`faq-bfo-dfo` about tuning Scrapy for BFO or DFO.

.. note::

    This setting adjusts priority **in the opposite way** compared to
    other priority settings :setting:`REDIRECT_PRIORITY_ADJUST`
    and :setting:`RETRY_PRIORITY_ADJUST`.

.. setting:: DEPTH_STATS

DEPTH_STATS
-----------

Default: ``True``

Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

Whether to collect maximum depth stats.

.. setting:: DEPTH_STATS_VERBOSE

DEPTH_STATS_VERBOSE
-------------------

Default: ``False``

Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

Whether to collect verbose depth stats. If this is enabled, the number of
requests for each depth is collected in the stats.

.. setting:: DNSCACHE_ENABLED

DNSCACHE_ENABLED
----------------

Default: ``True``

Whether to enable the DNS in-memory cache.

.. setting:: DNSCACHE_SIZE

DNSCACHE_SIZE
-------------

Default: ``10000``

The size of the DNS in-memory cache.

.. setting:: DNS_TIMEOUT

DNS_TIMEOUT
-----------

Default: ``60``

Timeout for processing DNS queries, in seconds. Floats are supported.

.. setting:: DOWNLOADER

DOWNLOADER
----------

Default: ``'scrapy.core.downloader.Downloader'``

The downloader to use for crawling.

.. setting:: DOWNLOADER_MIDDLEWARES

DOWNLOADER_MIDDLEWARES
----------------------

Default: ``{}``

A dict containing the downloader middlewares enabled in your project, and their
orders. For more info see :ref:`topics-downloader-middleware-setting`.
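
For example, to enable a middleware of your own (a minimal sketch; the
``myproject.middlewares.CustomDownloaderMiddleware`` path and the ``543``
order value are hypothetical)::

    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.CustomDownloaderMiddleware': 543,
    }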

.. setting:: DOWNLOADER_MIDDLEWARES_BASE

DOWNLOADER_MIDDLEWARES_BASE
---------------------------

Default::

    {
        'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
        'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500,
        'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
        'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
        'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
        'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
        'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
        'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
        'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    }

A dict containing the downloader middlewares enabled by default in Scrapy. Low
orders are closer to the engine, high orders are closer to the downloader. You
should never modify this setting in your project; modify
:setting:`DOWNLOADER_MIDDLEWARES` instead. For more info see
:ref:`topics-downloader-middleware-setting`.

.. setting:: DOWNLOADER_STATS

DOWNLOADER_STATS
----------------

Default: ``True``

Whether to enable downloader stats collection.

.. setting:: DOWNLOAD_DELAY

DOWNLOAD_DELAY
--------------

Default: ``0``

The amount of time (in secs) that the downloader should wait before downloading
consecutive pages from the same website. This can be used to throttle the
crawling speed to avoid hitting servers too hard. Decimal numbers are
supported. Example::

    DOWNLOAD_DELAY = 0.25    # 250 ms of delay

This setting is also affected by the :setting:`RANDOMIZE_DOWNLOAD_DELAY`
setting (which is enabled by default). By default, Scrapy doesn't wait a fixed
amount of time between requests, but uses a random interval between 0.5 and 1.5
* :setting:`DOWNLOAD_DELAY`.
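
For example, with ``DOWNLOAD_DELAY = 2``, the effective delay between requests
is a random value between 1 and 3 seconds.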

When :setting:`CONCURRENT_REQUESTS_PER_IP` is non-zero, delays are enforced
per IP address instead of per domain.

You can also change this setting per spider by setting the ``download_delay``
spider attribute, as shown below.
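
For example, to slow down a single spider without touching the project-wide
setting (a minimal sketch)::

    class MySpider(scrapy.Spider):
        name = 'myspider'
        download_delay = 1.5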

.. setting:: DOWNLOAD_HANDLERS

DOWNLOAD_HANDLERS
-----------------

Default: ``{}``

A dict containing the request download handlers enabled in your project.
See :setting:`DOWNLOAD_HANDLERS_BASE` for an example of the format.

.. setting:: DOWNLOAD_HANDLERS_BASE

DOWNLOAD_HANDLERS_BASE
----------------------

Default::

    {
        'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
        'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
        'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
        's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
        'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
    }

A dict containing the request download handlers enabled by default in Scrapy.
You should never modify this setting in your project; modify
:setting:`DOWNLOAD_HANDLERS` instead.

You can disable any of these download handlers by assigning ``None`` to their
URI scheme in :setting:`DOWNLOAD_HANDLERS`. E.g., to disable the built-in FTP
handler (without replacement), place this in your ``settings.py``::

    DOWNLOAD_HANDLERS = {
        'ftp': None,
    }

.. setting:: DOWNLOAD_TIMEOUT

DOWNLOAD_TIMEOUT
----------------

Default: ``180``

The amount of time (in secs) that the downloader will wait before timing out.

.. note::

    This timeout can be set per spider using the :attr:`download_timeout`
    spider attribute and per-request using the :reqmeta:`download_timeout`
    Request.meta key.

.. setting:: DOWNLOAD_MAXSIZE

DOWNLOAD_MAXSIZE
----------------

Default: ``1073741824`` (1024MB)

The maximum response size (in bytes) that the downloader will download.

If you want to disable it, set it to ``0``.

.. reqmeta:: download_maxsize

.. note::

    This size can be set per spider using the :attr:`download_maxsize`
    spider attribute and per-request using the :reqmeta:`download_maxsize`
    Request.meta key.

    This feature needs Twisted >= 11.1.

.. setting:: DOWNLOAD_WARNSIZE

DOWNLOAD_WARNSIZE
-----------------

Default: ``33554432`` (32MB)

The response size (in bytes) above which the downloader will start to warn.

If you want to disable it, set it to ``0``.

.. note::

    This size can be set per spider using the :attr:`download_warnsize`
    spider attribute and per-request using the :reqmeta:`download_warnsize`
    Request.meta key.

    This feature needs Twisted >= 11.1.

.. setting:: DUPEFILTER_CLASS

DUPEFILTER_CLASS
----------------

Default: ``'scrapy.dupefilters.RFPDupeFilter'``

The class used to detect and filter duplicate requests.

The default (``RFPDupeFilter``) filters based on the request fingerprint using
the ``scrapy.utils.request.request_fingerprint`` function. In order to change
the way duplicates are checked you could subclass ``RFPDupeFilter`` and
override its ``request_fingerprint`` method. This method should accept a
Scrapy :class:`~scrapy.http.Request` object and return its fingerprint
(a string).
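
For example (a minimal sketch; fingerprinting only by URL here is an arbitrary
illustration, not a recommendation)::

    from scrapy.dupefilters import RFPDupeFilter

    class URLOnlyDupeFilter(RFPDupeFilter):
        """Consider two requests duplicates whenever their URLs match."""

        def request_fingerprint(self, request):
            return request.url

You would then point :setting:`DUPEFILTER_CLASS` at it, e.g.
``DUPEFILTER_CLASS = 'myproject.dupefilters.URLOnlyDupeFilter'`` (the module
path is hypothetical).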

.. setting:: DUPEFILTER_DEBUG

DUPEFILTER_DEBUG
----------------

Default: ``False``

By default, ``RFPDupeFilter`` only logs the first duplicate request.
Setting :setting:`DUPEFILTER_DEBUG` to ``True`` will make it log all duplicate requests.

.. setting:: EDITOR

EDITOR
------

Default: *depends on the environment*

The editor to use for editing spiders with the :command:`edit` command. It
defaults to the ``EDITOR`` environment variable, if set. Otherwise, it defaults
to ``vi`` (on Unix systems) or the IDLE editor (on Windows).

.. setting:: EXTENSIONS

EXTENSIONS
----------

Default: ``{}``

A dict containing the extensions enabled in your project, and their orders.
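
For example, to enable an extension of your own (a minimal sketch; the
``myproject.extensions.SpiderOpenCloseLogging`` path is hypothetical, and the
``500`` order is an arbitrary value)::

    EXTENSIONS = {
        'myproject.extensions.SpiderOpenCloseLogging': 500,
    }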

.. setting:: EXTENSIONS_BASE

EXTENSIONS_BASE
---------------

Default::

    {
        'scrapy.extensions.corestats.CoreStats': 0,
        'scrapy.extensions.telnet.TelnetConsole': 0,
        'scrapy.extensions.memusage.MemoryUsage': 0,
        'scrapy.extensions.memdebug.MemoryDebugger': 0,
        'scrapy.extensions.closespider.CloseSpider': 0,
        'scrapy.extensions.feedexport.FeedExporter': 0,
        'scrapy.extensions.logstats.LogStats': 0,
        'scrapy.extensions.spiderstate.SpiderState': 0,
        'scrapy.extensions.throttle.AutoThrottle': 0,
    }

A dict containing the extensions available by default in Scrapy, and their
orders. This setting contains all stable built-in extensions. Keep in mind that
some of them need to be enabled through a setting.

For more information see the :ref:`extensions user guide <topics-extensions>`
and the :ref:`list of available extensions <topics-extensions-ref>`.

.. setting:: FILES_STORE_S3_ACL

FILES_STORE_S3_ACL
------------------

Default: ``'private'``

The S3-specific access control list (ACL) applied to files uploaded to the S3
files store.

.. setting:: ITEM_PIPELINES

ITEM_PIPELINES
--------------

Default: ``{}``

A dict containing the item pipelines to use, and their orders. Order values are
arbitrary, but it is customary to define them in the 0-1000 range. Lower orders
process before higher orders.

Example::

    ITEM_PIPELINES = {
        'mybot.pipelines.validate.ValidateMyItem': 300,
        'mybot.pipelines.validate.StoreMyItem': 800,
    }

.. setting:: ITEM_PIPELINES_BASE

ITEM_PIPELINES_BASE
-------------------

Default: ``{}``

A dict containing the pipelines enabled by default in Scrapy. You should never
modify this setting in your project; modify :setting:`ITEM_PIPELINES` instead.

.. setting:: LOG_ENABLED

LOG_ENABLED
-----------

Default: ``True``

Whether to enable logging.

.. setting:: LOG_ENCODING

LOG_ENCODING
------------

Default: ``'utf-8'``

The encoding to use for logging.

.. setting:: LOG_FILE

LOG_FILE
--------

Default: ``None``

File name to use for logging output. If ``None``, standard error will be used.

.. setting:: LOG_FORMAT

LOG_FORMAT
----------

Default: ``'%(asctime)s [%(name)s] %(levelname)s: %(message)s'``

String for formatting log messages. Refer to the
`Python logging documentation`_ for the whole list of available placeholders.

.. _Python logging documentation: https://docs.python.org/2/library/logging.html#logrecord-attributes

.. setting:: LOG_DATEFORMAT

LOG_DATEFORMAT
--------------

Default: ``'%Y-%m-%d %H:%M:%S'``

String for formatting date/time, expansion of the ``%(asctime)s`` placeholder
in :setting:`LOG_FORMAT`. Refer to the `Python datetime documentation`_ for
the whole list of available directives.

.. _Python datetime documentation: https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
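
For example, to get shorter log lines (a minimal sketch; both format strings
are arbitrary illustrations)::

    LOG_FORMAT = '%(asctime)s %(levelname)s: %(message)s'
    LOG_DATEFORMAT = '%H:%M:%S'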

.. setting:: LOG_LEVEL

LOG_LEVEL
---------

Default: ``'DEBUG'``

Minimum level to log. Available levels are: CRITICAL, ERROR, WARNING,
INFO, DEBUG. For more info see :ref:`topics-logging`.

.. setting:: LOG_STDOUT

LOG_STDOUT
----------

Default: ``False``

If ``True``, all standard output (and error) of your process will be redirected
to the log. For example, if you ``print('hello')`` it will appear in the Scrapy
log.

.. setting:: MEMDEBUG_ENABLED

MEMDEBUG_ENABLED
----------------

Default: ``False``

Whether to enable memory debugging.

.. setting:: MEMDEBUG_NOTIFY

MEMDEBUG_NOTIFY
---------------

Default: ``[]``

When memory debugging is enabled a memory report will be sent to the specified
addresses if this setting is not empty, otherwise the report will be written to
the log.

Example::

    MEMDEBUG_NOTIFY = ['user@example.com']

.. setting:: MEMUSAGE_ENABLED

MEMUSAGE_ENABLED
----------------

Default: ``False``

Scope: ``scrapy.extensions.memusage``

Whether to enable the memory usage extension, which will shut down the Scrapy
process when it exceeds a memory limit, and also notify by email when that
has happened.

See :ref:`topics-extensions-ref-memusage`.

.. setting:: MEMUSAGE_LIMIT_MB

MEMUSAGE_LIMIT_MB
-----------------

Default: ``0``

Scope: ``scrapy.extensions.memusage``

The maximum amount of memory to allow (in megabytes) before shutting down
Scrapy (if :setting:`MEMUSAGE_ENABLED` is ``True``). If zero, no check will be
performed.

See :ref:`topics-extensions-ref-memusage`.
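
For example, to abort the crawl at 2048MB and get a warning email at 1024MB
(a minimal sketch; the limits and the address are arbitrary illustrations, and
:setting:`MEMUSAGE_WARNING_MB` and :setting:`MEMUSAGE_NOTIFY_MAIL` are
documented below)::

    MEMUSAGE_ENABLED = True
    MEMUSAGE_LIMIT_MB = 2048
    MEMUSAGE_WARNING_MB = 1024
    MEMUSAGE_NOTIFY_MAIL = ['user@example.com']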

.. setting:: MEMUSAGE_CHECK_INTERVAL_SECONDS

MEMUSAGE_CHECK_INTERVAL_SECONDS
-------------------------------

.. versionadded:: 1.1

Default: ``60.0``

Scope: ``scrapy.extensions.memusage``

The :ref:`Memory usage extension <topics-extensions-ref-memusage>`
checks the current memory usage, versus the limits set by
:setting:`MEMUSAGE_LIMIT_MB` and :setting:`MEMUSAGE_WARNING_MB`,
at fixed time intervals.

This sets the length of these intervals, in seconds.

See :ref:`topics-extensions-ref-memusage`.

.. setting:: MEMUSAGE_NOTIFY_MAIL

MEMUSAGE_NOTIFY_MAIL
--------------------

Default: ``False``

Scope: ``scrapy.extensions.memusage``

A list of emails to notify if the memory limit has been reached.

Example::

    MEMUSAGE_NOTIFY_MAIL = ['user@example.com']

See :ref:`topics-extensions-ref-memusage`.

.. setting:: MEMUSAGE_REPORT

MEMUSAGE_REPORT
---------------

Default: ``False``

Scope: ``scrapy.extensions.memusage``

Whether to send a memory usage report after each spider has been closed.

See :ref:`topics-extensions-ref-memusage`.

.. setting:: MEMUSAGE_WARNING_MB

MEMUSAGE_WARNING_MB
-------------------

Default: ``0``

Scope: ``scrapy.extensions.memusage``

The maximum amount of memory to allow (in megabytes) before sending a warning
email notifying about it. If zero, no warning will be produced.

.. setting:: NEWSPIDER_MODULE

NEWSPIDER_MODULE
----------------

Default: ``''``

The module where new spiders will be created by the :command:`genspider`
command.

Example::

    NEWSPIDER_MODULE = 'mybot.spiders_dev'

.. setting:: RANDOMIZE_DOWNLOAD_DELAY

RANDOMIZE_DOWNLOAD_DELAY
------------------------

Default: ``True``

If enabled, Scrapy will wait a random amount of time (between 0.5 and 1.5
* :setting:`DOWNLOAD_DELAY`) while fetching requests from the same
website.

This randomization decreases the chance of the crawler being detected (and
subsequently blocked) by sites which analyze requests looking for statistically
significant similarities in the time between their requests.

The randomization policy is the same one used by the `wget`_ ``--random-wait``
option.

If :setting:`DOWNLOAD_DELAY` is zero (default), this option has no effect.

.. _wget: http://www.gnu.org/software/wget/manual/wget.html

.. setting:: REACTOR_THREADPOOL_MAXSIZE

REACTOR_THREADPOOL_MAXSIZE
--------------------------

Default: ``10``

The maximum limit for the Twisted Reactor thread pool size. This is a common
multi-purpose thread pool used by various Scrapy components: the threaded
DNS resolver, BlockingFeedStorage and S3FilesStore, to name a few. Increase
this value if you're experiencing problems with insufficient blocking IO.

.. setting:: REDIRECT_MAX_TIMES

REDIRECT_MAX_TIMES
------------------

Default: ``20``

Defines the maximum number of times a request can be redirected. After this
maximum, the request's response is returned as is. We use the Firefox default
value for the same task.

.. setting:: REDIRECT_PRIORITY_ADJUST

REDIRECT_PRIORITY_ADJUST
------------------------

Default: ``+2``

Scope: ``scrapy.downloadermiddlewares.redirect.RedirectMiddleware``

Adjust redirect request priority relative to original request:

- **a positive priority adjust (default) means higher priority.**
- a negative priority adjust means lower priority.

.. setting:: RETRY_PRIORITY_ADJUST

RETRY_PRIORITY_ADJUST
---------------------

Default: ``-1``

Scope: ``scrapy.downloadermiddlewares.retry.RetryMiddleware``

Adjust retry request priority relative to original request:

- a positive priority adjust means higher priority.
- **a negative priority adjust (default) means lower priority.**

.. setting:: ROBOTSTXT_OBEY

ROBOTSTXT_OBEY
--------------

Default: ``False``

Scope: ``scrapy.downloadermiddlewares.robotstxt``

If enabled, Scrapy will respect robots.txt policies. For more information see
:ref:`topics-dlmw-robots`.

.. note::

    While the default value is ``False`` for historical reasons,
    this option is enabled by default in the ``settings.py`` file generated
    by the ``scrapy startproject`` command.

.. setting:: SCHEDULER

SCHEDULER
---------

Default: ``'scrapy.core.scheduler.Scheduler'``

The scheduler to use for crawling.

.. setting:: SPIDER_CONTRACTS

SPIDER_CONTRACTS
----------------

Default: ``{}``

A dict containing the spider contracts enabled in your project, used for
testing spiders. For more info see :ref:`topics-contracts`.

.. setting:: SPIDER_CONTRACTS_BASE

SPIDER_CONTRACTS_BASE
---------------------

Default::

    {
        'scrapy.contracts.default.UrlContract': 1,
        'scrapy.contracts.default.ReturnsContract': 2,
        'scrapy.contracts.default.ScrapesContract': 3,
    }

A dict containing the Scrapy contracts enabled by default in Scrapy. You should
never modify this setting in your project; modify :setting:`SPIDER_CONTRACTS`
instead. For more info see :ref:`topics-contracts`.

You can disable any of these contracts by assigning ``None`` to their class
path in :setting:`SPIDER_CONTRACTS`. E.g., to disable the built-in
``ScrapesContract``, place this in your ``settings.py``::

    SPIDER_CONTRACTS = {
        'scrapy.contracts.default.ScrapesContract': None,
    }

.. setting:: SPIDER_LOADER_CLASS

SPIDER_LOADER_CLASS
-------------------

Default: ``'scrapy.spiderloader.SpiderLoader'``

The class that will be used for loading spiders, which must implement the
:ref:`topics-api-spiderloader`.

.. setting:: SPIDER_MIDDLEWARES

SPIDER_MIDDLEWARES
------------------

Default: ``{}``

A dict containing the spider middlewares enabled in your project, and their
orders. For more info see :ref:`topics-spider-middleware-setting`.

.. setting:: SPIDER_MIDDLEWARES_BASE

SPIDER_MIDDLEWARES_BASE
-----------------------

Default::

    {
        'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
        'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
        'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
        'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    }

A dict containing the spider middlewares enabled by default in Scrapy, and
their orders. Low orders are closer to the engine, high orders are closer to
the spider. For more info see :ref:`topics-spider-middleware-setting`.

.. setting:: SPIDER_MODULES

SPIDER_MODULES
--------------

Default: ``[]``

A list of modules where Scrapy will look for spiders.

Example::

    SPIDER_MODULES = ['mybot.spiders_prod', 'mybot.spiders_dev']

.. setting:: STATS_CLASS

STATS_CLASS
-----------

Default: ``'scrapy.statscollectors.MemoryStatsCollector'``

The class to use for collecting stats, which must implement the
:ref:`topics-api-stats`.

.. setting:: STATS_DUMP

STATS_DUMP
----------

Default: ``True``

Dump the :ref:`Scrapy stats <topics-stats>` (to the Scrapy log) once the spider
finishes.

For more info see: :ref:`topics-stats`.

.. setting:: STATSMAILER_RCPTS

STATSMAILER_RCPTS
-----------------

Default: ``[]`` (empty list)

Send Scrapy stats after spiders finish scraping. See
:class:`~scrapy.extensions.statsmailer.StatsMailer` for more info.

.. setting:: TELNETCONSOLE_ENABLED

TELNETCONSOLE_ENABLED
---------------------

Default: ``True``

A boolean which specifies if the :ref:`telnet console <topics-telnetconsole>`
will be enabled (provided its extension is also enabled).

.. setting:: TELNETCONSOLE_PORT

TELNETCONSOLE_PORT
------------------

Default: ``[6023, 6073]``

The port range to use for the telnet console. If set to ``None`` or ``0``, a
dynamically assigned port is used. For more info see
:ref:`topics-telnetconsole`.

.. setting:: TEMPLATES_DIR

TEMPLATES_DIR
-------------

Default: ``templates`` dir inside scrapy module

The directory where to look for templates when creating new projects with the
:command:`startproject` command and new spiders with the :command:`genspider`
command.

The project name must not conflict with the name of custom files or directories
in the ``project`` subdirectory.

.. setting:: URLLENGTH_LIMIT

URLLENGTH_LIMIT
---------------

Default: ``2083``

Scope: ``spidermiddlewares.urllength``

The maximum URL length to allow for crawled URLs. For more information about
the default value for this setting see: http://www.boutell.com/newfaq/misc/urllength.html

.. setting:: USER_AGENT

USER_AGENT
----------

Default: ``"Scrapy/VERSION (+http://scrapy.org)"``

The default User-Agent to use when crawling, unless overridden.

Settings documented elsewhere
------------------------------

The following settings are documented elsewhere; please check each specific
case to see how to enable and use them.

.. settingslist::

.. _Amazon web services: http://aws.amazon.com/
.. _breadth-first order: http://en.wikipedia.org/wiki/Breadth-first_search
.. _depth-first order: http://en.wikipedia.org/wiki/Depth-first_search