.. _topics-stats:

===============
Stats Collector
===============

Overview
========

Scrapy provides a convenient service for collecting stats in the form of
key/values, both globally and per spider/domain. It's called the Stats
Collector, and it's a singleton which can be imported and used quickly, as
illustrated by the examples in the :ref:`topics-stats-usecases` section below.

The stats collection is enabled by default but can be disabled through the
:setting:`STATS_ENABLED` setting.

However, the Stats Collector is always available, so you can always import it
in your module and use its API (to increment or set new stat keys), regardless
of whether the stats collection is enabled or not. If it's disabled, the API
will still work but it won't collect anything. This is aimed at simplifying the
stats collector usage: you should spend no more than one line of code for
collecting stats in your spider, Scrapy extension, or whatever code you're
using the Stats Collector from.

Another feature of the Stats Collector is that it's very efficient (when
enabled) and extremely efficient (almost unnoticeable) when disabled.

The Stats Collector keeps one stats table per open spider/domain and one global
stats table. You can't set or get stats from a closed domain, but the
domain-specific stats table is automatically opened when the spider is opened,
and closed when the spider is closed.

.. _topics-stats-usecases:

Common Stats Collector uses
===========================

Import the stats collector::

    from scrapy.stats import stats

Set global stat value::

    stats.set_value('hostname', socket.gethostname())

Increment global stat value::

    stats.inc_value('spiders_crawled')

Set global stat value only if greater than previous::

    stats.max_value('max_items_scraped', value)

Set global stat value only if lower than previous::

    stats.min_value('min_free_memory_percent', value)

Get global stat value::

    >>> stats.get_value('spiders_crawled')
    8

Get all global stats::

    >>> stats.get_stats()
    {'hostname': 'localhost', 'spiders_crawled': 8}

Set domain/spider specific stat value (domains must be opened first, but this
is done automatically by the Stats Collector when the :signal:`domain_open`
signal is received)::

    stats.set_value('start_time', datetime.now(), domain='example.com')

Increment domain-specific stat value::

    stats.inc_value('pages_crawled', domain='example.com')

Set domain-specific stat value only if greater than previous::

    stats.max_value('max_items_scraped', value, domain='example.com')

Set domain-specific stat value only if lower than previous::

    stats.min_value('min_free_memory_percent', value, domain='example.com')

Get domain-specific stat value::

    >>> stats.get_value('pages_crawled', domain='example.com')
    1238

Get all stats from a given domain::

    >>> stats.get_stats(domain='example.com')
    {'pages_crawled': 1238, 'start_time': datetime.datetime(2009, 7, 14, 21, 47, 28, 977139)}
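
Putting a few of these calls together, here is a short, purely illustrative
sketch of how code running inside a spider or extension might record stats.
The helper function, its arguments and the ``max_response_size`` and
``last_response_time`` keys are made up for this example; the calls are
harmless no-ops when stats collection is disabled::

    import socket
    from datetime import datetime

    from scrapy.stats import stats

    def record_response_stats(response, domain):
        # one line of code per stat, as described in the overview above
        stats.set_value('hostname', socket.gethostname())
        stats.set_value('last_response_time', datetime.now(), domain=domain)
        stats.inc_value('pages_crawled', domain=domain)
        stats.max_value('max_response_size', len(response.body), domain=domain)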

.. _topics-stats-ref:

Stats Collector API
===================

There are several Stats Collectors available under the
:mod:`scrapy.stats.collector` module and they all implement the Stats
Collector API defined by the :class:`~scrapy.stats.collector.StatsCollector`
class (which they all inherit from).

.. module:: scrapy.stats.collector
   :synopsis: Basic Stats Collectors

.. class:: StatsCollector

    .. method:: get_value(key, default=None, domain=None)

        Return the value for the given stats key or default if it doesn't
        exist. If domain is ``None`` the global stats table is consulted,
        otherwise the domain-specific one is. If the domain is not yet opened
        a ``KeyError`` exception is raised.

    .. method:: get_stats(domain=None)

        Get all stats from the given domain/spider (if domain is given) or all
        global stats otherwise, as a dict. If the domain is not opened a
        ``KeyError`` is raised.

    .. method:: set_value(key, value, domain=None)

        Set the given value for the given stats key on the global stats (if
        domain is not given) or the domain-specific stats (if domain is given),
        which must be opened or a ``KeyError`` will be raised.

    .. method:: set_stats(stats, domain=None)

        Set the given stats (as a dict) for the given domain. If the domain is
        not opened a ``KeyError`` will be raised.

    .. method:: inc_value(key, count=1, start=0, domain=None)

        Increment the value of the given stats key by the given count,
        assuming the given start value when the key is not yet set. If domain
        is not given the global stats table is used, otherwise the
        domain-specific stats table is used, which must be opened or a
        ``KeyError`` will be raised.

    .. method:: max_value(key, value, domain=None)

        Set the given value for the given key only if the current value for
        the same key is lower than value. If there is no current value for the
        given key, the value is always set. If domain is not given the global
        stats table is used, otherwise the domain-specific stats table is
        used, which must be opened or a ``KeyError`` will be raised.

    .. method:: min_value(key, value, domain=None)

        Set the given value for the given key only if the current value for
        the same key is greater than value. If there is no current value for
        the given key, the value is always set. If domain is not given the
        global stats table is used, otherwise the domain-specific stats table
        is used, which must be opened or a ``KeyError`` will be raised.

    .. method:: clear_stats(domain=None)

        Clear all global stats (if domain is not given) or all domain-specific
        stats if domain is given, in which case it must be opened or a
        ``KeyError`` will be raised.

    .. method:: list_domains()

        Return a list of all opened domains.

    .. method:: open_domain(domain)

        Open the given domain for stats collection. This method must be called
        prior to working with any stats specific to that domain, but it's
        called automatically when the :signal:`domain_open` signal is received.

    .. method:: close_domain(domain)

        Close the given domain. After this is called, no more specific stats
        for this domain can be accessed. This method is called automatically on
        the :signal:`domain_closed` signal.
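
For illustration, here is a minimal, hypothetical sequence of calls exercising
the API above, managing the domain lifecycle by hand (something Scrapy
normally does for you through the :signal:`domain_open` and
:signal:`domain_closed` signals); the stat keys are made up::

    from datetime import datetime

    from scrapy.stats import stats

    stats.open_domain('example.com')           # normally done by Scrapy
    stats.set_value('start_time', datetime.now(), domain='example.com')
    stats.inc_value('pages_crawled', domain='example.com')
    stats.inc_value('pages_crawled', count=3, domain='example.com')
    stats.get_value('pages_crawled', domain='example.com')  # returns 4
    stats.get_stats(domain='example.com')      # dict with both keys
    stats.close_domain('example.com')          # normally done by Scrapy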

Available Stats Collectors
==========================

Besides the basic :class:`StatsCollector` there are other Stats Collectors
available in Scrapy which extend the basic Stats Collector. You can select
which Stats Collector to use through the :setting:`STATS_CLASS` setting. The
default Stats Collector used is the :class:`MemoryStatsCollector`.

When stats are disabled (through the :setting:`STATS_ENABLED` setting) the
:setting:`STATS_CLASS` setting is ignored and the :class:`DummyStatsCollector`
is used.
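
As an illustration, the selection could look like this in your project
settings; the values shown are a sketch that assumes the defaults described in
this page, with the collector given as an import path string::

    # settings.py -- illustrative values mirroring the documented defaults
    STATS_ENABLED = True
    STATS_CLASS = 'scrapy.stats.collector.MemoryStatsCollector'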

MemoryStatsCollector
--------------------

.. class:: MemoryStatsCollector

    A simple stats collector that keeps the stats of the last scraping run
    (for each domain) in memory, which can be accessed through the
    ``domain_stats`` attribute.

    This is the default Stats Collector used in Scrapy.

    .. attribute:: domain_stats

        A dict of dicts (keyed by domain) containing the stats of the last
        scraping run for each domain.
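
For example, an extension or script running after a crawl could read the
collected stats back from this attribute. This is a hypothetical snippet that
assumes the default :class:`MemoryStatsCollector` is in use and that a run for
``example.com`` has finished::

    from scrapy.stats import stats

    # stats of the last finished run for a given domain (empty dict if none)
    last_run = stats.domain_stats.get('example.com', {})
    pages = last_run.get('pages_crawled', 0)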

DummyStatsCollector
-------------------

.. class:: DummyStatsCollector

    A Stats collector which does nothing but is very efficient. This is the
    Stats Collector used when stats are disabled (through the
    :setting:`STATS_ENABLED` setting).

SimpledbStatsCollector
----------------------

.. module:: scrapy.stats.collector.simpledb
   :synopsis: Simpledb Stats Collector

.. class:: SimpledbStatsCollector

    A Stats collector which persists stats to `Amazon SimpleDB`_, using one
    SimpleDB item per scraping run (i.e. it keeps a history of all scraping
    runs). The data is persisted to the SimpleDB domain specified by the
    :setting:`STATS_SDB_DOMAIN` setting. The domain will be created if it
    doesn't exist.

    In addition to the existing stats keys, the following keys are added at
    persistence time:

    * ``domain``: the spider domain (so you can use it later for querying
      stats for that domain)
    * ``timestamp``: the timestamp when the stats were persisted

    Both the ``domain`` and ``timestamp`` are used for generating the SimpleDB
    item name in order to avoid overwriting stats of previous scraping runs.

    As `required by SimpleDB`_, datetimes are stored in ISO 8601 format and
    numbers are zero-padded to 16 digits. Negative numbers are not currently
    supported.

    This Stats Collector requires the `boto`_ library.

.. _Amazon SimpleDB: http://aws.amazon.com/simpledb/
.. _required by SimpleDB: http://docs.amazonwebservices.com/AmazonSimpleDB/2009-04-15/DeveloperGuide/ZeroPadding.html
.. _boto: http://code.google.com/p/boto/

This Stats Collector can be configured through the following settings:

.. setting:: STATS_SDB_DOMAIN

STATS_SDB_DOMAIN
~~~~~~~~~~~~~~~~

Default: ``'scrapy_stats'``

A string containing the SimpleDB domain to use in the
:class:`SimpledbStatsCollector`.

.. setting:: STATS_SDB_ASYNC

STATS_SDB_ASYNC
~~~~~~~~~~~~~~~

Default: ``False``

If ``True``, communication with SimpleDB will be performed asynchronously. If
``False``, blocking IO will be used instead. Blocking IO is the default
because asynchronous communication can result in the stats not being persisted
if the Scrapy engine is shut down in the middle (for example, when you run
only one spider in a process and then exit).
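
Put together, switching a project to this collector could look like the
following in its settings file. This is an illustrative sketch only: the
domain name is a placeholder, and `boto`_ must be installed and configured
with your AWS credentials for the persistence to work::

    # settings.py -- illustrative values only
    STATS_CLASS = 'scrapy.stats.collector.simpledb.SimpledbStatsCollector'
    STATS_SDB_DOMAIN = 'my_project_stats'  # SimpleDB domain, created if missing
    STATS_SDB_ASYNC = False                # blocking IO, the safer default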

Stats signals
=============

The Stats Collector provides some signals for extending the stats collection
functionality:

.. module:: scrapy.stats.signals
   :synopsis: Stats Collector signals

.. signal:: stats_domain_opened
.. function:: stats_domain_opened(domain)

    Sent right after the stats domain is opened. You can use this signal to
    add startup stats for the domain (example: start time).

    :param domain: the stats domain just opened
    :type domain: str

.. signal:: stats_domain_closing
.. function:: stats_domain_closing(domain, reason)

    Sent just before the stats domain is closed. You can use this signal to
    add some closing stats (example: finish time).

    :param domain: the stats domain about to be closed
    :type domain: str

    :param reason: the reason why the domain is being closed. See the
        :signal:`domain_closed` signal for more info.
    :type reason: str

.. signal:: stats_domain_closed
.. function:: stats_domain_closed(domain, reason, domain_stats)

    Sent right after the stats domain is closed. You can use this signal to
    collect resources or to process the final domain stats, which are passed
    in the ``domain_stats`` argument.

    :param domain: the stats domain just closed
    :type domain: str

    :param reason: the reason why the domain was closed. See the
        :signal:`domain_closed` signal for more info.
    :type reason: str

    :param domain_stats: the stats of the domain just closed.
    :type domain_stats: dict
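
For example, an extension could connect to :signal:`stats_domain_closed` to
keep or export the final stats of each domain. The sketch below is
illustrative only: it assumes the pydispatch-based dispatcher bundled with
Scrapy at the time (``scrapy.xlib.pydispatch``), so adjust the import and the
handler to whatever your version provides::

    from scrapy.xlib.pydispatch import dispatcher  # assumed dispatcher location
    from scrapy.stats import signals as stats_signals

    final_stats = {}  # hypothetical place to keep the last run's stats per domain

    def store_final_stats(domain, reason, domain_stats):
        # handler arguments mirror the stats_domain_closed signature above
        final_stats[domain] = dict(domain_stats)

    dispatcher.connect(store_final_stats, signal=stats_signals.stats_domain_closed)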