
Refactored HttpCache middleware:

* simplified code
* performance improvements
* removed awkward/unused domain sectorization
* it can now receive a Settings object in its constructor
* added unittests
* added documentation about filesystem storage structure

Also made scrapy.conf.Settings objects instantiable with a dict which is used to override default settings.
Pablo Hoffman 2009-11-13 14:25:47 -02:00
parent db7fec1fef
commit 564abd10ad
6 changed files with 240 additions and 189 deletions


@@ -214,25 +214,86 @@ HttpCacheMiddleware
.. class:: HttpCacheMiddleware
This middleware provides a low-level cache for all HTTP requests and responses.
Every request and its corresponding response are cached. When the same
request is seen again, the response is returned without transferring
anything from the Internet.
The HTTP cache is useful for testing spiders faster (without having to wait for
downloads every time) and for trying your spider off-line when you don't have
an Internet connection.
File system storage
~~~~~~~~~~~~~~~~~~~
By default, the :class:`HttpCacheMiddleware` uses a file system storage with the following structure:
Each request/response pair is stored in a different directory containing
the following files:
* ``request_body`` - the plain request body
* ``request_headers`` - the request headers (in raw HTTP format)
* ``response_body`` - the plain response body
* ``response_headers`` - the response headers (in raw HTTP format)
* ``meta`` - some metadata of this cache resource in Python ``repr()`` format
(for easy grepping)
* ``pickled_meta`` - the same metadata in ``meta`` but pickled for more
efficient deserialization
The directory name is made from the request fingerprint (see
``scrapy.utils.request.request_fingerprint``), and one level of subdirectories is
used to avoid creating too many files in the same directory (which is
inefficient in many file systems). An example directory could be::
/path/to/cache/dir/example.com/72/72811f648e718090f041317756c03adb0ada46c7
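
For illustration only, a rough sketch (not part of this commit's docs) of how
such a path can be derived with the helpers used by the new storage backend;
the cache directory and URL below are made up::

    from os.path import join
    from scrapy.http import Request
    from scrapy.utils.request import request_fingerprint

    request = Request('http://www.example.com/some/page')
    key = request_fingerprint(request)  # 40-char SHA1 hex digest
    # one level of subdirectories taken from the first two characters of the key;
    # the first path component is the spider's domain_name
    path = join('/path/to/cache/dir', 'example.com', key[0:2], key)
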
The cache storage backend can be changed with the :setting:`HTTPCACHE_STORAGE`
setting, but no other backend is provided with Scrapy yet.
Settings
~~~~~~~~
The :class:`HttpCacheMiddleware` can be configured through the following
settings:
.. setting:: HTTPCACHE_DIR
HTTPCACHE_DIR
^^^^^^^^^^^^^
Default: ``''`` (empty string)
The directory to use for storing the (low-level) HTTP cache. If empty the HTTP
cache will be disabled.
.. setting:: HTTPCACHE_EXPIRATION_SECS
HTTPCACHE_EXPIRATION_SECS
^^^^^^^^^^^^^^^^^^^^^^^^^
Default: ``0``
Number of seconds to use for HTTP cache expiration. Cached requests older than
this many seconds will be re-downloaded. If zero, cached requests will always
expire. Negative numbers mean cached requests never expire.
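
As a rough sketch of these semantics (mirroring the ``_read_meta`` check in the
new filesystem backend below, where ``mtime`` is the modification time of the
cached entry)::

    from time import time

    expiration_secs = 3600             # value of HTTPCACHE_EXPIRATION_SECS
    mtime = time() - 7200              # entry stored two hours ago
    # expired for any non-negative setting smaller than the entry's age;
    # a negative setting makes this False, i.e. the entry never expires
    expired = 0 <= expiration_secs < time() - mtime
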
.. setting:: HTTPCACHE_IGNORE_MISSING
HTTPCACHE_IGNORE_MISSING
^^^^^^^^^^^^^^^^^^^^^^^^
Default: ``False``
If enabled, requests not found in the cache will be ignored instead of downloaded.
.. setting:: HTTPCACHE_STORAGE
HTTPCACHE_STORAGE
^^^^^^^^^^^^^^^^^
Default: ``'scrapy.contrib.downloadermiddleware.httpcache.FilesystemCacheStorage'``
The class which implements the cache storage backend.
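
Any replacement backend needs to provide the interface the refactored
middleware calls: ``open_spider``, ``close_spider``, ``retrieve_response`` and
``store_response``. A minimal sketch, assuming a hypothetical in-memory backend
defined in your own project::

    from scrapy import conf

    class InMemoryCacheStorage(object):
        """Illustrative example only; not shipped with Scrapy."""

        def __init__(self, settings=conf.settings):
            self._cache = {}  # (domain, url) -> response

        def open_spider(self, spider):
            pass

        def close_spider(self, spider):
            pass

        def retrieve_response(self, spider, request):
            # return None when not cached, so the middleware downloads normally
            return self._cache.get((spider.domain_name, request.url))

        def store_response(self, spider, request, response):
            self._cache[(spider.domain_name, request.url)] = response

Such a class would then be referenced from :setting:`HTTPCACHE_STORAGE` by its
import path (e.g. ``'myproject.cache.InMemoryCacheStorage'``, a made-up module).
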
.. _topics-dlmw-robots:


@@ -220,45 +220,6 @@ Default: ``1.0``
The version of the bot implemented by this Scrapy project. This will be used to
construct the User-Agent by default.
.. setting:: HTTPCACHE_DIR
HTTPCACHE_DIR
-------------
Default: ``''`` (empty string)
The directory to use for storing the (low-level) HTTP cache. If empty the HTTP
cache will be disabled.
.. setting:: HTTPCACHE_EXPIRATION_SECS
HTTPCACHE_EXPIRATION_SECS
-------------------------
Default: ``0``
Number of seconds to use for HTTP cache expiration. Requests that were cached
before this time will be re-downloaded. If zero, cached requests will always
expire. Negative numbers means requests will never expire.
.. setting:: HTTPCACHE_IGNORE_MISSING
HTTPCACHE_IGNORE_MISSING
------------------------
Default: ``False``
If enabled, requests not found in the cache will be ignored instead of downloaded.
.. setting:: HTTPCACHE_SECTORIZE
HTTPCACHE_SECTORIZE
-------------------
Default: ``True``
Whether to split HTTP cache storage in several dirs for performance.
.. setting:: COMMANDS_MODULE
COMMANDS_MODULE


@@ -13,7 +13,7 @@ import_ = lambda x: __import__(x, {}, {}, [''])
class Settings(object):
def __init__(self):
def __init__(self, overrides=None):
self.defaults = {}
self.global_defaults = default_settings
self.disabled = os.environ.get('SCRAPY_SETTINGS_DISABLED', False)
@@ -24,6 +24,8 @@ class Settings(object):
# XXX: find a better solution for this hack
pickled_settings = os.environ.get("SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE")
self.overrides = pickle.loads(pickled_settings) if pickled_settings else {}
if overrides:
self.overrides.update(overrides)
def __getitem__(self, opt_name):
if not self.disabled:
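
For illustration, a small sketch of what the new ``overrides`` argument enables
(the values below are made up); this is the same mechanism the new unit tests
use to build throwaway settings::

    from scrapy.conf import Settings

    settings = Settings({'HTTPCACHE_DIR': '/tmp/scrapy-cache',
                         'HTTPCACHE_EXPIRATION_SECS': 3600})
    print settings['HTTPCACHE_DIR']                      # '/tmp/scrapy-cache'
    print settings.getint('HTTPCACHE_EXPIRATION_SECS')   # 3600
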


@@ -94,7 +94,7 @@ GROUPSETTINGS_MODULE = ''
HTTPCACHE_DIR = ''
HTTPCACHE_IGNORE_MISSING = False
HTTPCACHE_SECTORIZE = True
HTTPCACHE_STORAGE = 'scrapy.contrib.downloadermiddleware.httpcache.FilesystemCacheStorage'
HTTPCACHE_EXPIRATION_SECS = 0
ITEM_PROCESSOR = 'scrapy.contrib.pipeline.ItemPipelineManager'


@@ -1,181 +1,122 @@
from __future__ import with_statement
import errno
import os
import hashlib
import datetime
from os.path import join, exists
from time import time
import cPickle as pickle
from scrapy.xlib.pydispatch import dispatcher
from scrapy.core import signals
from scrapy import log
from scrapy.http import Headers
from scrapy.core.exceptions import NotConfigured, IgnoreRequest
from scrapy.core.downloader.responsetypes import responsetypes
from scrapy.utils.request import request_fingerprint
from scrapy.utils.http import headers_dict_to_raw, headers_raw_to_dict
from scrapy.utils.httpobj import urlparse_cached
from scrapy.conf import settings
from scrapy.utils.misc import load_object
from scrapy import conf
class HttpCacheMiddleware(object):
def __init__(self):
if not settings['HTTPCACHE_DIR']:
raise NotConfigured
self.cache = Cache(settings['HTTPCACHE_DIR'], sectorize=settings.getbool('HTTPCACHE_SECTORIZE'))
self.ignore_missing = settings.getbool('HTTPCACHE_IGNORE_MISSING')
dispatcher.connect(self.open_domain, signal=signals.spider_opened)
def open_domain(self, spider):
self.cache.open_domain(spider.domain_name)
def __init__(self, settings=conf.settings):
self.storage = load_object(settings['HTTPCACHE_STORAGE'])(settings)
self.ignore_missing = settings.getbool('HTTPCACHE_IGNORE_MISSING')
dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
dispatcher.connect(self.spider_closed, signal=signals.spider_closed)
def spider_opened(self, spider):
self.storage.open_spider(spider)
def spider_closed(self, spider):
self.storage.close_spider(spider)
def process_request(self, request, spider):
if not is_cacheable(request):
if not self.is_cacheable(request):
return
key = request_fingerprint(request)
domain = spider.domain_name
try:
response = self.cache.retrieve_response(domain, key)
except:
log.msg("Corrupt cache for %s" % request.url, log.WARNING)
response = False
response = self.storage.retrieve_response(spider, request)
if response:
response.flags.append('cached')
return response
elif self.ignore_missing:
raise IgnoreRequest("Ignored request not in cache: %s" % request)
def process_response(self, request, response, spider):
if is_cacheable(request):
key = request_fingerprint(request)
self.cache.store(spider.domain_name, key, request, response)
if self.is_cacheable(request):
self.storage.store_response(spider, request, response)
return response
def is_cacheable(request):
return urlparse_cached(request).scheme in ['http', 'https']
def is_cacheable(self, request):
return urlparse_cached(request).scheme in ['http', 'https']
class Cache(object):
DOMAIN_SECTORDIR = 'data'
DOMAIN_LINKDIR = 'domains'
class FilesystemCacheStorage(object):
def __init__(self, cachedir, sectorize=False):
def __init__(self, settings=conf.settings):
cachedir = settings['HTTPCACHE_DIR']
if not cachedir:
raise NotConfigured
self.cachedir = cachedir
self.sectorize = sectorize
self.expiration_secs = settings.getint('HTTPCACHE_EXPIRATION_SECS')
self.baselinkpath = os.path.join(self.cachedir, self.DOMAIN_LINKDIR)
if not os.path.exists(self.baselinkpath):
os.makedirs(self.baselinkpath)
def open_spider(self, spider):
pass
self.basesectorpath = os.path.join(self.cachedir, self.DOMAIN_SECTORDIR)
if not os.path.exists(self.basesectorpath):
os.makedirs(self.basesectorpath)
def close_spider(self, spider):
pass
def domainsectorpath(self, domain):
sector = hashlib.sha1(domain).hexdigest()[0]
return os.path.join(self.basesectorpath, sector, domain)
def domainlinkpath(self, domain):
return os.path.join(self.baselinkpath, domain)
def requestpath(self, domain, key):
linkpath = self.domainlinkpath(domain)
return os.path.join(linkpath, key[0:2], key)
def open_domain(self, domain):
if domain:
linkpath = self.domainlinkpath(domain)
if self.sectorize:
sectorpath = self.domainsectorpath(domain)
if not os.path.exists(sectorpath):
os.makedirs(sectorpath)
if not os.path.exists(linkpath):
try:
os.symlink(sectorpath, linkpath)
except:
os.makedirs(linkpath) # windows filesystem
else:
if not os.path.exists(linkpath):
os.makedirs(linkpath)
def read_meta(self, domain, key):
"""Return the metadata dictionary (possibly empty) if the entry is
cached, None otherwise.
"""
requestpath = self.requestpath(domain, key)
try:
with open(os.path.join(requestpath, 'pickled_meta'), 'r') as f:
metadata = pickle.load(f)
except IOError, e:
if e.errno != errno.ENOENT:
raise
return None
expiration_secs = settings.getint('HTTPCACHE_EXPIRATION_SECS')
if expiration_secs >= 0:
expiration_date = metadata['timestamp'] + datetime.timedelta(seconds=expiration_secs)
if datetime.datetime.utcnow() > expiration_date:
log.msg('dropping old cached response from %s' % metadata['timestamp'], \
level=log.DEBUG, domain=domain)
return None
return metadata
def retrieve_response(self, domain, key):
"""
Return response dictionary if request has correspondent cache record;
return None if not.
"""
metadata = self.read_meta(domain, key)
def retrieve_response(self, spider, request):
"""Return response if present in cache, or None otherwise."""
metadata = self._read_meta(spider, request)
if metadata is None:
return None # not cached
requestpath = self.requestpath(domain, key)
responsebody = responseheaders = None
with open(os.path.join(requestpath, 'response_body')) as f:
responsebody = f.read()
with open(os.path.join(requestpath, 'response_headers')) as f:
responseheaders = f.read()
return # not cached
rpath = self._get_request_path(spider, request)
with open(join(rpath, 'response_body'), 'rb') as f:
body = f.read()
with open(join(rpath, 'response_headers'), 'rb') as f:
rawheaders = f.read()
url = metadata['url']
headers = Headers(headers_raw_to_dict(responseheaders))
status = metadata['status']
headers = Headers(headers_raw_to_dict(rawheaders))
respcls = responsetypes.from_args(headers=headers, url=url)
response = respcls(url=url, headers=headers, status=status, body=responsebody)
response.meta['cached'] = True
response.flags.append('cached')
response = respcls(url=url, headers=headers, status=status, body=body)
return response
def store(self, domain, key, request, response):
requestpath = self.requestpath(domain, key)
if not os.path.exists(requestpath):
os.makedirs(requestpath)
def store_response(self, spider, request, response):
"""Store the given response in the cache."""
rpath = self._get_request_path(spider, request)
if not exists(rpath):
os.makedirs(rpath)
metadata = {
'url':request.url,
'method': request.method,
'status': response.status,
'domain': domain,
'timestamp': datetime.datetime.utcnow(),
}
# metadata
with open(os.path.join(requestpath, 'meta_data'), 'w') as f:
'url': request.url,
'method': request.method,
'status': response.status,
'timestamp': time(),
}
with open(join(rpath, 'meta'), 'wb') as f:
f.write(repr(metadata))
# pickled metadata (to recover without using eval)
with open(os.path.join(requestpath, 'pickled_meta'), 'w') as f:
pickle.dump(metadata, f)
# response
with open(os.path.join(requestpath, 'response_headers'), 'w') as f:
with open(join(rpath, 'pickled_meta'), 'wb') as f:
pickle.dump(metadata, f, protocol=2)
with open(join(rpath, 'response_headers'), 'wb') as f:
f.write(headers_dict_to_raw(response.headers))
with open(os.path.join(requestpath, 'response_body'), 'w') as f:
with open(join(rpath, 'response_body'), 'wb') as f:
f.write(response.body)
# request
with open(os.path.join(requestpath, 'request_headers'), 'w') as f:
with open(join(rpath, 'request_headers'), 'wb') as f:
f.write(headers_dict_to_raw(request.headers))
if request.body:
with open(os.path.join(requestpath, 'request_body'), 'w') as f:
f.write(request.body)
with open(join(rpath, 'request_body'), 'wb') as f:
f.write(request.body)
def _get_request_path(self, spider, request):
key = request_fingerprint(request)
return join(self.cachedir, spider.domain_name, key[0:2], key)
def _read_meta(self, spider, request):
rpath = self._get_request_path(spider, request)
metapath = join(rpath, 'pickled_meta')
if not exists(metapath):
return # not found
mtime = os.stat(rpath).st_mtime
if 0 <= self.expiration_secs < time() - mtime:
return # expired
with open(metapath, 'rb') as f:
return pickle.load(f)


@@ -0,0 +1,86 @@
import unittest, tempfile, shutil, time
from scrapy.http import Response, HtmlResponse, Request
from scrapy.spider import BaseSpider
from scrapy.contrib.downloadermiddleware.httpcache import FilesystemCacheStorage, HttpCacheMiddleware
from scrapy.conf import Settings
from scrapy.core.exceptions import IgnoreRequest
class HttpCacheMiddlewareTest(unittest.TestCase):
storage_class = FilesystemCacheStorage
def setUp(self):
self.spider = BaseSpider('example.com')
self.tmpdir = tempfile.mkdtemp()
self.request = Request('http://www.example.com', headers={'User-Agent': 'test'})
self.response = Response('http://www.example.com', headers={'Content-Type': 'text/html'}, body='test body', status=202)
def tearDown(self):
shutil.rmtree(self.tmpdir)
def _get_settings(self, **new_settings):
settings = {
'HTTPCACHE_DIR': self.tmpdir,
'HTTPCACHE_EXPIRATION_SECS': 1,
}
settings.update(new_settings)
return Settings(settings)
def _get_storage(self, **new_settings):
return self.storage_class(self._get_settings(**new_settings))
def _get_middleware(self, **new_settings):
return HttpCacheMiddleware(self._get_settings(**new_settings))
def test_storage(self):
storage = self._get_storage()
request2 = self.request.copy()
assert storage.retrieve_response(self.spider, request2) is None
storage.store_response(self.spider, self.request, self.response)
response2 = storage.retrieve_response(self.spider, request2)
assert isinstance(response2, HtmlResponse) # inferred from content-type header
self.assertEqualResponse(self.response, response2)
time.sleep(2) # wait for cache to expire
assert storage.retrieve_response(self.spider, request2) is None
def test_storage_expire_immediately(self):
storage = self._get_storage(HTTPCACHE_EXPIRATION_SECS=0)
assert storage.retrieve_response(self.spider, self.request) is None
storage.store_response(self.spider, self.request, self.response)
assert storage.retrieve_response(self.spider, self.request) is None
def test_storage_never_expire(self):
storage = self._get_storage(HTTPCACHE_EXPIRATION_SECS=-1)
assert storage.retrieve_response(self.spider, self.request) is None
storage.store_response(self.spider, self.request, self.response)
assert storage.retrieve_response(self.spider, self.request)
def test_middleware(self):
mw = HttpCacheMiddleware(self._get_settings())
assert mw.process_request(self.request, self.spider) is None
mw.process_response(self.request, self.response, self.spider)
response = mw.process_request(self.request, self.spider)
assert isinstance(response, HtmlResponse)
self.assertEqualResponse(self.response, response)
assert 'cached' in response.flags
def test_middleware_ignore_missing(self):
mw = self._get_middleware(HTTPCACHE_IGNORE_MISSING=True)
self.assertRaises(IgnoreRequest, mw.process_request, self.request, self.spider)
mw.process_response(self.request, self.response, self.spider)
response = mw.process_request(self.request, self.spider)
assert isinstance(response, HtmlResponse)
self.assertEqualResponse(self.response, response)
assert 'cached' in response.flags
def assertEqualResponse(self, response1, response2):
self.assertEqual(response1.url, response2.url)
self.assertEqual(response1.status, response2.status)
self.assertEqual(response1.headers, response2.headers)
self.assertEqual(response1.body, response2.body)
if __name__ == '__main__':
unittest.main()