Mirror of https://github.com/scrapy/scrapy.git
Refactored HttpCache middleware:

* simplified code
* performance improvements
* removed awkward/unused domain sectorization
* it can now receive a Settings object in its constructor
* added unittests
* added documentation about the filesystem storage structure

Also made scrapy.conf.Settings objects instantiable with a dict, which is used to override the default settings.
This commit is contained in:
parent db7fec1fef
commit 564abd10ad
@@ -214,25 +214,86 @@ HttpCacheMiddleware

 .. class:: HttpCacheMiddleware

    This middleware provides low-level cache to all HTTP requests and responses.
-   Every request and its corresponding response are cached and then, when that
-   same request is seen again, the response is returned without transferring
+   Every request and its corresponding response are cached. When the same
+   request is seen again, the response is returned without transferring
    anything from the Internet.

    The HTTP cache is useful for testing spiders faster (without having to wait for
    downloads every time) and for trying your spider off-line when you don't have
    an Internet connection.

-   The :class:`HttpCacheMiddleware` can be configured through the following
-   settings (see the settings documentation for more info):
+File system storage
+~~~~~~~~~~~~~~~~~~~
+
+By default, the :class:`HttpCacheMiddleware` uses a file system storage with the following structure:
+
+Each request/response pair is stored in a different directory containing
+the following files:
+
+* ``request_body`` - the plain request body
+* ``request_headers`` - the request headers (in raw HTTP format)
+* ``response_body`` - the plain response body
+* ``response_headers`` - the response headers (in raw HTTP format)
+* ``meta`` - some metadata of this cache resource in Python ``repr()`` format
+  (for easy grepability)
+* ``pickled_meta`` - the same metadata as in ``meta`` but pickled for more
+  efficient deserialization
+
+The directory name is made from the request fingerprint (see
+``scrapy.utils.request.request_fingerprint``), and one level of subdirectories is
+used to avoid creating too many files in the same directory (which is
+inefficient on many file systems). An example directory could be::
+
+   /path/to/cache/dir/example.com/72/72811f648e718090f041317756c03adb0ada46c7
+
+The cache storage backend can be changed with the :setting:`HTTPCACHE_STORAGE`
+setting, but no other backend is provided with Scrapy yet.
+
+Settings
+~~~~~~~~
+
+The :class:`HttpCacheMiddleware` can be configured through the following
+settings:
+
+.. setting:: HTTPCACHE_DIR
+
+HTTPCACHE_DIR
+^^^^^^^^^^^^^
+
+Default: ``''`` (empty string)
+
+The directory to use for storing the (low-level) HTTP cache. If empty, the HTTP
+cache will be disabled.
+
+.. setting:: HTTPCACHE_EXPIRATION_SECS
+
+HTTPCACHE_EXPIRATION_SECS
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Default: ``0``
+
+Number of seconds to use for HTTP cache expiration. Requests that were cached
+before this time will be re-downloaded. If zero, cached requests will always
+expire. Negative numbers mean requests will never expire.
+
+.. setting:: HTTPCACHE_IGNORE_MISSING
+
+HTTPCACHE_IGNORE_MISSING
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Default: ``False``
+
+If enabled, requests not found in the cache will be ignored instead of downloaded.
+
+.. setting:: HTTPCACHE_STORAGE
+
+HTTPCACHE_STORAGE
+^^^^^^^^^^^^^^^^^
+
+Default: ``'scrapy.contrib.downloadermiddleware.httpcache.FilesystemCacheStorage'``
+
+The class which implements the cache storage backend.
+
-   * :setting:`HTTPCACHE_DIR` - this one actually enables the cache besides
-     settings the cache dir
-   * :setting:`HTTPCACHE_IGNORE_MISSING` - ignoring missing requests instead
-     of downloading them
-   * :setting:`HTTPCACHE_SECTORIZE` - split HTTP cache in several directories
-     (for performance reasons)
-   * :setting:`HTTPCACHE_EXPIRATION_SECS` - how many secs until the cache is
-     considered out of date

 .. _topics-dlmw-robots:
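To make the storage layout described above concrete, here is a minimal sketch (not part of the commit; the cache directory and URL are hypothetical) of how a cache entry path can be derived from a request with ``scrapy.utils.request.request_fingerprint``, following the ``<cachedir>/<domain>/<first two hex chars>/<fingerprint>`` layout::

    from os.path import join

    from scrapy.http import Request
    from scrapy.utils.request import request_fingerprint

    # Hypothetical values, for illustration only.
    cachedir = '/path/to/cache/dir'
    request = Request('http://www.example.com/some/page')

    key = request_fingerprint(request)  # sha1 hex digest of the canonical request
    # One level of subdirectories (the first two hex chars) keeps any single
    # directory from accumulating too many entries.
    entry_dir = join(cachedir, 'example.com', key[0:2], key)
    # e.g. /path/to/cache/dir/example.com/72/72811f64...
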
@@ -220,45 +220,6 @@ Default: ``1.0``

 The version of the bot implemented by this Scrapy project. This will be used to
 construct the User-Agent by default.

-.. setting:: HTTPCACHE_DIR
-
-HTTPCACHE_DIR
--------------
-
-Default: ``''`` (empty string)
-
-The directory to use for storing the (low-level) HTTP cache. If empty the HTTP
-cache will be disabled.
-
-.. setting:: HTTPCACHE_EXPIRATION_SECS
-
-HTTPCACHE_EXPIRATION_SECS
--------------------------
-
-Default: ``0``
-
-Number of seconds to use for HTTP cache expiration. Requests that were cached
-before this time will be re-downloaded. If zero, cached requests will always
-expire. Negative numbers means requests will never expire.
-
-.. setting:: HTTPCACHE_IGNORE_MISSING
-
-HTTPCACHE_IGNORE_MISSING
-------------------------
-
-Default: ``False``
-
-If enabled, requests not found in the cache will be ignored instead of downloaded.
-
-.. setting:: HTTPCACHE_SECTORIZE
-
-HTTPCACHE_SECTORIZE
--------------------
-
-Default: ``True``
-
-Whether to split HTTP cache storage in several dirs for performance.
-
 .. setting:: COMMANDS_MODULE

 COMMANDS_MODULE
@@ -13,7 +13,7 @@ import_ = lambda x: __import__(x, {}, {}, [''])

 class Settings(object):

-    def __init__(self):
+    def __init__(self, overrides=None):
         self.defaults = {}
         self.global_defaults = default_settings
         self.disabled = os.environ.get('SCRAPY_SETTINGS_DISABLED', False)

@@ -24,6 +24,8 @@ class Settings(object):

         # XXX: find a better solution for this hack
         pickled_settings = os.environ.get("SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE")
         self.overrides = pickle.loads(pickled_settings) if pickled_settings else {}
+        if overrides:
+            self.overrides.update(overrides)

     def __getitem__(self, opt_name):
         if not self.disabled:
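As a usage sketch (illustrative, not part of the diff), the new ``overrides`` argument lets a ``Settings`` object be built from a plain dict, which is how the new tests configure the middleware; keys not present in the dict are assumed to fall back to the defaults::

    from scrapy.conf import Settings

    # Hypothetical override values, for illustration only.
    settings = Settings({'HTTPCACHE_DIR': '/tmp/scrapy-cache',
                         'HTTPCACHE_EXPIRATION_SECS': 3600})

    settings['HTTPCACHE_DIR']                     # '/tmp/scrapy-cache' (from the overrides dict)
    settings.getint('HTTPCACHE_EXPIRATION_SECS')  # 3600
    settings.getbool('HTTPCACHE_IGNORE_MISSING')  # False (falls back to default_settings)
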
@@ -94,7 +94,7 @@ GROUPSETTINGS_MODULE = ''

 HTTPCACHE_DIR = ''
 HTTPCACHE_IGNORE_MISSING = False
-HTTPCACHE_SECTORIZE = True
+HTTPCACHE_STORAGE = 'scrapy.contrib.downloadermiddleware.httpcache.FilesystemCacheStorage'
 HTTPCACHE_EXPIRATION_SECS = 0

 ITEM_PROCESSOR = 'scrapy.contrib.pipeline.ItemPipelineManager'
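Since the backend is now looked up via :setting:`HTTPCACHE_STORAGE` and ``load_object``, a project could in principle point that setting at its own class. A minimal sketch of an alternative backend (purely illustrative; only the filesystem backend ships with this commit), assuming the interface the middleware uses below (``open_spider``, ``close_spider``, ``retrieve_response``, ``store_response``)::

    from scrapy.utils.request import request_fingerprint

    class DummyMemoryCacheStorage(object):
        """Illustrative in-memory backend; loses its contents when the process exits."""

        def __init__(self, settings):
            self._cache = {}

        def open_spider(self, spider):
            pass

        def close_spider(self, spider):
            pass

        def retrieve_response(self, spider, request):
            return self._cache.get((spider.domain_name, request_fingerprint(request)))

        def store_response(self, spider, request, response):
            self._cache[(spider.domain_name, request_fingerprint(request))] = response

    # and, in the project settings (hypothetical module path):
    # HTTPCACHE_STORAGE = 'myproject.cache.DummyMemoryCacheStorage'
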
@@ -1,181 +1,122 @@
 from __future__ import with_statement

-import errno
 import os
-import hashlib
-import datetime
+from os.path import join, exists
+from time import time
 import cPickle as pickle
-from scrapy.xlib.pydispatch import dispatcher

+from scrapy.xlib.pydispatch import dispatcher
 from scrapy.core import signals
-from scrapy import log
 from scrapy.http import Headers
 from scrapy.core.exceptions import NotConfigured, IgnoreRequest
 from scrapy.core.downloader.responsetypes import responsetypes
 from scrapy.utils.request import request_fingerprint
 from scrapy.utils.http import headers_dict_to_raw, headers_raw_to_dict
 from scrapy.utils.httpobj import urlparse_cached
-from scrapy.conf import settings
+from scrapy.utils.misc import load_object
+from scrapy import conf


 class HttpCacheMiddleware(object):
-    def __init__(self):
-        if not settings['HTTPCACHE_DIR']:
-            raise NotConfigured
-        self.cache = Cache(settings['HTTPCACHE_DIR'], sectorize=settings.getbool('HTTPCACHE_SECTORIZE'))
-        self.ignore_missing = settings.getbool('HTTPCACHE_IGNORE_MISSING')
-        dispatcher.connect(self.open_domain, signal=signals.spider_opened)
-
-    def open_domain(self, spider):
-        self.cache.open_domain(spider.domain_name)
+    def __init__(self, settings=conf.settings):
+        self.storage = load_object(settings['HTTPCACHE_STORAGE'])(settings)
+        self.ignore_missing = settings.getbool('HTTPCACHE_IGNORE_MISSING')
+        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
+        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)
+
+    def spider_opened(self, spider):
+        self.storage.open_spider(spider)
+
+    def spider_closed(self, spider):
+        self.storage.close_spider(spider)

     def process_request(self, request, spider):
-        if not is_cacheable(request):
+        if not self.is_cacheable(request):
             return
-
-        key = request_fingerprint(request)
-        domain = spider.domain_name
-
-        try:
-            response = self.cache.retrieve_response(domain, key)
-        except:
-            log.msg("Corrupt cache for %s" % request.url, log.WARNING)
-            response = False
-
+        response = self.storage.retrieve_response(spider, request)
         if response:
+            response.flags.append('cached')
             return response
         elif self.ignore_missing:
             raise IgnoreRequest("Ignored request not in cache: %s" % request)

     def process_response(self, request, response, spider):
-        if is_cacheable(request):
-            key = request_fingerprint(request)
-            self.cache.store(spider.domain_name, key, request, response)
-
+        if self.is_cacheable(request):
+            self.storage.store_response(spider, request, response)
         return response

-
-def is_cacheable(request):
-    return urlparse_cached(request).scheme in ['http', 'https']
+    def is_cacheable(self, request):
+        return urlparse_cached(request).scheme in ['http', 'https']


-class Cache(object):
-    DOMAIN_SECTORDIR = 'data'
-    DOMAIN_LINKDIR = 'domains'
+class FilesystemCacheStorage(object):

-    def __init__(self, cachedir, sectorize=False):
+    def __init__(self, settings=conf.settings):
+        cachedir = settings['HTTPCACHE_DIR']
+        if not cachedir:
+            raise NotConfigured
         self.cachedir = cachedir
-        self.sectorize = sectorize
+        self.expiration_secs = settings.getint('HTTPCACHE_EXPIRATION_SECS')

-        self.baselinkpath = os.path.join(self.cachedir, self.DOMAIN_LINKDIR)
-        if not os.path.exists(self.baselinkpath):
-            os.makedirs(self.baselinkpath)
+    def open_spider(self, spider):
+        pass

-        self.basesectorpath = os.path.join(self.cachedir, self.DOMAIN_SECTORDIR)
-        if not os.path.exists(self.basesectorpath):
-            os.makedirs(self.basesectorpath)
+    def close_spider(self, spider):
+        pass

-    def domainsectorpath(self, domain):
-        sector = hashlib.sha1(domain).hexdigest()[0]
-        return os.path.join(self.basesectorpath, sector, domain)
-
-    def domainlinkpath(self, domain):
-        return os.path.join(self.baselinkpath, domain)
-
-    def requestpath(self, domain, key):
-        linkpath = self.domainlinkpath(domain)
-        return os.path.join(linkpath, key[0:2], key)
-
-    def open_domain(self, domain):
-        if domain:
-            linkpath = self.domainlinkpath(domain)
-            if self.sectorize:
-                sectorpath = self.domainsectorpath(domain)
-                if not os.path.exists(sectorpath):
-                    os.makedirs(sectorpath)
-                if not os.path.exists(linkpath):
-                    try:
-                        os.symlink(sectorpath, linkpath)
-                    except:
-                        os.makedirs(linkpath) # windows filesystem
-            else:
-                if not os.path.exists(linkpath):
-                    os.makedirs(linkpath)
-
-    def read_meta(self, domain, key):
-        """Return the metadata dictionary (possibly empty) if the entry is
-        cached, None otherwise.
-        """
-        requestpath = self.requestpath(domain, key)
-        try:
-            with open(os.path.join(requestpath, 'pickled_meta'), 'r') as f:
-                metadata = pickle.load(f)
-        except IOError, e:
-            if e.errno != errno.ENOENT:
-                raise
-            return None
-        expiration_secs = settings.getint('HTTPCACHE_EXPIRATION_SECS')
-        if expiration_secs >= 0:
-            expiration_date = metadata['timestamp'] + datetime.timedelta(seconds=expiration_secs)
-            if datetime.datetime.utcnow() > expiration_date:
-                log.msg('dropping old cached response from %s' % metadata['timestamp'], \
-                    level=log.DEBUG, domain=domain)
-                return None
-        return metadata
-
-    def retrieve_response(self, domain, key):
-        """
-        Return response dictionary if request has correspondent cache record;
-        return None if not.
-        """
-        metadata = self.read_meta(domain, key)
+    def retrieve_response(self, spider, request):
+        """Return response if present in cache, or None otherwise."""
+        metadata = self._read_meta(spider, request)
         if metadata is None:
-            return None # not cached
-
-        requestpath = self.requestpath(domain, key)
-        responsebody = responseheaders = None
-        with open(os.path.join(requestpath, 'response_body')) as f:
-            responsebody = f.read()
-        with open(os.path.join(requestpath, 'response_headers')) as f:
-            responseheaders = f.read()
-
+            return # not cached
+        rpath = self._get_request_path(spider, request)
+        with open(join(rpath, 'response_body'), 'rb') as f:
+            body = f.read()
+        with open(join(rpath, 'response_headers'), 'rb') as f:
+            rawheaders = f.read()
         url = metadata['url']
-        headers = Headers(headers_raw_to_dict(responseheaders))
         status = metadata['status']
-
+        headers = Headers(headers_raw_to_dict(rawheaders))
         respcls = responsetypes.from_args(headers=headers, url=url)
-        response = respcls(url=url, headers=headers, status=status, body=responsebody)
-        response.meta['cached'] = True
-        response.flags.append('cached')
+        response = respcls(url=url, headers=headers, status=status, body=body)
         return response

-    def store(self, domain, key, request, response):
-        requestpath = self.requestpath(domain, key)
-        if not os.path.exists(requestpath):
-            os.makedirs(requestpath)
-
+    def store_response(self, spider, request, response):
+        """Store the given response in the cache."""
+        rpath = self._get_request_path(spider, request)
+        if not exists(rpath):
+            os.makedirs(rpath)
         metadata = {
-            'url':request.url,
-            'method': request.method,
-            'status': response.status,
-            'domain': domain,
-            'timestamp': datetime.datetime.utcnow(),
-        }
-
-        # metadata
-        with open(os.path.join(requestpath, 'meta_data'), 'w') as f:
+            'url': request.url,
+            'method': request.method,
+            'status': response.status,
+            'timestamp': time(),
+        }
+        with open(join(rpath, 'meta'), 'wb') as f:
             f.write(repr(metadata))
-        # pickled metadata (to recover without using eval)
-        with open(os.path.join(requestpath, 'pickled_meta'), 'w') as f:
-            pickle.dump(metadata, f)
-        # response
-        with open(os.path.join(requestpath, 'response_headers'), 'w') as f:
+        with open(join(rpath, 'pickled_meta'), 'wb') as f:
+            pickle.dump(metadata, f, protocol=2)
+        with open(join(rpath, 'response_headers'), 'wb') as f:
             f.write(headers_dict_to_raw(response.headers))
-        with open(os.path.join(requestpath, 'response_body'), 'w') as f:
+        with open(join(rpath, 'response_body'), 'wb') as f:
             f.write(response.body)
-        # request
-        with open(os.path.join(requestpath, 'request_headers'), 'w') as f:
+        with open(join(rpath, 'request_headers'), 'wb') as f:
             f.write(headers_dict_to_raw(request.headers))
         if request.body:
-            with open(os.path.join(requestpath, 'request_body'), 'w') as f:
-                f.write(request.body)
+            with open(join(rpath, 'request_body'), 'wb') as f:
+                f.write(request.body)
+
+    def _get_request_path(self, spider, request):
+        key = request_fingerprint(request)
+        return join(self.cachedir, spider.domain_name, key[0:2], key)
+
+    def _read_meta(self, spider, request):
+        rpath = self._get_request_path(spider, request)
+        metapath = join(rpath, 'pickled_meta')
+        if not exists(metapath):
+            return # not found
+        mtime = os.stat(rpath).st_mtime
+        if 0 <= self.expiration_secs < time() - mtime:
+            return # expired
+        with open(metapath, 'rb') as f:
+            return pickle.load(f)
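The expiration check in ``_read_meta`` compares the age of the entry's directory (via its mtime) against ``HTTPCACHE_EXPIRATION_SECS``: a negative value never satisfies ``0 <= expiration_secs``, so such entries never expire, while zero expires everything. A small illustrative sketch of the same condition in isolation (the helper name is ours, not part of the commit)::

    import os
    from time import time

    def is_expired(rpath, expiration_secs):
        """Mirror of the check in FilesystemCacheStorage._read_meta (illustrative)."""
        mtime = os.stat(rpath).st_mtime
        age = time() - mtime
        # 0 <= expiration_secs < age  means:
        #   negative expiration_secs -> never expired
        #   zero                     -> effectively always expired
        #   positive                 -> expired once the entry is older than the limit
        return 0 <= expiration_secs < age
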
scrapy/tests/test_downloadermiddleware_httpcache.py (new file, 86 lines)

@@ -0,0 +1,86 @@
+import unittest, tempfile, shutil, time
+
+from scrapy.http import Response, HtmlResponse, Request
+from scrapy.spider import BaseSpider
+from scrapy.contrib.downloadermiddleware.httpcache import FilesystemCacheStorage, HttpCacheMiddleware
+from scrapy.conf import Settings
+from scrapy.core.exceptions import IgnoreRequest
+
+
+class HttpCacheMiddlewareTest(unittest.TestCase):
+
+    storage_class = FilesystemCacheStorage
+
+    def setUp(self):
+        self.spider = BaseSpider('example.com')
+        self.tmpdir = tempfile.mkdtemp()
+        self.request = Request('http://www.example.com', headers={'User-Agent': 'test'})
+        self.response = Response('http://www.example.com', headers={'Content-Type': 'text/html'}, body='test body', status=202)
+
+    def tearDown(self):
+        shutil.rmtree(self.tmpdir)
+
+    def _get_settings(self, **new_settings):
+        settings = {
+            'HTTPCACHE_DIR': self.tmpdir,
+            'HTTPCACHE_EXPIRATION_SECS': 1,
+        }
+        settings.update(new_settings)
+        return Settings(settings)
+
+    def _get_storage(self, **new_settings):
+        return self.storage_class(self._get_settings(**new_settings))
+
+    def _get_middleware(self, **new_settings):
+        return HttpCacheMiddleware(self._get_settings(**new_settings))
+
+    def test_storage(self):
+        storage = self._get_storage()
+        request2 = self.request.copy()
+        assert storage.retrieve_response(self.spider, request2) is None
+        storage.store_response(self.spider, self.request, self.response)
+        response2 = storage.retrieve_response(self.spider, request2)
+        assert isinstance(response2, HtmlResponse) # inferred from content-type header
+        self.assertEqualResponse(self.response, response2)
+        time.sleep(2) # wait for cache to expire
+        assert storage.retrieve_response(self.spider, request2) is None
+
+    def test_storage_expire_immediately(self):
+        storage = self._get_storage(HTTPCACHE_EXPIRATION_SECS=0)
+        assert storage.retrieve_response(self.spider, self.request) is None
+        storage.store_response(self.spider, self.request, self.response)
+        assert storage.retrieve_response(self.spider, self.request) is None
+
+    def test_storage_never_expire(self):
+        storage = self._get_storage(HTTPCACHE_EXPIRATION_SECS=-1)
+        assert storage.retrieve_response(self.spider, self.request) is None
+        storage.store_response(self.spider, self.request, self.response)
+        assert storage.retrieve_response(self.spider, self.request)
+
+    def test_middleware(self):
+        mw = HttpCacheMiddleware(self._get_settings())
+        assert mw.process_request(self.request, self.spider) is None
+        mw.process_response(self.request, self.response, self.spider)
+        response = mw.process_request(self.request, self.spider)
+        assert isinstance(response, HtmlResponse)
+        self.assertEqualResponse(self.response, response)
+        assert 'cached' in response.flags
+
+    def test_middleware_ignore_missing(self):
+        mw = self._get_middleware(HTTPCACHE_IGNORE_MISSING=True)
+        self.assertRaises(IgnoreRequest, mw.process_request, self.request, self.spider)
+        mw.process_response(self.request, self.response, self.spider)
+        response = mw.process_request(self.request, self.spider)
+        assert isinstance(response, HtmlResponse)
+        self.assertEqualResponse(self.response, response)
+        assert 'cached' in response.flags
+
+    def assertEqualResponse(self, response1, response2):
+        self.assertEqual(response1.url, response2.url)
+        self.assertEqual(response1.status, response2.status)
+        self.assertEqual(response1.headers, response2.headers)
+        self.assertEqual(response1.body, response2.body)
+
+
+if __name__ == '__main__':
+    unittest.main()