
Refactored HttpCache middleware:

* simplified code
* performance improvements
* removed awkward/unused domain sectorization
* it can now receive a Settings object in its constructor
* added unittests
* added documentation about filesystem storage structure

Also made scrapy.conf.Settings objects instantiable with a dict which is used to override default settings.
Pablo Hoffman 2009-11-13 14:25:47 -02:00
parent db7fec1fef
commit 564abd10ad
6 changed files with 240 additions and 189 deletions


@@ -214,25 +214,86 @@ HttpCacheMiddleware
.. class:: HttpCacheMiddleware
This middleware provides a low-level cache for all HTTP requests and responses.
Every request and its corresponding response are cached. When the same
request is seen again, the response is returned without transferring
anything from the Internet.
The HTTP cache is useful for testing spiders faster (without having to wait for
downloads every time) and for trying your spider off-line when you don't have
an Internet connection.
File system storage
~~~~~~~~~~~~~~~~~~~
By default, the :class:`HttpCacheMiddleware` uses a file system storage with the following structure:
Each request/response pair is stored in a different directory containing
the following files:
* ``request_body`` - the plain request body
* ``request_headers`` - the request headers (in raw HTTP format)
* ``response_body`` - the plain response body
* ``response_headers`` - the response headers (in raw HTTP format)
* ``meta`` - some metadata of this cache resource in Python ``repr()`` format
(for easy grepping)
* ``pickled_meta`` - the same metadata in ``meta`` but pickled for more
efficient deserialization
The directory name is made from the request fingerprint (see
``scrapy.utils.request.request_fingerprint``), and one level of subdirectories is
used to avoid creating too many files in the same directory (which is
inefficient in many file systems). An example directory could be::
/path/to/cache/dir/example.com/72/72811f648e718090f041317756c03adb0ada46c7
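
For illustration only, a rough sketch (not part of this commit's docs) of how
such a path can be derived with the helpers used by the new storage backend;
the cache directory and URL below are made up::

    from os.path import join
    from scrapy.http import Request
    from scrapy.utils.request import request_fingerprint

    request = Request('http://www.example.com/some/page')
    key = request_fingerprint(request)  # 40-char SHA1 hex digest
    # one level of subdirectories taken from the first two characters of the key;
    # the first path component is the spider's domain_name
    path = join('/path/to/cache/dir', 'example.com', key[0:2], key)
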
The cache storage backend can be changed with the :setting:`HTTPCACHE_STORAGE`
setting, but no other backend is provided with Scrapy yet.
Settings
~~~~~~~~
The :class:`HttpCacheMiddleware` can be configured through the following
settings:
.. setting:: HTTPCACHE_DIR
HTTPCACHE_DIR
^^^^^^^^^^^^^
Default: ``''`` (empty string)
The directory to use for storing the (low-level) HTTP cache. If empty the HTTP
cache will be disabled.
.. setting:: HTTPCACHE_EXPIRATION_SECS
HTTPCACHE_EXPIRATION_SECS
^^^^^^^^^^^^^^^^^^^^^^^^^
Default: ``0``
Number of seconds to use for HTTP cache expiration. Cached requests older than
this many seconds will be re-downloaded. If zero, cached requests will always
expire. Negative numbers mean cached requests never expire.
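
As a rough sketch of these semantics (mirroring the ``_read_meta`` check in the
new filesystem backend below, where ``mtime`` is the modification time of the
cached entry)::

    from time import time

    expiration_secs = 3600             # value of HTTPCACHE_EXPIRATION_SECS
    mtime = time() - 7200              # entry stored two hours ago
    # expired for any non-negative setting smaller than the entry's age;
    # a negative setting makes this False, i.e. the entry never expires
    expired = 0 <= expiration_secs < time() - mtime
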
.. setting:: HTTPCACHE_IGNORE_MISSING
HTTPCACHE_IGNORE_MISSING
^^^^^^^^^^^^^^^^^^^^^^^^
Default: ``False``
If enabled, requests not found in the cache will be ignored instead of downloaded.
.. setting:: HTTPCACHE_STORAGE
HTTPCACHE_STORAGE
^^^^^^^^^^^^^^^^^
Default: ``'scrapy.contrib.downloadermiddleware.httpcache.FilesystemCacheStorage'``
The class which implements the cache storage backend.
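
Any replacement backend needs to provide the interface the refactored
middleware calls: ``open_spider``, ``close_spider``, ``retrieve_response`` and
``store_response``. A minimal sketch, assuming a hypothetical in-memory backend
defined in your own project::

    from scrapy import conf

    class InMemoryCacheStorage(object):
        """Illustrative example only; not shipped with Scrapy."""

        def __init__(self, settings=conf.settings):
            self._cache = {}  # (domain, url) -> response

        def open_spider(self, spider):
            pass

        def close_spider(self, spider):
            pass

        def retrieve_response(self, spider, request):
            # return None when not cached, so the middleware downloads normally
            return self._cache.get((spider.domain_name, request.url))

        def store_response(self, spider, request, response):
            self._cache[(spider.domain_name, request.url)] = response

Such a class would then be referenced from :setting:`HTTPCACHE_STORAGE` by its
import path (e.g. ``'myproject.cache.InMemoryCacheStorage'``, a made-up module).
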
.. _topics-dlmw-robots:


@@ -220,45 +220,6 @@ Default: ``1.0``
The version of the bot implemented by this Scrapy project. This will be used to
construct the User-Agent by default.
.. setting:: HTTPCACHE_DIR
HTTPCACHE_DIR
-------------
Default: ``''`` (empty string)
The directory to use for storing the (low-level) HTTP cache. If empty the HTTP
cache will be disabled.
.. setting:: HTTPCACHE_EXPIRATION_SECS
HTTPCACHE_EXPIRATION_SECS
-------------------------
Default: ``0``
Number of seconds to use for HTTP cache expiration. Requests that were cached
before this time will be re-downloaded. If zero, cached requests will always
expire. Negative numbers means requests will never expire.
.. setting:: HTTPCACHE_IGNORE_MISSING
HTTPCACHE_IGNORE_MISSING
------------------------
Default: ``False``
If enabled, requests not found in the cache will be ignored instead of downloaded.
.. setting:: HTTPCACHE_SECTORIZE
HTTPCACHE_SECTORIZE
-------------------
Default: ``True``
Whether to split HTTP cache storage in several dirs for performance.
.. setting:: COMMANDS_MODULE
COMMANDS_MODULE


@@ -13,7 +13,7 @@ import_ = lambda x: __import__(x, {}, {}, [''])
class Settings(object):
def __init__(self):
def __init__(self, overrides=None):
self.defaults = {}
self.global_defaults = default_settings
self.disabled = os.environ.get('SCRAPY_SETTINGS_DISABLED', False)
@@ -24,6 +24,8 @@ class Settings(object):
# XXX: find a better solution for this hack
pickled_settings = os.environ.get("SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE")
self.overrides = pickle.loads(pickled_settings) if pickled_settings else {}
if overrides:
self.overrides.update(overrides)
def __getitem__(self, opt_name):
if not self.disabled:
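
For illustration, a small sketch of what the new ``overrides`` argument enables
(the values below are made up); this is the same mechanism the new unit tests
use to build throwaway settings::

    from scrapy.conf import Settings

    settings = Settings({'HTTPCACHE_DIR': '/tmp/scrapy-cache',
                         'HTTPCACHE_EXPIRATION_SECS': 3600})
    print settings['HTTPCACHE_DIR']                      # '/tmp/scrapy-cache'
    print settings.getint('HTTPCACHE_EXPIRATION_SECS')   # 3600
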


@@ -94,7 +94,7 @@ GROUPSETTINGS_MODULE = ''
HTTPCACHE_DIR = ''
HTTPCACHE_IGNORE_MISSING = False
HTTPCACHE_SECTORIZE = True
HTTPCACHE_STORAGE = 'scrapy.contrib.downloadermiddleware.httpcache.FilesystemCacheStorage'
HTTPCACHE_EXPIRATION_SECS = 0
ITEM_PROCESSOR = 'scrapy.contrib.pipeline.ItemPipelineManager'


@@ -1,181 +1,122 @@
from __future__ import with_statement
import errno
import os
import hashlib
import datetime
from os.path import join, exists
from time import time
import cPickle as pickle
from scrapy.xlib.pydispatch import dispatcher
from scrapy.core import signals
from scrapy import log
from scrapy.http import Headers
from scrapy.core.exceptions import NotConfigured, IgnoreRequest
from scrapy.core.downloader.responsetypes import responsetypes
from scrapy.utils.request import request_fingerprint
from scrapy.utils.http import headers_dict_to_raw, headers_raw_to_dict
from scrapy.utils.httpobj import urlparse_cached
from scrapy.conf import settings
from scrapy.utils.misc import load_object
from scrapy import conf
class HttpCacheMiddleware(object):
def __init__(self):
if not settings['HTTPCACHE_DIR']:
raise NotConfigured
self.cache = Cache(settings['HTTPCACHE_DIR'], sectorize=settings.getbool('HTTPCACHE_SECTORIZE'))
self.ignore_missing = settings.getbool('HTTPCACHE_IGNORE_MISSING')
dispatcher.connect(self.open_domain, signal=signals.spider_opened)
def open_domain(self, spider):
self.cache.open_domain(spider.domain_name)
def __init__(self, settings=conf.settings):
self.storage = load_object(settings['HTTPCACHE_STORAGE'])(settings)
self.ignore_missing = settings.getbool('HTTPCACHE_IGNORE_MISSING')
dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
dispatcher.connect(self.spider_closed, signal=signals.spider_closed)
def spider_opened(self, spider):
self.storage.open_spider(spider)
def spider_closed(self, spider):
self.storage.close_spider(spider)
def process_request(self, request, spider):
if not is_cacheable(request):
if not self.is_cacheable(request):
return
key = request_fingerprint(request)
domain = spider.domain_name
try:
response = self.cache.retrieve_response(domain, key)
except:
log.msg("Corrupt cache for %s" % request.url, log.WARNING)
response = False
response = self.storage.retrieve_response(spider, request)
if response:
response.flags.append('cached')
return response
elif self.ignore_missing:
raise IgnoreRequest("Ignored request not in cache: %s" % request)
def process_response(self, request, response, spider):
if is_cacheable(request):
key = request_fingerprint(request)
self.cache.store(spider.domain_name, key, request, response)
if self.is_cacheable(request):
self.storage.store_response(spider, request, response)
return response
def is_cacheable(request):
return urlparse_cached(request).scheme in ['http', 'https']
def is_cacheable(self, request):
return urlparse_cached(request).scheme in ['http', 'https']
class Cache(object):
DOMAIN_SECTORDIR = 'data'
DOMAIN_LINKDIR = 'domains'
class FilesystemCacheStorage(object):
def __init__(self, cachedir, sectorize=False):
def __init__(self, settings=conf.settings):
cachedir = settings['HTTPCACHE_DIR']
if not cachedir:
raise NotConfigured
self.cachedir = cachedir
self.sectorize = sectorize
self.expiration_secs = settings.getint('HTTPCACHE_EXPIRATION_SECS')
self.baselinkpath = os.path.join(self.cachedir, self.DOMAIN_LINKDIR)
if not os.path.exists(self.baselinkpath):
os.makedirs(self.baselinkpath)
def open_spider(self, spider):
pass
self.basesectorpath = os.path.join(self.cachedir, self.DOMAIN_SECTORDIR)
if not os.path.exists(self.basesectorpath):
os.makedirs(self.basesectorpath)
def close_spider(self, spider):
pass
def domainsectorpath(self, domain):
sector = hashlib.sha1(domain).hexdigest()[0]
return os.path.join(self.basesectorpath, sector, domain)
def domainlinkpath(self, domain):
return os.path.join(self.baselinkpath, domain)
def requestpath(self, domain, key):
linkpath = self.domainlinkpath(domain)
return os.path.join(linkpath, key[0:2], key)
def open_domain(self, domain):
if domain:
linkpath = self.domainlinkpath(domain)
if self.sectorize:
sectorpath = self.domainsectorpath(domain)
if not os.path.exists(sectorpath):
os.makedirs(sectorpath)
if not os.path.exists(linkpath):
try:
os.symlink(sectorpath, linkpath)
except:
os.makedirs(linkpath) # windows filesystem
else:
if not os.path.exists(linkpath):
os.makedirs(linkpath)
def read_meta(self, domain, key):
"""Return the metadata dictionary (possibly empty) if the entry is
cached, None otherwise.
"""
requestpath = self.requestpath(domain, key)
try:
with open(os.path.join(requestpath, 'pickled_meta'), 'r') as f:
metadata = pickle.load(f)
except IOError, e:
if e.errno != errno.ENOENT:
raise
return None
expiration_secs = settings.getint('HTTPCACHE_EXPIRATION_SECS')
if expiration_secs >= 0:
expiration_date = metadata['timestamp'] + datetime.timedelta(seconds=expiration_secs)
if datetime.datetime.utcnow() > expiration_date:
log.msg('dropping old cached response from %s' % metadata['timestamp'], \
level=log.DEBUG, domain=domain)
return None
return metadata
def retrieve_response(self, domain, key):
"""
Return response dictionary if request has correspondent cache record;
return None if not.
"""
metadata = self.read_meta(domain, key)
def retrieve_response(self, spider, request):
"""Return response if present in cache, or None otherwise."""
metadata = self._read_meta(spider, request)
if metadata is None:
return None # not cached
requestpath = self.requestpath(domain, key)
responsebody = responseheaders = None
with open(os.path.join(requestpath, 'response_body')) as f:
responsebody = f.read()
with open(os.path.join(requestpath, 'response_headers')) as f:
responseheaders = f.read()
return # not cached
rpath = self._get_request_path(spider, request)
with open(join(rpath, 'response_body'), 'rb') as f:
body = f.read()
with open(join(rpath, 'response_headers'), 'rb') as f:
rawheaders = f.read()
url = metadata['url']
headers = Headers(headers_raw_to_dict(responseheaders))
status = metadata['status']
headers = Headers(headers_raw_to_dict(rawheaders))
respcls = responsetypes.from_args(headers=headers, url=url)
response = respcls(url=url, headers=headers, status=status, body=responsebody)
response.meta['cached'] = True
response.flags.append('cached')
response = respcls(url=url, headers=headers, status=status, body=body)
return response
def store(self, domain, key, request, response):
requestpath = self.requestpath(domain, key)
if not os.path.exists(requestpath):
os.makedirs(requestpath)
def store_response(self, spider, request, response):
"""Store the given response in the cache."""
rpath = self._get_request_path(spider, request)
if not exists(rpath):
os.makedirs(rpath)
metadata = {
'url':request.url,
'method': request.method,
'status': response.status,
'domain': domain,
'timestamp': datetime.datetime.utcnow(),
}
# metadata
with open(os.path.join(requestpath, 'meta_data'), 'w') as f:
'url': request.url,
'method': request.method,
'status': response.status,
'timestamp': time(),
}
with open(join(rpath, 'meta'), 'wb') as f:
f.write(repr(metadata))
# pickled metadata (to recover without using eval)
with open(os.path.join(requestpath, 'pickled_meta'), 'w') as f:
pickle.dump(metadata, f)
# response
with open(os.path.join(requestpath, 'response_headers'), 'w') as f:
with open(join(rpath, 'pickled_meta'), 'wb') as f:
pickle.dump(metadata, f, protocol=2)
with open(join(rpath, 'response_headers'), 'wb') as f:
f.write(headers_dict_to_raw(response.headers))
with open(os.path.join(requestpath, 'response_body'), 'w') as f:
with open(join(rpath, 'response_body'), 'wb') as f:
f.write(response.body)
# request
with open(os.path.join(requestpath, 'request_headers'), 'w') as f:
with open(join(rpath, 'request_headers'), 'wb') as f:
f.write(headers_dict_to_raw(request.headers))
if request.body:
with open(os.path.join(requestpath, 'request_body'), 'w') as f:
f.write(request.body)
with open(join(rpath, 'request_body'), 'wb') as f:
f.write(request.body)
def _get_request_path(self, spider, request):
key = request_fingerprint(request)
return join(self.cachedir, spider.domain_name, key[0:2], key)
def _read_meta(self, spider, request):
rpath = self._get_request_path(spider, request)
metapath = join(rpath, 'pickled_meta')
if not exists(metapath):
return # not found
mtime = os.stat(rpath).st_mtime
if 0 <= self.expiration_secs < time() - mtime:
return # expired
with open(metapath, 'rb') as f:
return pickle.load(f)


@@ -0,0 +1,86 @@
import unittest, tempfile, shutil, time
from scrapy.http import Response, HtmlResponse, Request
from scrapy.spider import BaseSpider
from scrapy.contrib.downloadermiddleware.httpcache import FilesystemCacheStorage, HttpCacheMiddleware
from scrapy.conf import Settings
from scrapy.core.exceptions import IgnoreRequest
class HttpCacheMiddlewareTest(unittest.TestCase):
storage_class = FilesystemCacheStorage
def setUp(self):
self.spider = BaseSpider('example.com')
self.tmpdir = tempfile.mkdtemp()
self.request = Request('http://www.example.com', headers={'User-Agent': 'test'})
self.response = Response('http://www.example.com', headers={'Content-Type': 'text/html'}, body='test body', status=202)
def tearDown(self):
shutil.rmtree(self.tmpdir)
def _get_settings(self, **new_settings):
settings = {
'HTTPCACHE_DIR': self.tmpdir,
'HTTPCACHE_EXPIRATION_SECS': 1,
}
settings.update(new_settings)
return Settings(settings)
def _get_storage(self, **new_settings):
return self.storage_class(self._get_settings(**new_settings))
def _get_middleware(self, **new_settings):
return HttpCacheMiddleware(self._get_settings(**new_settings))
def test_storage(self):
storage = self._get_storage()
request2 = self.request.copy()
assert storage.retrieve_response(self.spider, request2) is None
storage.store_response(self.spider, self.request, self.response)
response2 = storage.retrieve_response(self.spider, request2)
assert isinstance(response2, HtmlResponse) # inferred from content-type header
self.assertEqualResponse(self.response, response2)
time.sleep(2) # wait for cache to expire
assert storage.retrieve_response(self.spider, request2) is None
def test_storage_expire_immediately(self):
storage = self._get_storage(HTTPCACHE_EXPIRATION_SECS=0)
assert storage.retrieve_response(self.spider, self.request) is None
storage.store_response(self.spider, self.request, self.response)
assert storage.retrieve_response(self.spider, self.request) is None
def test_storage_never_expire(self):
storage = self._get_storage(HTTPCACHE_EXPIRATION_SECS=-1)
assert storage.retrieve_response(self.spider, self.request) is None
storage.store_response(self.spider, self.request, self.response)
assert storage.retrieve_response(self.spider, self.request)
def test_middleware(self):
mw = HttpCacheMiddleware(self._get_settings())
assert mw.process_request(self.request, self.spider) is None
mw.process_response(self.request, self.response, self.spider)
response = mw.process_request(self.request, self.spider)
assert isinstance(response, HtmlResponse)
self.assertEqualResponse(self.response, response)
assert 'cached' in response.flags
def test_middleware_ignore_missing(self):
mw = self._get_middleware(HTTPCACHE_IGNORE_MISSING=True)
self.assertRaises(IgnoreRequest, mw.process_request, self.request, self.spider)
mw.process_response(self.request, self.response, self.spider)
response = mw.process_request(self.request, self.spider)
assert isinstance(response, HtmlResponse)
self.assertEqualResponse(self.response, response)
assert 'cached' in response.flags
def assertEqualResponse(self, response1, response2):
self.assertEqual(response1.url, response2.url)
self.assertEqual(response1.status, response2.status)
self.assertEqual(response1.headers, response2.headers)
self.assertEqual(response1.body, response2.body)
if __name__ == '__main__':
unittest.main()