.. _topics-downloader-middleware:

=====================
Downloader Middleware
=====================

The downloader middleware is a framework of hooks into Scrapy's
request/response processing. It's a light, low-level system for globally
altering Scrapy's requests and responses.

.. _topics-downloader-middleware-setting:

Activating a downloader middleware
==================================

To activate a downloader middleware component, add it to the
:setting:`DOWNLOADER_MIDDLEWARES` setting, which is a dict whose keys are the
middleware class paths and their values are the middleware orders.

Here's an example::

    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.CustomDownloaderMiddleware': 543,
    }

The :setting:`DOWNLOADER_MIDDLEWARES` setting is merged with the
:setting:`DOWNLOADER_MIDDLEWARES_BASE` setting defined in Scrapy (and not meant
to be overridden) and then sorted by order to get the final sorted list of
enabled middlewares: the first middleware is the one closer to the engine and
the last is the one closer to the downloader.

To decide which order to assign to your middleware see the
:setting:`DOWNLOADER_MIDDLEWARES_BASE` setting and pick a value according to
where you want to insert the middleware. The order does matter because each
middleware performs a different action and your middleware could depend on some
previous (or subsequent) middleware being applied.
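
For example, suppose you want your middleware's ``process_request()`` to run
before the built-in user agent middleware. Assuming that middleware is
registered at order 400 in :setting:`DOWNLOADER_MIDDLEWARES_BASE` (check that
setting for the exact value in your Scrapy version), any lower order will place
yours closer to the engine::

    DOWNLOADER_MIDDLEWARES = {
        # 350 < 400, so this (hypothetical) middleware sits closer to the
        # engine and its process_request() runs before the user agent
        # middleware's.
        'myproject.middlewares.CustomDownloaderMiddleware': 350,
    }
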
If you want to disable a built-in middleware (the ones defined in
:setting:`DOWNLOADER_MIDDLEWARES_BASE` and enabled by default) you must define
it in your project :setting:`DOWNLOADER_MIDDLEWARES` setting and assign ``None``
as its value. For example, if you want to disable the user agent middleware::

    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.CustomDownloaderMiddleware': 543,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    }

Finally, keep in mind that some middlewares may need to be enabled through a
particular setting. See each middleware's documentation for more info.
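
For example, the built-in HTTP cache middleware is typically inactive until its
own settings are configured, even though it is listed in
:setting:`DOWNLOADER_MIDDLEWARES_BASE`. A sketch of a project ``settings.py``
(the exact setting names depend on your Scrapy version, so check the
middleware's documentation)::

    # settings.py -- sketch only; check the HTTP cache middleware docs for
    # the setting that enables it in your Scrapy version.
    HTTPCACHE_DIR = 'httpcache'   # storage directory for cached responses
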
Writing your own downloader middleware
======================================

Writing your own downloader middleware is easy. Each middleware component is a
single Python class that defines one or more of the following methods:

.. method:: process_request(request, spider)

   ``request`` is a :class:`~scrapy.http.Request` object

   ``spider`` is a :class:`~scrapy.spider.BaseSpider` object

   This method is called for each request that goes through the download
   middleware.

   ``process_request()`` should return either ``None``, a
   :class:`~scrapy.http.Response` object, or a :class:`~scrapy.http.Request`
   object.

   If it returns ``None``, Scrapy will continue processing this request,
   executing all other middlewares until, finally, the appropriate downloader
   handler is called, the request performed and its response downloaded.

   If it returns a :class:`~scrapy.http.Response` object, Scrapy won't bother
   calling any other request or exception middleware, or the appropriate
   download function; it'll return that response. The response middleware is
   always called on every response.

   If it returns a :class:`~scrapy.http.Request` object, the returned request
   will be re-scheduled (in the Scheduler) to be downloaded in the future. The
   callback of the original request will always be called. If the new request
   has a callback it will be called with the response downloaded, and the
   output of that callback will then be passed to the original callback. If
   the new request doesn't have a callback, the response downloaded will just
   be passed to the original request callback.

   If it raises an :exception:`IgnoreRequest` exception, the entire request
   will be dropped completely and its callback never called.
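
   For example, a minimal sketch of a ``process_request()`` hook that tags
   every outgoing request with a hypothetical header and then lets it continue
   by returning ``None``::

       class CustomHeaderMiddleware(object):
           """Hypothetical middleware: marks requests before they are downloaded."""

           def process_request(self, request, spider):
               # Mutate the request in place, then return None so Scrapy keeps
               # passing it through the remaining middlewares and, finally,
               # the downloader itself.
               request.headers.setdefault('X-Crawled-By', 'myproject')
               return None

   Returning a :class:`~scrapy.http.Response` object here instead would
   short-circuit the download, as described above.
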
.. method:: process_response(request, response, spider)

   ``request`` is a :class:`~scrapy.http.Request` object

   ``response`` is a :class:`~scrapy.http.Response` object

   ``spider`` is a :class:`~scrapy.spider.BaseSpider` object

   ``process_response()`` should return a :class:`~scrapy.http.Response`
   object or raise an :exception:`IgnoreRequest` exception.

   If it returns a :class:`~scrapy.http.Response` (it could be the same
   response given, or a brand-new one), that response will continue to be
   processed with the ``process_response()`` of the next middleware in the
   pipeline.

   If it raises an :exception:`IgnoreRequest` exception, the response will be
   dropped completely and its callback never called.
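
   For example, a minimal sketch of a ``process_response()`` hook that drops
   responses with certain hypothetical status codes and passes everything else
   along unchanged (the import path of :exception:`IgnoreRequest` may differ
   between Scrapy versions)::

       from scrapy.core.exceptions import IgnoreRequest  # scrapy.exceptions in newer versions

       class DropStatusMiddleware(object):
           """Hypothetical middleware: discards responses we don't want to process."""

           DROP_CODES = (404, 410)  # hypothetical list of unwanted status codes

           def process_response(self, request, response, spider):
               if response.status in self.DROP_CODES:
                   # Raising IgnoreRequest drops the response; its callback never runs.
                   raise IgnoreRequest("dropping %s" % response.url)
               # Otherwise hand the (unchanged) response to the next middleware.
               return response
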
.. method:: process_download_exception(request, exception, spider)

   ``request`` is a :class:`~scrapy.http.Request` object

   ``exception`` is an ``Exception`` object

   ``spider`` is a :class:`~scrapy.spider.BaseSpider` object

   Scrapy calls ``process_download_exception()`` when a download handler or a
   ``process_request()`` (from a downloader middleware) raises an exception.

   ``process_download_exception()`` should return either ``None``, a
   :class:`~scrapy.http.Response` or a :class:`~scrapy.http.Request` object.

   If it returns ``None``, Scrapy will continue processing this exception,
   executing any other exception middleware, until no middleware is left and
   the default exception handling kicks in.

   If it returns a :class:`~scrapy.http.Response` object, the response
   middleware kicks in, and no other exception middleware is called.

   If it returns a :class:`~scrapy.http.Request` object, the returned request
   is used to instruct an immediate redirection. The redirection is handled
   inside the middleware scope, and the original request won't finish until
   the redirected request is completed. This stops the execution of the
   ``process_download_exception()`` middlewares, just as returning a Response
   would.
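
   For example, a minimal sketch of a ``process_download_exception()`` hook
   that retries a failed request once, using a hypothetical ``retried`` meta
   key to avoid retrying forever::

       class RetryOnceMiddleware(object):
           """Hypothetical middleware: re-issues a request after a download error."""

           def process_download_exception(self, request, exception, spider):
               if request.meta.get('retried'):
                   # Already retried once: return None so the remaining
                   # exception middlewares (and the default handling) get a
                   # chance to deal with the error.
                   return None
               # Returning a Request redirects to it immediately; the original
               # request won't finish until this new one completes.
               retry = request.copy()
               retry.meta['retried'] = True
               retry.dont_filter = True  # don't let the duplicates filter discard the retry
               return retry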