.. _topics-downloader-middleware:

=====================
Downloader Middleware
=====================

The downloader middleware is a framework of hooks into Scrapy's
request/response processing. It's a light, low-level system for globally
altering Scrapy's requests and responses.

.. _topics-downloader-middleware-setting:

Activating a downloader middleware
==================================

To activate a downloader middleware component, add it to the
:setting:`DOWNLOADER_MIDDLEWARES` setting, which is a dict whose keys are the
middleware class paths and whose values are the middleware orders.

Here's an example::

    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.CustomDownloaderMiddleware': 543,
    }

The :setting:`DOWNLOADER_MIDDLEWARES` setting is merged with the
:setting:`DOWNLOADER_MIDDLEWARES_BASE` setting defined in Scrapy (and not meant
to be overridden) and then sorted by order to get the final list of enabled
middlewares: the first middleware is the one closest to the engine and the last
is the one closest to the downloader.

To decide which order to assign to your middleware, see the
:setting:`DOWNLOADER_MIDDLEWARES_BASE` setting and pick a value according to
where you want to insert the middleware. The order matters because each
middleware performs a different action, and your middleware could depend on
some previous (or subsequent) middleware being applied.

If you want to disable a built-in middleware (one of those defined in
:setting:`DOWNLOADER_MIDDLEWARES_BASE` and enabled by default), you must define
it in your project's :setting:`DOWNLOADER_MIDDLEWARES` setting and assign
``None`` as its value. For example, if you want to disable the user-agent
middleware::

    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.CustomDownloaderMiddleware': 543,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    }

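The merge, disable, and sort rules described in this section can be sketched in
plain Python. This is only an illustration under stated assumptions, not
Scrapy's actual implementation: the ``BASE`` dict, its ``example.builtin.*``
class paths, its order values, and the ``build_middleware_list()`` helper are
all hypothetical.

```python
# Hypothetical stand-in for DOWNLOADER_MIDDLEWARES_BASE; the real defaults differ.
BASE = {
    "example.builtin.RobotsMiddleware": 100,
    "example.builtin.UserAgentMiddleware": 400,
    "example.builtin.RetryMiddleware": 500,
}

# Project-level setting: add one middleware and disable one built-in.
PROJECT = {
    "myproject.middlewares.CustomDownloaderMiddleware": 543,
    "example.builtin.UserAgentMiddleware": None,
}


def build_middleware_list(base, project):
    """Merge both dicts, drop entries set to None, and sort by order."""
    merged = dict(base)
    merged.update(project)  # project values take precedence over base values
    enabled = {path: order for path, order in merged.items() if order is not None}
    # Lowest order first: closest to the engine; highest: closest to the downloader.
    return sorted(enabled, key=enabled.get)


middlewares = build_middleware_list(BASE, PROJECT)
for path in middlewares:
    print(path)
```

With these made-up values, the custom middleware (order 543) ends up after the
hypothetical retry middleware (order 500), and the disabled entry is absent
from the final list.
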
Finally, keep in mind that some middlewares may need to be enabled through a
particular setting. See each middleware's documentation for more info.

Writing your own downloader middleware
======================================

Writing your own downloader middleware is easy. Each middleware component is a
single Python class that defines one or more of the following methods:

.. method:: process_request(request, spider)

    ``request`` is a :class:`~scrapy.http.Request` object.
    ``spider`` is a :class:`~scrapy.spider.BaseSpider` object.

    This method is called for each request that goes through the downloader
    middleware.

    ``process_request()`` should return either ``None``, a
    :class:`~scrapy.http.Response` object, or a :class:`~scrapy.http.Request`
    object.

    If it returns ``None``, Scrapy will continue processing this request,
    executing all other middlewares until, finally, the appropriate downloader
    handler is called and the request performed (and its response downloaded).

    If it returns a :class:`~scrapy.http.Response` object, Scrapy won't call
    any other request or exception middleware, or the appropriate download
    function; it returns that response directly. Response middleware is always
    called on every response.

    If it returns a :class:`~scrapy.http.Request` object, the returned request
    will be re-scheduled (in the Scheduler) to be downloaded in the future. The
    callback of the original request will always be called. If the new request
    has a callback, it will be called with the downloaded response, and the
    output of that callback will then be passed to the original callback. If
    the new request doesn't have a callback, the downloaded response will just
    be passed to the original request's callback.

    If it raises an :exception:`IgnoreRequest` exception, the entire request
    will be dropped completely and its callback never called.

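The four possible outcomes of ``process_request()`` can be illustrated with a
minimal sketch. To keep it runnable without Scrapy installed, ``Request``,
``Response``, and ``IgnoreRequest`` below are simplified stand-ins for the real
classes, and ``ExampleDownloaderMiddleware``, its cache, and the example URLs
are hypothetical.

```python
class Request:
    """Simplified stand-in for scrapy.http.Request (assumption)."""

    def __init__(self, url):
        self.url = url


class Response:
    """Simplified stand-in for scrapy.http.Response (assumption)."""

    def __init__(self, url, body=b""):
        self.url = url
        self.body = body


class IgnoreRequest(Exception):
    """Simplified stand-in for Scrapy's IgnoreRequest exception (assumption)."""


class ExampleDownloaderMiddleware:
    """Hypothetical middleware that exercises every process_request() outcome."""

    def __init__(self):
        # Hypothetical in-memory cache keyed by URL.
        self.cache = {"http://example.com/cached": b"cached body"}

    def process_request(self, request, spider):
        if request.url.endswith(".exe"):
            # Drop the request entirely; its callback is never called.
            raise IgnoreRequest("binary downloads are skipped")
        if request.url in self.cache:
            # Short-circuit the download by returning a ready-made response.
            return Response(request.url, body=self.cache[request.url])
        if "old.example.com" in request.url:
            # Hand back a different request to be scheduled instead.
            new_url = request.url.replace("old.example.com", "new.example.com")
            return Request(new_url)
        # Let the remaining middlewares and the download handler proceed.
        return None


middleware = ExampleDownloaderMiddleware()
outcome = middleware.process_request(Request("http://example.com/page"), spider=None)
print(outcome)
```

In a real project the stand-in classes would be replaced by the ones Scrapy
provides, and the middleware would be enabled through
:setting:`DOWNLOADER_MIDDLEWARES`.
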
.. method:: process_response(request, response, spider)

    ``request`` is a :class:`~scrapy.http.Request` object.
    ``response`` is a :class:`~scrapy.http.Response` object.
    ``spider`` is a :class:`~scrapy.spider.BaseSpider` object.

    ``process_response()`` should return a :class:`~scrapy.http.Response`
    object or raise an :exception:`IgnoreRequest` exception.

    If it returns a :class:`~scrapy.http.Response` (it could be the same given
    response, or a brand-new one), that response will continue to be processed
    with the ``process_response()`` of the next middleware in the pipeline.

    If it raises an :exception:`IgnoreRequest` exception, the response will be
    dropped completely and its callback never called.

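The way a response travels from one ``process_response()`` to the next can be
sketched as a simple loop. This is an illustration under stated assumptions,
not Scrapy's internals: ``Response`` and ``IgnoreRequest`` are simplified
stand-ins, and both middleware classes and the ``run_response_chain()`` driver
are hypothetical.

```python
class IgnoreRequest(Exception):
    """Simplified stand-in for Scrapy's IgnoreRequest exception (assumption)."""


class Response:
    """Simplified stand-in for scrapy.http.Response (assumption)."""

    def __init__(self, url, status=200, body=b""):
        self.url = url
        self.status = status
        self.body = body


class StripWhitespaceMiddleware:
    """Hypothetical middleware that returns a brand-new response."""

    def process_response(self, request, response, spider):
        return Response(response.url, response.status, response.body.strip())


class DropNotFoundMiddleware:
    """Hypothetical middleware that drops 404 responses."""

    def process_response(self, request, response, spider):
        if response.status == 404:
            raise IgnoreRequest("dropping 404 for %s" % response.url)
        return response  # the same response is passed along unchanged


def run_response_chain(middlewares, request, response, spider=None):
    """Hypothetical driver: feed each returned response to the next middleware."""
    for mw in middlewares:
        response = mw.process_response(request, response, spider)
    return response


chain = [StripWhitespaceMiddleware(), DropNotFoundMiddleware()]
ok = run_response_chain(chain, None, Response("http://example.com", 200, b"  hi  "))
print(ok.body)
```

A response with status 404 passed through the same chain would raise
``IgnoreRequest`` from the second middleware, so no response would reach the
callback.
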
.. method:: process_download_exception(request, exception, spider)

    ``request`` is a :class:`~scrapy.http.Request` object.
    ``exception`` is an ``Exception`` object.
    ``spider`` is a :class:`~scrapy.spider.BaseSpider` object.

    Scrapy calls ``process_download_exception()`` when a download handler or a
    ``process_request()`` (from a downloader middleware) raises an exception.

    ``process_download_exception()`` should return either ``None``, a
    :class:`~scrapy.http.Response` object, or a :class:`~scrapy.http.Request`
    object.

    If it returns ``None``, Scrapy will continue processing this exception,
    executing any other exception middleware, until no middleware is left and
    the default exception handling kicks in.

    If it returns a :class:`~scrapy.http.Response` object, the response
    middleware kicks in, and no other exception middleware is called.

    If it returns a :class:`~scrapy.http.Request` object, the returned request
    is used to trigger an immediate redirection. The redirection is handled
    inside the middleware scope, and the original request won't finish until
    the redirected request is completed. This stops the
    ``process_download_exception()`` middleware chain, just as returning a
    :class:`~scrapy.http.Response` would.

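The exception-handling rules above (``None`` continues the chain, while a
Response or a Request stops it) can be sketched as follows. This is only an
illustration, not Scrapy's internals: ``Request`` and ``Response`` are
simplified stand-ins, the three middleware classes and the
``handle_download_exception()`` driver are hypothetical, and re-raising the
exception stands in for the default exception handling.

```python
class Request:
    """Simplified stand-in for scrapy.http.Request (assumption)."""

    def __init__(self, url):
        self.url = url


class Response:
    """Simplified stand-in for scrapy.http.Response (assumption)."""

    def __init__(self, url, status=200):
        self.url = url
        self.status = status


class CountErrorsMiddleware:
    """Hypothetical middleware that only observes the exception."""

    def __init__(self):
        self.errors = 0

    def process_download_exception(self, request, exception, spider):
        self.errors += 1
        return None  # let the next exception middleware run


class MirrorOnTimeoutMiddleware:
    """Hypothetical middleware that redirects timed-out requests to a mirror."""

    def process_download_exception(self, request, exception, spider):
        if isinstance(exception, TimeoutError):
            return Request(request.url.replace("example.com", "mirror.example.com"))
        return None


class PlaceholderResponseMiddleware:
    """Hypothetical middleware that answers any remaining error with a 503."""

    def process_download_exception(self, request, exception, spider):
        return Response(request.url, status=503)


def handle_download_exception(middlewares, request, exception, spider=None):
    """Hypothetical driver: stop at the first middleware that returns non-None."""
    for mw in middlewares:
        result = mw.process_download_exception(request, exception, spider)
        if result is not None:
            return result
    raise exception  # stand-in for the default exception handling


counter = CountErrorsMiddleware()
chain = [counter, MirrorOnTimeoutMiddleware(), PlaceholderResponseMiddleware()]
result = handle_download_exception(chain, Request("http://example.com/page"), TimeoutError())
print(type(result).__name__, result.url)
```

Here the timeout is turned into a new request by the second middleware, so the
third one is never consulted for that exception.
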