.. _topics-request-response:

======================
Requests and Responses
======================

.. module:: scrapy.http
   :synopsis: Request and Response classes

Scrapy uses :class:`Request` and :class:`Response` objects for crawling web
sites.

Typically, :class:`Request` objects are generated in the spiders and pass
across the system until they reach the Downloader, which executes the request
and returns a :class:`Response` object which travels back to the spider that
issued the request.

Both :class:`Request` and :class:`Response` classes have subclasses which add
functionality not required in the base classes. These are described
below in :ref:`topics-request-response-ref-request-subclasses` and
:ref:`topics-request-response-ref-response-subclasses`.


Request objects
===============

.. class:: Request(url[, callback, method='GET', body, headers, cookies, meta, encoding='utf-8', priority=0.0, dont_filter=False, errback])

    A :class:`Request` object represents an HTTP request, which is usually
    generated in the Spider and executed by the Downloader, thus generating
    a :class:`Response`.


    :param url: the URL of this request
    :type url: string

    :param callback: the function that will be called with the response of this
        request (once it's downloaded) as its first parameter. For more
        information see
        :ref:`topics-request-response-ref-request-callback-arguments` below.
    :type callback: callable

    :param method: the HTTP method of this request. Defaults to ``'GET'``.
    :type method: string

    :param meta: the initial values for the :attr:`Request.meta` attribute. If
        given, the dict passed in this parameter will be shallow copied.
    :type meta: dict

    :param body: the request body. If a ``unicode`` is passed, then it's encoded
        to ``str`` using the ``encoding`` passed (which defaults to ``utf-8``).
        If ``body`` is not given, an empty string is stored. Regardless of the
        type of this argument, the final value stored will be a ``str`` (never
        ``unicode`` or ``None``).
    :type body: str or unicode

    :param headers: the headers of this request. The dict values can be strings
        (for single valued headers) or lists (for multi-valued headers).
    :type headers: dict

    :param cookies: the request cookies. Example::

            request_with_cookies = Request(url="http://www.example.com",
                                           cookies={'currency': 'USD', 'country': 'UY'})

        When some site returns cookies (in a response) those are stored in the
        cookies for that domain and will be sent again in future requests. That's
        the typical behaviour of any regular web browser. However, if, for some
        reason, you want to avoid merging with existing cookies you can instruct
        Scrapy to do so by setting the ``dont_merge_cookies`` item in the
        :attr:`Request.meta`.

        Example of request without merging cookies::

            request_with_cookies = Request(url="http://www.example.com",
                                           cookies={'currency': 'USD', 'country': 'UY'},
                                           meta={'dont_merge_cookies': True})

    :type cookies: dict

    :param encoding: the encoding of this request (defaults to ``'utf-8'``).
        This encoding will be used to percent-encode the URL and to convert the
        body to ``str`` (if given as ``unicode``).
    :type encoding: string

    :param priority: the priority of this request (defaults to ``0.0``).
        The priority is used by the scheduler to define the order used to return
        requests. It can also be used to feed priorities externally, for
        example, using an offline long-term scheduler.
    :type priority: int or float

    :param dont_filter: indicates that this request should not be filtered by
        the scheduler. This is used when you want to perform an identical
        request multiple times, to ignore the duplicates filter. Use it with
        care, or you will get into crawling loops. Defaults to ``False``.
    :type dont_filter: boolean

    :param errback: a function that will be called if any exception was
        raised while processing the request. This includes pages that failed
        with 404 HTTP errors and such. It receives a `Twisted Failure`_ instance
        as its first parameter.
    :type errback: callable

.. _Twisted Failure: http://twistedmatrix.com/documents/8.2.0/api/twisted.python.failure.Failure.html

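
The interaction between the duplicates filter and ``dont_filter`` described
above can be pictured with a small standalone sketch. This is not Scrapy's
scheduler code: ``SeenFilter`` and ``should_schedule`` are hypothetical names,
and the filter here keys on the raw URL only, which is an assumption made for
brevity.

```python
class SeenFilter:
    """Hypothetical duplicates filter that remembers every URL it lets through."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        # Requests flagged with dont_filter bypass the duplicates check.
        if dont_filter:
            return True
        # An already seen URL is dropped.
        if url in self.seen:
            return False
        self.seen.add(url)
        return True


dupefilter = SeenFilter()
first = dupefilter.should_schedule("http://www.example.com/a")
second = dupefilter.should_schedule("http://www.example.com/a")
forced = dupefilter.should_schedule("http://www.example.com/a", dont_filter=True)
print(first, second, forced)
```

Because a forced request is never dropped, a callback that keeps re-issuing
the same URL with ``dont_filter=True`` would loop indefinitely, which is the
crawling loop the parameter description warns about.
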

.. attribute:: Request.url

    A string containing the URL of this request. Keep in mind that this
    attribute contains the escaped URL, so it can differ from the URL passed in
    the constructor.

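
To see why the stored URL can differ from the one passed in, here is a rough
sketch of percent-encoding with a chosen encoding. It uses Python 3's
``urllib.parse.quote`` and a hypothetical helper named ``escape_url``; the set
of characters left unescaped is an assumption, and Scrapy's own escaping
routine may differ.

```python
from urllib.parse import quote


def escape_url(url, encoding="utf-8"):
    # Encode the text first, then percent-encode every byte that is not
    # an unreserved character or one of the delimiters listed in `safe`.
    return quote(url.encode(encoding), safe=":/?&=#%+")


raw = "http://www.example.com/caf\u00e9 menu"
escaped_utf8 = escape_url(raw)
escaped_latin1 = escape_url(raw, encoding="latin-1")
print(escaped_utf8)
print(escaped_latin1)
```

The same input yields different escaped URLs depending on the encoding, which
is why the ``encoding`` constructor argument matters for non-ASCII URLs.
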

.. attribute:: Request.method

    A string representing the HTTP method in the request. This is guaranteed to
    be uppercase. Example: ``"GET"``, ``"POST"``, ``"PUT"``, etc.

.. attribute:: Request.headers

    A dictionary-like object which contains the request headers.

.. attribute:: Request.body

    A str that contains the request body.

.. attribute:: Request.meta

    A dict that contains arbitrary metadata for this request. This dict is
    empty for new Requests, and is usually populated by different Scrapy
    components (extensions, middlewares, etc). So the data contained in this
    dict depends on the extensions you have enabled.

    This dict is `shallow copied`_ when the request is cloned using the
    ``copy()`` or ``replace()`` methods.

.. _shallow copied: http://docs.python.org/library/copy.html

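
What a shallow copy means in practice can be shown with a plain dict and the
standard ``copy`` module. The keys used here (``depth``, ``tags``) are made up
for illustration and are not keys that Scrapy defines.

```python
import copy

meta = {"depth": 1, "tags": ["a"]}
cloned = copy.copy(meta)

# Rebinding a top-level key only affects the clone.
cloned["depth"] = 2

# Nested objects are shared, so mutating one is visible through both dicts.
cloned["tags"].append("b")

print(meta)
print(cloned)
```

If a cloned request needs fully independent nested values, they have to be
copied explicitly (for example with ``copy.deepcopy``) before being modified.
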

.. attribute:: Request.cache

    A dict that contains arbitrary cached data for this request. This dict is
    empty for new Requests, and is usually populated by different Scrapy
    components (extensions, middlewares, etc) to avoid duplicate processing. So
    the data contained in this dict depends on the extensions you have enabled.

    Unlike the ``meta`` attribute, this dict is not copied at all when the
    request is cloned using the ``copy()`` or ``replace()`` methods.

.. method:: Request.copy()

    Return a new Request which is a copy of this Request. The attribute
    :attr:`Request.meta` is copied, while :attr:`Request.cache` is not. See
    also :ref:`topics-request-response-ref-request-callback-arguments`.

.. method:: Request.replace([url, callback, method, headers, body, cookies, meta, encoding, dont_filter])

    Return a Request object with the same members, except for those members
    given new values by whichever keyword arguments are specified. The
    attribute :attr:`Request.meta` is copied by default (unless a new value
    is given in the ``meta`` argument). The :attr:`Request.cache` attribute
    is always cleared. See also
    :ref:`topics-request-response-ref-request-callback-arguments`.


.. _topics-request-response-ref-callback-copy:

Caveats with copying Requests and callbacks
-------------------------------------------

When you copy a request using the :meth:`Request.copy` or
:meth:`Request.replace` methods, the callback of the request is not copied by
default. This is because of legacy reasons along with limitations in the
underlying network library, which doesn't allow sharing `Twisted deferreds`_.

.. _Twisted deferreds: http://twistedmatrix.com/projects/core/documentation/howto/defer.html

For example::

    request = Request("http://www.example.com", callback=myfunc)
    request2 = request.copy() # doesn't copy the callback
    request3 = request.replace(callback=request.callback)

In the above example, ``request2`` is a copy of ``request`` but it has no
callback, while ``request3`` is a copy of ``request`` and also contains the
callback.

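
The behaviour described in this section can be mimicked with a toy class.
``MiniRequest`` below is a hypothetical stand-in written only to make the
semantics concrete; it is not how :class:`Request` is implemented and it
models just three attributes.

```python
class MiniRequest:
    """Toy stand-in for Request, modelling url, callback and meta only."""

    def __init__(self, url, callback=None, meta=None):
        self.url = url
        self.callback = callback
        # The constructor shallow copies the meta dict it is given.
        self.meta = dict(meta) if meta else {}

    def replace(self, **kwargs):
        # url and meta carry over unless overridden; the callback does not.
        kwargs.setdefault("url", self.url)
        kwargs.setdefault("meta", self.meta)
        return MiniRequest(**kwargs)

    def copy(self):
        return self.replace()


def myfunc(response):
    return response


request = MiniRequest("http://www.example.com", callback=myfunc, meta={"k": 1})
request2 = request.copy()
request3 = request.replace(callback=request.callback)
print(request2.callback, request3.callback is myfunc)
```

Passing ``callback=request.callback`` explicitly, as ``request3`` does, is the
way to keep the callback on the clone.
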

.. _topics-request-response-ref-request-callback-arguments:

Passing arguments to callback functions
---------------------------------------

The callback of a request is a function that will be called when the response
of that request is downloaded. The callback function will be called with the
:class:`Response` object downloaded as its first argument.

Example::

    def parse_page1(self, response):
        request = Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)
        return [request]

    def parse_page2(self, response):
        # this would log http://www.example.com/some_page.html
        self.log("Visited %s" % response.url)

In some cases you may be interested in passing arguments to those callback
functions so you can receive those arguments later, when the response is
downloaded. There are two ways of doing this:

1. using a lambda function (or any other function/callable)

2. using the :attr:`Request.meta` attribute.

Here's an example of logging the referer URL of each page using each mechanism.
(Keep in mind, however, that the referer URL could be accessed more easily via
``response.request.url``.)

Using lambda function::

    def parse_page1(self, response):
        myarg = response.url
        request = Request("http://www.example.com/some_page.html",
                          callback=lambda r: self.parse_page2(r, myarg))
        return [request]

    def parse_page2(self, response, referer_url):
        self.log("Visited page %s from %s" % (response.url, referer_url))

Using Request.meta::

    def parse_page1(self, response):
        request = Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)
        request.meta['referer_url'] = response.url
        return [request]

    def parse_page2(self, response):
        referer_url = response.request.meta['referer_url']
        self.log("Visited page %s from %s" % (response.url, referer_url))

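
The first mechanism accepts any callable, so ``functools.partial`` from the
standard library is one alternative to a lambda. The sketch below is
standalone Python 3: the callback takes a plain URL string instead of a
:class:`Response`, purely so that it runs without Scrapy. It also shows a
pitfall that appears when lambdas are created in a loop.

```python
from functools import partial


def parse_page2(response_url, referer_url):
    return "Visited page %s from %s" % (response_url, referer_url)


referers = ["http://www.example.com/a", "http://www.example.com/b"]

# Lambdas look up `ref` when they are called, so both see its last value.
with_lambda = [lambda url: parse_page2(url, ref) for ref in referers]

# partial() stores the argument value at creation time.
with_partial = [partial(parse_page2, referer_url=ref) for ref in referers]

page = "http://www.example.com/some_page.html"
lambda_results = [cb(page) for cb in with_lambda]
partial_results = [cb(page) for cb in with_partial]
print(lambda_results)
print(partial_results)
```

The lambda in the example above is safe because ``myarg`` is assigned once per
call of ``parse_page1``; the late-binding issue only arises when several
callbacks share one loop variable.
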

.. _topics-request-response-ref-request-subclasses:

Request subclasses
==================

Here is the list of built-in :class:`Request` subclasses. You can also subclass
it to implement your own custom functionality.

FormRequest objects
-------------------

The FormRequest class extends the base :class:`Request` with functionality for
dealing with HTML forms. It uses the `ClientForm`_ library (bundled with
Scrapy) to pre-populate form fields with form data from :class:`Response`
objects.

.. _ClientForm: http://wwwsearch.sourceforge.net/ClientForm/

.. class:: FormRequest(url, [formdata, ...])

    The :class:`FormRequest` class adds a new argument to the constructor. The
    remaining arguments are the same as for the :class:`Request` class and are
    not documented here.

    :param formdata: is a dictionary (or iterable of (key, value) tuples)
        containing HTML Form data which will be url-encoded and assigned to the
        body of the request.
    :type formdata: dict or iterable of tuples

The :class:`FormRequest` objects support the following class method in
addition to the standard :class:`Request` methods:

.. classmethod:: FormRequest.from_response(response, [formnumber=0, formdata, ...])

    Returns a new :class:`FormRequest` object with its form field values
    pre-populated with those found in the HTML ``<form>`` element contained
    in the given response. For an example see
    :ref:`topics-request-response-ref-request-userlogin`.

    :param response: the response containing an HTML form which will be used
        to pre-populate the form fields
    :type response: :class:`Response` object

    :param formnumber: the index of the form to use, when the response contains
        multiple forms. The first one (and also the default) is ``0``.
    :type formnumber: integer

    :param formdata: fields to override in the form data. If a field was
        already present in the response ``<form>`` element, its value is
        overridden by the one passed in this parameter.
    :type formdata: dict

    The other parameters of this class method are passed directly to the
    :class:`FormRequest` constructor.


Request usage examples
----------------------

Using FormRequest to send data via HTTP POST
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to simulate an HTML Form POST in your spider, and send a couple of
key-value fields, you could return a :class:`FormRequest` object (from your
spider) like this::

    return [FormRequest(url="http://www.example.com/post/action",
                        formdata={'name': 'John Doe', 'age': '27'},
                        callback=self.after_post)]

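
For a sense of what "url-encoded" form data looks like once it becomes the
request body, the standard library's ``urllib.parse.urlencode`` (Python 3)
produces that format. Whether :class:`FormRequest` uses this exact function
internally is not stated here, so treat the sketch as an illustration of the
encoding rather than of the implementation.

```python
from urllib.parse import urlencode

# A dict and an iterable of (key, value) tuples encode to the same body.
body_from_dict = urlencode({'name': 'John Doe', 'age': '27'})
body_from_pairs = urlencode([('name', 'John Doe'), ('age', '27')])
print(body_from_dict)
```

The tuple form is useful when the same field name has to appear more than once
or when field order matters.
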

.. _topics-request-response-ref-request-userlogin:

Using FormRequest.from_response() to simulate a user login
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usual for web sites to provide pre-populated form fields through
``<input type="hidden">`` elements, such as session related data or
authentication tokens (for login pages). When scraping, you'll want these
fields to be automatically pre-populated and only override a couple of them,
such as the user name and password. You can use the
:meth:`FormRequest.from_response` method for this job. Here's an example spider
which uses it::

    class LoginSpider(BaseSpider):
        domain_name = 'example.com'
        start_urls = ['http://www.example.com/users/login.php']

        def parse(self, response):
            return [FormRequest.from_response(response,
                        formdata={'username': 'john', 'password': 'secret'},
                        callback=self.after_login)]

        def after_login(self, response):
            # check that login succeeded before going on
            if "authentication failed" in response.body:
                self.log("Login failed", level=log.ERROR)
                return

            # continue scraping with authenticated session...

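
The idea of pre-populating fields from the page and then overriding a few of
them can be sketched with the standard library alone. This is not what
``from_response`` does internally (the text above says it relies on
ClientForm); ``InputCollector`` and the sample markup are hypothetical, and the
sketch targets Python 3.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Hypothetical helper collecting name/value pairs of <input> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs = dict(attrs)
            if "name" in attrs:
                # Inputs without a value attribute default to an empty string.
                self.fields[attrs["name"]] = attrs.get("value") or ""


html = """
<form action="/users/login.php" method="post">
  <input type="hidden" name="session_token" value="abc123">
  <input type="text" name="username" value="">
  <input type="password" name="password">
</form>
"""

collector = InputCollector()
collector.feed(html)

# Start from the values found in the page, then apply the overrides.
formdata = dict(collector.fields)
formdata.update({'username': 'john', 'password': 'secret'})
print(formdata)
```

The hidden ``session_token`` value survives untouched while the two visible
fields take the supplied values, which mirrors the override rule documented
for the ``formdata`` parameter.
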

Response objects
================

.. class:: Response(url, [status=200, headers, body, meta, flags])

    A :class:`Response` object represents an HTTP response, which is usually
    downloaded (by the Downloader) and fed to the Spiders for processing.

    :param url: the URL of this response
    :type url: string

    :param headers: the headers of this response. The dict values can be strings
        (for single valued headers) or lists (for multi-valued headers).
    :type headers: dict

    :param status: the HTTP status of the response. Defaults to ``200``.
    :type status: integer

    :param body: the response body. It must be str, not unicode, unless you're
        using an encoding-aware :ref:`Response subclass
        <topics-request-response-ref-response-subclasses>`, such as
        :class:`TextResponse`.
    :type body: str

    :param meta: the initial values for the :attr:`Response.meta` attribute. If
        given, the dict will be shallow copied.
    :type meta: dict

    :param flags: is a list containing the initial values for the
        :attr:`Response.flags` attribute. If given, the list will be shallow
        copied.
    :type flags: list


.. attribute:: Response.url

    A string containing the URL of the response.

.. attribute:: Response.status

    An integer representing the HTTP status of the response. Example: ``200``,
    ``404``.

.. attribute:: Response.headers

    A dictionary-like object which contains the response headers.

.. attribute:: Response.body

    A str containing the body of this Response. Keep in mind that
    ``Response.body`` is always a str. If you want the unicode version use
    :meth:`TextResponse.body_as_unicode` (only available in
    :class:`TextResponse` and subclasses).

.. attribute:: Response.request

    The :class:`Request` object that generated this response. This attribute is
    assigned in the Scrapy engine, after the response and request have passed
    through all :ref:`Downloader Middlewares <topics-downloader-middleware>`.
    In particular, this means that:

    - HTTP redirections will cause the original request (to the URL before
      redirection) to be assigned to the redirected response (with the final
      URL after redirection).

    - ``Response.request.url`` doesn't always equal ``Response.url``.

    - This attribute is only available in the spider code, and in the
      :ref:`Spider Middlewares <topics-spider-middleware>`, but not in
      Downloader Middlewares (although you have the Request available there by
      other means) and handlers of the :signal:`response_downloaded` signal.


.. attribute:: Response.meta

    A dict that contains arbitrary metadata for this response, similar to the
    :attr:`Request.meta` attribute. See the :attr:`Request.meta` attribute for
    more info.

.. attribute:: Response.flags

    A list that contains flags for this response. Flags are labels used for
    tagging Responses, for example ``'cached'``, ``'redirected'``, etc. They
    are shown in the string representation of the Response (the ``__str__``
    method), which is used by the engine for logging.

.. attribute:: Response.cache

    A dict that contains arbitrary cached data for this response, similar to
    the :attr:`Request.cache` attribute. See the :attr:`Request.cache`
    attribute for more info.

.. method:: Response.copy()

    Return a new Response which is a copy of this Response. The attribute
    :attr:`Response.meta` is copied, while :attr:`Response.cache` is not.

.. method:: Response.replace([url, status, headers, body, meta, flags, cls])

    Return a Response object with the same members, except for those members
    given new values by whichever keyword arguments are specified. The
    attribute :attr:`Response.meta` is copied by default (unless a new value
    is given in the ``meta`` argument). The :attr:`Response.cache`
    attribute is always cleared.


.. _topics-request-response-ref-response-subclasses:

Response subclasses
===================

Here is the list of available built-in Response subclasses. You can also
subclass the Response class to implement your own functionality.

TextResponse objects
--------------------

.. class:: TextResponse(url, [encoding[, ...]])

    :class:`TextResponse` objects add encoding capabilities to the base
    :class:`Response` class, which is meant to be used only for binary data,
    such as images, sounds or any media file.

    :class:`TextResponse` objects support a new constructor argument, in
    addition to those of the base :class:`Response` class. The remaining
    functionality is the same as for the :class:`Response` class and is not
    documented here.

    :param encoding: is a string which contains the encoding to use for this
        response. If you create a :class:`TextResponse` object with a unicode
        body it will be encoded using this encoding (remember the body attribute
        is always a string). If ``encoding`` is ``None`` (default value), the
        encoding will be looked up in the response headers and body instead.
    :type encoding: string

:class:`TextResponse` objects support the following attributes in addition
to the standard :class:`Response` ones:

.. attribute:: TextResponse.encoding

    A string with the encoding of this response. The encoding is resolved in the
    following order:

    1. the encoding passed in the constructor ``encoding`` argument

    2. the encoding declared in the Content-Type HTTP header

    3. the encoding declared in the response body. The TextResponse class
       doesn't provide any special functionality for this. However, the
       :class:`HtmlResponse` and :class:`XmlResponse` classes do.

    4. the encoding inferred by looking at the response body. This is the most
       fragile method but also the last one tried.

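
The resolution order listed above can be sketched as a plain function. The
function name, its parameters and the fixed ``fallback`` value are all
assumptions made for illustration: real inference from the body (step 4) is
replaced here by a constant, and this is not the code :class:`TextResponse`
runs.

```python
import re


def resolve_encoding(explicit=None, content_type=None, declared_in_body=None,
                     fallback="utf-8"):
    # 1. The encoding passed to the constructor wins.
    if explicit:
        return explicit
    # 2. Next, a charset declared in the Content-Type header.
    if content_type:
        match = re.search(r'charset="?([\w-]+)', content_type, re.I)
        if match:
            return match.group(1)
    # 3. Then an encoding declared inside the body (found by a subclass).
    if declared_in_body:
        return declared_in_body
    # 4. Last resort; a constant stands in for inference from the body.
    return fallback


print(resolve_encoding("latin-1", "text/html; charset=utf-8"))
print(resolve_encoding(content_type="text/html; charset=ISO-8859-1"))
print(resolve_encoding(content_type="text/html", declared_in_body="cp1252"))
print(resolve_encoding())
```

Each step is consulted only when the earlier ones yield nothing, so an
explicit constructor argument always overrides what the server declares.
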

:class:`TextResponse` objects support the following methods in addition to
the standard :class:`Response` ones:

.. method:: TextResponse.headers_encoding()

    Returns a string with the encoding declared in the headers (i.e. the
    Content-Type HTTP header).

.. method:: TextResponse.body_encoding()

    Returns a string with the encoding of the body, either declared or inferred
    from its contents. The body encoding declaration is implemented in
    :class:`TextResponse` subclasses such as :class:`HtmlResponse` or
    :class:`XmlResponse`.

.. method:: TextResponse.body_as_unicode()

    Returns the body of the response as unicode. This is equivalent to::

        response.body.decode(response.encoding)

    But **not** equivalent to::

        unicode(response.body)

    Since, in the latter case, you would be using your system default encoding
    (typically ``ascii``) to convert the body to unicode, instead of the
    response encoding.

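
The difference between decoding with the response encoding and decoding with
an ASCII default can be reproduced without Scrapy. The sketch is written for
Python 3, where ``bytes`` plays the role of the ``str`` type described in this
document and ``str`` plays the role of ``unicode``; the sample body is made up.

```python
# Bytes as they might arrive from a server that uses Latin-1.
body = "caf\u00e9".encode("latin-1")

# Decoding with the response encoding recovers the text.
decoded = body.decode("latin-1")

# Decoding with ASCII fails on the non-ASCII byte.
try:
    body.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(decoded, ascii_ok)
```
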

HtmlResponse objects
--------------------

.. class:: HtmlResponse(url[, ...])

    The :class:`HtmlResponse` class is a subclass of :class:`TextResponse`
    which adds encoding auto-discovering support by looking into the HTML
    `meta http-equiv`_ attribute. See :attr:`TextResponse.encoding`.

.. _meta http-equiv: http://www.w3schools.com/TAGS/att_meta_http_equiv.asp

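
As a rough idea of what looking into the ``meta http-equiv`` attribute can
involve, the sketch below searches the start of an HTML body with a regular
expression. The pattern, the 1024-byte window and the function name are
assumptions for illustration; :class:`HtmlResponse` may detect the declaration
differently. The code targets Python 3 and works on ``bytes``.

```python
import re

# Matches <meta http-equiv="Content-Type" ... charset=NAME ...>
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+http-equiv=["']?content-type["']?[^>]*charset=([\w-]+)""",
    re.I)


def html_declared_encoding(body):
    # Only the beginning of the document is inspected.
    match = META_CHARSET_RE.search(body[:1024])
    return match.group(1).decode("ascii") if match else None


html_body = (b'<html><head><meta http-equiv="Content-Type" '
             b'content="text/html; charset=iso-8859-1"></head></html>')
found = html_declared_encoding(html_body)
missing = html_declared_encoding(b"<html><head></head></html>")
print(found, missing)
```
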

XmlResponse objects
-------------------

.. class:: XmlResponse(url[, ...])

    The :class:`XmlResponse` class is a subclass of :class:`TextResponse` which
    adds encoding auto-discovering support by looking into the XML declaration
    line. See :attr:`TextResponse.encoding`.

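
Reading the encoding from an XML declaration line can be sketched in the same
spirit. The regular expression and the function name are illustrative
assumptions rather than the :class:`XmlResponse` implementation, and the code
targets Python 3.

```python
import re

# Matches <?xml version="1.0" encoding="NAME"?> at the start of the body.
XML_DECL_RE = re.compile(rb"""^<\?xml[^>]*encoding=["']([\w.-]+)["']""")


def xml_declared_encoding(body):
    match = XML_DECL_RE.search(body)
    return match.group(1).decode("ascii") if match else None


declared = xml_declared_encoding(
    b'<?xml version="1.0" encoding="ISO-8859-1"?><root/>')
undeclared = xml_declared_encoding(b'<?xml version="1.0"?><root/>')
print(declared, undeclared)
```

When the declaration carries no ``encoding`` attribute the function returns
``None``, leaving the decision to the later steps of the resolution order.
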