Request.cb_kwargs: Update docs
parent e8af6331b5
commit 8fb077694f
@@ -28,16 +28,15 @@ Consider the following scrapy spider below::

         item = MyItem()
         # populate `item` fields
         # and extract item_details_url
-        yield scrapy.Request(item_details_url, self.parse_details, meta={'item': item})
+        yield scrapy.Request(item_details_url, self.parse_details, cb_kwargs={'item': item})

-    def parse_details(self, response):
-        item = response.meta['item']
+    def parse_details(self, response, item):
         # populate more `item` fields
         return item

 Basically this is a simple spider which parses two pages of items (the
 start_urls). Items also have a details page with additional information, so we
-use the ``meta`` functionality of :class:`~scrapy.http.Request` to pass a
+use the ``cb_kwargs`` functionality of :class:`~scrapy.http.Request` to pass a
 partially populated item.

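As a point of reference for the change above, here is a minimal, self-contained sketch of the ``cb_kwargs`` pattern the updated example documents; the spider name, URLs and CSS selectors are placeholders, not part of the patch::

    import scrapy

    class DetailsSpider(scrapy.Spider):
        name = 'details'
        start_urls = ['http://example.com/items']       # placeholder listing page

        def parse(self, response):
            for href in response.css('a.item::attr(href)').getall():
                item = {'url': response.urljoin(href)}  # partially populated item
                # hand the item to the next callback as a keyword argument
                yield scrapy.Request(item['url'], self.parse_details,
                                     cb_kwargs={'item': item})

        def parse_details(self, response, item):
            # the item arrives as a plain argument; no response.meta lookup needed
            item['title'] = response.css('h1::text').get()
            return item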
@@ -100,8 +99,7 @@ Fortunately, the :command:`shell` is your bread and butter in this case (see

     from scrapy.shell import inspect_response

-    def parse_details(self, response):
-        item = response.meta.get('item', None)
+    def parse_details(self, response, item=None):
         if item:
             # populate more `item` fields
             return item
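A side effect of defaulting ``item`` to ``None``, as the new signature does, is that the callback can be invoked directly on a hand-built response (from the shell or a quick test) without going through a crawl. A small sketch, with a made-up spider, URL and body::

    from scrapy import Spider
    from scrapy.http import HtmlResponse

    class DebugSpider(Spider):
        name = 'debug-example'                 # hypothetical spider

        def parse_details(self, response, item=None):
            if item:
                item['title'] = response.css('h1::text').get()
                return item

    # hand-built response; URL and body are invented for the example
    response = HtmlResponse(url='http://example.com/details/1',
                            body=b'<html><body><h1>Some product</h1></body></html>',
                            encoding='utf-8')

    spider = DebugSpider()
    spider.parse_details(response)                               # works without an item
    spider.parse_details(response, item={'url': response.url})   # and with one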
@@ -134,8 +132,7 @@ Logging is another useful option for getting information about your spider run.
 Although not as convenient, it comes with the advantage that the logs will be
 available in all future runs should they be necessary again::

-    def parse_details(self, response):
-        item = response.meta.get('item', None)
+    def parse_details(self, response, item=None):
         if item:
             # populate more `item` fields
             return item

@@ -81,7 +81,8 @@ So, for example, this won't work::

     def some_callback(self, response):
         somearg = 'test'
-        return scrapy.Request('http://www.example.com', callback=lambda r: self.other_callback(r, somearg))
+        return scrapy.Request('http://www.example.com',
+                              callback=lambda r: self.other_callback(r, somearg))

     def other_callback(self, response, somearg):
         print("the argument passed is: %s" % somearg)
@@ -90,10 +91,10 @@ But this will::

     def some_callback(self, response):
         somearg = 'test'
-        return scrapy.Request('http://www.example.com', callback=self.other_callback, meta={'somearg': somearg})
+        return scrapy.Request('http://www.example.com',
+                              callback=self.other_callback, cb_kwargs={'somearg': somearg})

-    def other_callback(self, response):
-        somearg = response.meta['somearg']
+    def other_callback(self, response, somearg):
         print("the argument passed is: %s" % somearg)

 If you wish to log the requests that couldn't be serialized, you can set the
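The two hunks above are from the section on serializing requests for persistent (pause/resume) crawls: a request whose callback is a spider method and whose extra data sits in a plain ``cb_kwargs`` dict can be written to disk and resumed, while a lambda closure cannot. The snippet below is not Scrapy's serialization code, just a plain-``pickle`` illustration of why the anonymous callback is the part that breaks::

    import pickle

    cb_kwargs = {'somearg': 'test'}
    pickle.dumps(cb_kwargs)              # plain data round-trips without trouble

    callback = lambda r: None            # stand-in for the lambda in the hunk above
    try:
        pickle.dumps(callback)           # anonymous functions have no importable name
    except Exception as exc:
        print("cannot serialize the lambda:", exc)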
@@ -27,10 +27,11 @@ Common causes of memory leaks

 It happens quite often (sometimes by accident, sometimes on purpose) that the
 Scrapy developer passes objects referenced in Requests (for example, using the
-:attr:`~scrapy.http.Request.meta` attribute or the request callback function)
-and that effectively bounds the lifetime of those referenced objects to the
-lifetime of the Request. This is, by far, the most common cause of memory leaks
-in Scrapy projects, and a quite difficult one to debug for newcomers.
+:attr:`~scrapy.http.Request.cb_kwargs` or :attr:`~scrapy.http.Request.meta`
+attributes or the request callback function) and that effectively bounds the
+lifetime of those referenced objects to the lifetime of the Request. This is,
+by far, the most common cause of memory leaks in Scrapy projects, and a quite
+difficult one to debug for newcomers.

 In big projects, the spiders are typically written by different people and some
 of those spiders could be "leaking" and thus affecting the rest of the other
@@ -48,7 +49,8 @@ Too Many Requests?

 By default Scrapy keeps the request queue in memory; it includes
 :class:`~scrapy.http.Request` objects and all objects
-referenced in Request attributes (e.g. in :attr:`~scrapy.http.Request.meta`).
+referenced in Request attributes (e.g. in :attr:`~scrapy.http.Request.cb_kwargs`
+and :attr:`~scrapy.http.Request.meta`).
 While not necessarily a leak, this can take a lot of memory. Enabling
 :ref:`persistent job queue <topics-jobs>` could help keeping memory usage
 in control.
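For context, the persistent queue mentioned in this hunk is enabled simply by giving the crawl a job directory; the spider name and path below are arbitrary::

    import scrapy

    class BigCrawlSpider(scrapy.Spider):
        name = 'bigcrawl'                                  # hypothetical spider
        # equivalent to running: scrapy crawl bigcrawl -s JOBDIR=crawls/bigcrawl-1
        custom_settings = {'JOBDIR': 'crawls/bigcrawl-1'}

        def parse(self, response):
            pass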
@@ -101,7 +103,7 @@ Let's see a concrete example of a hypothetical case of memory leaks.
 Suppose we have some spider with a line similar to this one::

     return Request("http://www.somenastyspider.com/product.php?pid=%d" % product_id,
-                   callback=self.parse, meta={referer: response})
+                   callback=self.parse, cb_kwargs={'referer': response})

 That line is passing a response reference inside a request which effectively
 ties the response lifetime to the requests' one, and that would definitely
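One common way to avoid the leak this example describes is to pass only the piece of data the callback actually needs (here, the referring URL) rather than the whole response, so nothing keeps the old response alive. The spider below is a made-up illustration built around the same hypothetical URL::

    from scrapy import Request, Spider

    class ProductSpider(Spider):               # hypothetical spider
        name = 'products'

        def parse_listing(self, response):
            product_id = 123                   # stand-in for a value scraped from the page
            # pass just the referring URL, not the Response object, so the old
            # response can be garbage-collected once this callback returns
            yield Request("http://www.somenastyspider.com/product.php?pid=%d" % product_id,
                          callback=self.parse,
                          cb_kwargs={'referer_url': response.url})

        def parse(self, response, referer_url):
            self.logger.info("came from %s", referer_url)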
@@ -186,12 +186,12 @@ Request objects
       Return a new Request which is a copy of this Request. See also:
       :ref:`topics-request-response-ref-request-callback-arguments`.

-   .. method:: Request.replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback])
+   .. method:: Request.replace([url, method, headers, body, cookies, meta, flags, encoding, priority, dont_filter, callback, errback, cb_kwargs])

       Return a Request object with the same members, except for those members
       given new values by whichever keyword arguments are specified. The
-      attribute :attr:`Request.meta` is copied by default (unless a new value
-      is given in the ``meta`` argument). See also
+      :attr:`Request.cb_kwargs` and :attr:`Request.meta` attributes are copied by default
+      (unless new values are given as arguments). See also
       :ref:`topics-request-response-ref-request-callback-arguments`.

 .. _topics-request-response-ref-request-callback-arguments:
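A short sketch of what the widened ``Request.replace`` signature and the copy-by-default behaviour described above look like in practice; the URLs, the callback and the ``meta`` key are invented for the example::

    from scrapy import Request

    def parse_item(response, item=None):       # placeholder callback
        return item

    original = Request('http://www.example.com/item/1',
                       callback=parse_item,
                       cb_kwargs={'item': {'id': 1}},
                       meta={'note': 'hypothetical'})

    # only the URL changes; cb_kwargs and meta are carried over to the copy
    retry = original.replace(url='http://www.example.com/item/1?retry=1')
    assert retry.cb_kwargs == original.cb_kwargs
    assert retry.meta == original.meta

    # passing cb_kwargs explicitly overrides the copied value
    other = original.replace(cb_kwargs={'item': {'id': 2}})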
@@ -237,11 +237,10 @@ The following example shows how to achieve this by using the

 .. caution:: :attr:`Request.cb_kwargs` was introduced in version ``1.7``.
    Prior to that, :attr:`Request.meta` was the recommended option for passing
-   information around callbacks. However, after ``1.7`` :attr:`Request.cb_kwargs`
+   information around callbacks. However, after ``1.7``, using :attr:`Request.cb_kwargs`
    became the preferred way of passing user information, leaving :attr:`Request.meta`
-   to be used by internal components like spider or downloader middlewares.
-   The following example, which uses :attr:`Request.meta`, is only kept for historical
-   reasons.
+   to be populated by internal components like spider or downloader middlewares.
+   The following :attr:`Request.meta` example is only kept for historical reasons.

 ::
