.. _topics-loaders:

============
Item Loaders
============

.. module:: scrapy.contrib.loader
   :synopsis: Item Loader class

Item Loaders provide a convenient mechanism for populating scraped
:ref:`Items <topics-items>`. Even though Items can be populated using their
own dictionary-like API, Item Loaders provide a much more convenient API for
populating them from a scraping process, by automating some common tasks like
parsing the raw extracted data before assigning it.

In other words, :ref:`Items <topics-items>` provide the *container* of
scraped data, while Item Loaders provide the mechanism for *populating* that
container.

Item Loaders are designed to provide a flexible, efficient and easy mechanism
for extending and overriding different field parsing rules, either by spider
or by source format (HTML, XML, etc.), without becoming a nightmare to
maintain.

Using Item Loaders to populate items
====================================

To use an Item Loader, you must first instantiate it. You can either
instantiate it with a dict-like object (e.g. Item or dict) or without one, in
which case an Item is automatically instantiated in the Item Loader
constructor, using the Item class specified in the
:attr:`ItemLoader.default_item_class` attribute.

Then, you start collecting values into the Item Loader, typically using
:ref:`XPath Selectors <topics-selectors>`. You can add more than one value to
the same item field; the Item Loader will know how to "join" those values
later using a proper processing function.

Here is a typical Item Loader usage in a :ref:`Spider <topics-spiders>`, using
the :ref:`Product item <topics-items-declaring>` declared in the :ref:`Items
chapter <topics-items>`::

    from scrapy.contrib.loader import XPathItemLoader
    from myproject.items import Product

    def parse(self, response):
        l = XPathItemLoader(item=Product(), response=response)
        l.add_xpath('name', '//div[@class="product_name"]')
        l.add_xpath('name', '//div[@class="product_title"]')
        l.add_xpath('price', '//p[@id="price"]')
        l.add_xpath('stock', '//p[@id="stock"]')
        l.add_value('last_updated', 'today') # you can also use literal values
        return l.load_item()

By quickly looking at that code, we can see the ``name`` field is being
extracted from two different XPath locations in the page:

1. ``//div[@class="product_name"]``
2. ``//div[@class="product_title"]``

In other words, data is being collected by extracting it from two XPath
locations, using the :meth:`~XPathItemLoader.add_xpath` method. This is the
data that will be assigned to the ``name`` field later.

Afterwards, similar calls are used for the ``price`` and ``stock`` fields, and
finally the ``last_updated`` field is populated directly with a literal value
(``today``) using a different method: :meth:`~ItemLoader.add_value`.

Finally, when all data is collected, the :meth:`ItemLoader.load_item` method
is called, which returns the item populated with the data previously extracted
and collected through the :meth:`~XPathItemLoader.add_xpath` and
:meth:`~ItemLoader.add_value` calls.

.. _topics-loaders-processors:

Input and Output processors
===========================

An Item Loader contains one input processor and one output processor for each
(item) field. The input processor processes the extracted data as soon as it's
received (through the :meth:`~XPathItemLoader.add_xpath` or
:meth:`~ItemLoader.add_value` methods) and the result of the input processor
is collected and kept inside the Item Loader. After collecting all data, the
:meth:`ItemLoader.load_item` method is called to populate and return the
populated :class:`~scrapy.item.Item` object. That's when the output processor
is called with the data previously collected (and processed using the input
processor). The result of the output processor is the final value that gets
assigned to the item.

Let's see an example to illustrate how the input and output processors are
called for a particular field (the same applies for any other field)::

    l = XPathItemLoader(Product(), some_xpath_selector)
    l.add_xpath('name', xpath1) # (1)
    l.add_xpath('name', xpath2) # (2)
    l.add_value('name', 'test') # (3)
    return l.load_item() # (4)

So what happens is:

1. Data from ``xpath1`` is extracted, and passed through the *input processor*
   of the ``name`` field. The result of the input processor is collected and
   kept in the Item Loader (but not yet assigned to the item).

2. Data from ``xpath2`` is extracted, and passed through the same *input
   processor* used in (1). The result of the input processor is appended to
   the data collected in (1) (if any).

3. This case is similar to the previous ones, except that the value to be
   collected is assigned directly, instead of being extracted from an XPath.
   However, the value is still passed through the input processors. In this
   case, since the value is not iterable, it is converted to an iterable of a
   single element before passing it to the input processor, because input
   processors always receive iterables.

4. The data collected in (1) and (2) is passed through the *output processor*
   of the ``name`` field. The result of the output processor is the value
   assigned to the ``name`` field in the item.

It's worth noticing that processors are just callable objects, which are
called with the data to be parsed, and return a parsed value. So you can use
any function as an input or output processor. The only requirement is that it
must accept one (and only one) positional argument, which will be an iterable.

.. note:: Both input and output processors must receive an iterable as their
   first argument. The output of those functions can be anything. The result
   of input processors will be appended to an internal list (in the Loader)
   containing the collected values (for that field). The result of the output
   processors is the value that will be finally assigned to the item.

The other thing you need to keep in mind is that the values returned by input
processors are collected internally (in lists) and then passed to output
processors to populate the fields.

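For example, here is a minimal sketch of a custom processor (the function name
and logic are invented for this example, not part of Scrapy) that could be
used as an output processor for a numeric field::

    def to_int(values):
        # receives the iterable of collected values; may return anything
        for value in values:
            return int(value)   # parse the first collected value as an integer
        return None             # no values were collected
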
Last, but not least, Scrapy comes with some :ref:`commonly used processors
<topics-loaders-available-processors>` built-in for convenience.

Declaring Item Loaders
======================

Item Loaders are declared like Items, by using a class definition syntax. Here
is an example::

    from scrapy.contrib.loader import ItemLoader
    from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join

    class ProductLoader(ItemLoader):

        default_output_processor = TakeFirst()

        name_in = MapCompose(unicode.title)
        name_out = Join()

        price_in = MapCompose(unicode.strip)

        # ...

As you can see, input processors are declared using the ``_in`` suffix while
output processors are declared using the ``_out`` suffix. You can also
declare default input/output processors using the
:attr:`ItemLoader.default_input_processor` and
:attr:`ItemLoader.default_output_processor` attributes.

.. _topics-loaders-processors-declaring:

Declaring Input and Output Processors
=====================================

As seen in the previous section, input and output processors can be declared
in the Item Loader definition, and it's very common to declare input
processors this way. However, there is one more place where you can specify
the input and output processors to use: in the :ref:`Item Field
<topics-items-fields>` metadata. Here is an example::

    from scrapy.item import Item, Field
    from scrapy.contrib.loader.processor import MapCompose, Join, TakeFirst

    from scrapy.utils.markup import remove_entities
    from myproject.utils import filter_prices

    class Product(Item):
        name = Field(
            input_processor=MapCompose(remove_entities),
            output_processor=Join(),
        )
        price = Field(
            default=0,
            input_processor=MapCompose(remove_entities, filter_prices),
            output_processor=TakeFirst(),
        )

The precedence order, for both input and output processors, is as follows:

1. Item Loader field-specific attributes: ``field_in`` and ``field_out``
   (highest precedence)
2. Field metadata (``input_processor`` and ``output_processor`` keys)
3. Item Loader defaults: :attr:`ItemLoader.default_input_processor` and
   :attr:`ItemLoader.default_output_processor` (lowest precedence)

See also: :ref:`topics-loaders-extending`.

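To illustrate that precedence order, here is a brief sketch (the ``Identity``
choice and field names are made up for the example)::

    from scrapy.item import Item, Field
    from scrapy.contrib.loader import ItemLoader
    from scrapy.contrib.loader.processor import Identity, Join, TakeFirst

    class Product(Item):
        name = Field(output_processor=Join())   # 2nd precedence

    class ProductLoader(ItemLoader):
        default_output_processor = TakeFirst()  # 3rd precedence
        name_out = Identity()                   # 1st precedence: wins

Here the ``name`` field would use ``Identity()`` as its output processor,
while any other field would fall back to the field metadata and, failing that,
to ``default_output_processor``.
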
.. _topics-loaders-context:

Item Loader Context
===================

The Item Loader Context is a dict of arbitrary key/values which is shared
among all input and output processors in the Item Loader. It can be passed
when declaring, instantiating or using an Item Loader, and it is used to
modify the behaviour of the input/output processors.

For example, suppose you have a function ``parse_length`` which receives a
text value and extracts a length from it::

    def parse_length(text, loader_context):
        unit = loader_context.get('unit', 'm')
        # ... length parsing code goes here ...
        return parsed_length

By accepting a ``loader_context`` argument, the function is explicitly telling
the Item Loader that it is able to receive an Item Loader context, so the Item
Loader passes the currently active context when calling it, and the processor
function (``parse_length`` in this case) can thus use it.

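For illustration, here is one way the elided parsing code could look (the
regular expression and numeric conversion are assumptions made for this
sketch, not part of Scrapy)::

    import re

    def parse_length(text, loader_context):
        unit = loader_context.get('unit', 'm')
        # e.g. u'100 cm' -> 100.0 when unit is 'cm' (illustrative only)
        match = re.search(r'([\d.]+)\s*%s' % unit, text)
        return float(match.group(1)) if match else None
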
There are several ways to modify Item Loader context values:

1. By modifying the currently active Item Loader context
   (:attr:`~ItemLoader.context` attribute)::

       loader = ItemLoader(product)
       loader.context['unit'] = 'cm'

2. On Item Loader instantiation (the keyword arguments of the Item Loader
   constructor are stored in the Item Loader context)::

       loader = ItemLoader(product, unit='cm')

3. On Item Loader declaration, for those input/output processors that support
   instantiating them with an Item Loader context.
   :class:`~processor.MapCompose` is one of them::

       class ProductLoader(ItemLoader):
           length_out = MapCompose(parse_length, unit='cm')

ItemLoader objects
==================

.. class:: ItemLoader([item], \**kwargs)

    Return a new Item Loader for populating the given Item. If no item is
    given, one is instantiated automatically using the class in
    :attr:`default_item_class`.

    The item and the remaining keyword arguments are assigned to the Loader
    context (accessible through the :attr:`context` attribute).

    .. method:: get_value(value, \*processors, \**kwargs)

        Process the given ``value`` by the given ``processors`` and keyword
        arguments.

        Available keyword arguments:

        :param re: a regular expression to use for extracting data from the
            given value using the :func:`~scrapy.utils.misc.extract_regex`
            function, applied before processors
        :type re: str or compiled regex

        Examples::

            >>> from scrapy.contrib.loader.processor import TakeFirst
            >>> loader.get_value(u'name: foo', TakeFirst(), unicode.upper, re='name: (.+)')
            'FOO'

    .. method:: add_value(field_name, value, \*processors, \**kwargs)

        Process and then add the given ``value`` for the given field.

        The value is first passed through :meth:`get_value` by giving the
        ``processors`` and ``kwargs``, and then passed through the
        :ref:`field input processor <topics-loaders-processors>` and its
        result appended to the data collected for that field. If the field
        already contains collected data, the new data is added.

        The given ``field_name`` can be ``None``, in which case values for
        multiple fields may be added. In that case the processed value should
        be a dict with field names mapped to values.

        Examples::

            loader.add_value('name', u'Color TV')
            loader.add_value('colours', [u'white', u'blue'])
            loader.add_value('length', u'100')
            loader.add_value('name', u'name: foo', TakeFirst(), re='name: (.+)')
            loader.add_value(None, {'name': u'foo', 'sex': u'male'})

    .. method:: replace_value(field_name, value)

        Similar to :meth:`add_value` but replaces the collected data with the
        new value instead of adding it.

    .. method:: load_item()

        Populate the item with the data collected so far, and return it. The
        data collected is first passed through the :ref:`output processors
        <topics-loaders-processors>` to get the final value to assign to each
        item field.

    .. method:: get_collected_values(field_name)

        Return the collected values for the given field.

    .. method:: get_output_value(field_name)

        Return the collected values parsed using the output processor, for
        the given field. This method doesn't populate or modify the item at
        all.

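        For example, assuming the default (identity) input processor and a
        ``Join()`` output processor for the ``name`` field (a hypothetical
        setup, for illustration only)::

            >>> loader.add_value('name', u'Color')
            >>> loader.add_value('name', u'TV')
            >>> loader.get_collected_values('name')
            [u'Color', u'TV']
            >>> loader.get_output_value('name')
            u'Color TV'
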
    .. method:: get_input_processor(field_name)

        Return the input processor for the given field.

    .. method:: get_output_processor(field_name)

        Return the output processor for the given field.

    .. attribute:: item

        The :class:`~scrapy.item.Item` object being parsed by this Item
        Loader.

    .. attribute:: context

        The currently active :ref:`Context <topics-loaders-context>` of this
        Item Loader.

    .. attribute:: default_item_class

        An Item class (or factory), used to instantiate items when not given
        in the constructor.

    .. attribute:: default_input_processor

        The default input processor to use for those fields which don't
        specify one.

    .. attribute:: default_output_processor

        The default output processor to use for those fields which don't
        specify one.

.. class:: XPathItemLoader([item, selector, response], \**kwargs)

    The :class:`XPathItemLoader` class extends the :class:`ItemLoader` class
    providing more convenient mechanisms for extracting data from web pages
    using :ref:`XPath selectors <topics-selectors>`.

    :class:`XPathItemLoader` objects accept two additional parameters in
    their constructor:

    :param selector: The selector to extract data from, when using the
        :meth:`add_xpath` or :meth:`replace_xpath` method.
    :type selector: :class:`~scrapy.selector.XPathSelector` object

    :param response: The response used to construct the selector using the
        :attr:`default_selector_class`, unless the selector argument is
        given, in which case this argument is ignored.
    :type response: :class:`~scrapy.http.Response` object

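    For instance, a loader could be constructed in either of these two ways
    (a brief sketch; the XPath and item are made up for the example)::

        # from a response: a selector is created automatically using
        # default_selector_class
        loader = XPathItemLoader(item=Product(), response=response)

        # from an already-constructed selector (a response passed along
        # with it would be ignored)
        div = HtmlXPathSelector(response).select('//div[@class="product"]')[0]
        loader = XPathItemLoader(item=Product(), selector=div)
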
    .. method:: get_xpath(xpath, \*processors, \**kwargs)

        Similar to :meth:`ItemLoader.get_value` but receives an XPath instead
        of a value, which is used to extract a list of unicode strings from
        the selector associated with this :class:`XPathItemLoader`.

        :param xpath: the XPath to extract data from
        :type xpath: str

        :param re: a regular expression to use for extracting data from the
            selected XPath region
        :type re: str or compiled regex

        Examples::

            # HTML snippet: <p class="product-name">Color TV</p>
            loader.get_xpath('//p[@class="product-name"]')
            # HTML snippet: <p id="price">the price is $1200</p>
            loader.get_xpath('//p[@id="price"]', TakeFirst(), re='the price is (.*)')

    .. method:: add_xpath(field_name, xpath, \*processors, \**kwargs)

        Similar to :meth:`ItemLoader.add_value` but receives an XPath instead
        of a value, which is used to extract a list of unicode strings from
        the selector associated with this :class:`XPathItemLoader`.

        See :meth:`get_xpath` for ``kwargs``.

        :param xpath: the XPath to extract data from
        :type xpath: str

        Examples::

            # HTML snippet: <p class="product-name">Color TV</p>
            loader.add_xpath('name', '//p[@class="product-name"]')
            # HTML snippet: <p id="price">the price is $1200</p>
            loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')

    .. method:: replace_xpath(field_name, xpath, \*processors, \**kwargs)

        Similar to :meth:`add_xpath` but replaces collected data instead of
        adding it.

    .. attribute:: default_selector_class

        The class used to construct the :attr:`selector` of this
        :class:`XPathItemLoader`, if only a response is given in the
        constructor. If a selector is given in the constructor this attribute
        is ignored. This attribute is sometimes overridden in subclasses.

    .. attribute:: selector

        The :class:`~scrapy.selector.XPathSelector` object to extract data
        from. It's either the selector given in the constructor or one
        created from the response given in the constructor using the
        :attr:`default_selector_class`. This attribute is meant to be
        read-only.

.. _topics-loaders-extending:

Reusing and extending Item Loaders
==================================

As your project grows bigger and acquires more and more spiders, maintenance
becomes a fundamental problem, especially when you have to deal with many
different parsing rules for each spider, with a lot of exceptions, while also
wanting to reuse the common processors.

Item Loaders are designed to ease the maintenance burden of parsing rules
without losing flexibility and, at the same time, providing a convenient
mechanism for extending and overriding them. For this reason Item Loaders
support traditional Python class inheritance for dealing with differences of
specific spiders (or groups of spiders).

Suppose, for example, that some particular site encloses their product names
in three dashes (i.e. ``---Plasma TV---``) and you don't want to end up
scraping those dashes in the final product names.

Here's how you can remove those dashes by reusing and extending the default
Product Item Loader (``ProductLoader``)::

    from scrapy.contrib.loader.processor import MapCompose
    from myproject.ItemLoaders import ProductLoader

    def strip_dashes(x):
        return x.strip('-')

    class SiteSpecificLoader(ProductLoader):
        name_in = MapCompose(strip_dashes, ProductLoader.name_in)

Another case where extending Item Loaders can be very helpful is when you
have multiple source formats, for example XML and HTML. In the XML version
you may want to remove ``CDATA`` occurrences. Here's an example of how to do
it::

    from scrapy.contrib.loader.processor import MapCompose
    from myproject.ItemLoaders import ProductLoader
    from myproject.utils.xml import remove_cdata

    class XmlProductLoader(ProductLoader):
        name_in = MapCompose(remove_cdata, ProductLoader.name_in)

And that's how you typically extend input processors.

As for output processors, it is more common to declare them in the field
metadata, as they usually depend only on the field and not on each specific
site parsing rule (as input processors do). See also:
:ref:`topics-loaders-processors-declaring`.

There are many other possible ways to extend, inherit and override your Item
Loaders, and different Item Loader hierarchies may fit better for different
projects. Scrapy only provides the mechanism; it doesn't impose any specific
organization of your Loaders collection. That's up to you and your project's
needs.

.. _topics-loaders-available-processors:

Available built-in processors
=============================

.. module:: scrapy.contrib.loader.processor
   :synopsis: A collection of processors to use with Item Loaders

Even though you can use any callable function as input and output processors,
Scrapy provides some commonly used processors, which are described below.
Some of them, like :class:`MapCompose` (which is typically used as an input
processor), compose the output of several functions executed in order to
produce the final parsed value.

Here is a list of all built-in processors:

.. class:: Identity

    The simplest processor, which doesn't do anything. It returns the
    original values unchanged. It doesn't receive any constructor arguments,
    nor does it accept Loader contexts.

    Example::

        >>> from scrapy.contrib.loader.processor import Identity
        >>> proc = Identity()
        >>> proc(['one', 'two', 'three'])
        ['one', 'two', 'three']

.. class:: TakeFirst

    Returns the first non-null/non-empty value from the values received, so
    it's typically used as an output processor for single-valued fields. It
    doesn't receive any constructor arguments, nor does it accept Loader
    contexts.

    Example::

        >>> from scrapy.contrib.loader.processor import TakeFirst
        >>> proc = TakeFirst()
        >>> proc(['', 'one', 'two', 'three'])
        'one'

.. class:: Join(separator=u' ')

    Returns the values joined with the separator given in the constructor,
    which defaults to ``u' '``. It doesn't accept Loader contexts.

    When using the default separator, this processor is equivalent to the
    function: ``u' '.join``

    Examples::

        >>> from scrapy.contrib.loader.processor import Join
        >>> proc = Join()
        >>> proc(['one', 'two', 'three'])
        u'one two three'
        >>> proc = Join('<br>')
        >>> proc(['one', 'two', 'three'])
        u'one<br>two<br>three'

.. class:: Compose(\*functions, \**default_loader_context)

    A processor which is constructed from the composition of the given
    functions. This means that the input value of this processor is passed to
    the first function, the result of that function is passed to the second
    function, and so on, until the last function returns the output value of
    this processor.

    By default, processing stops on a ``None`` value. This behaviour can be
    changed by passing the keyword argument ``stop_on_none=False``.

    Example::

        >>> from scrapy.contrib.loader.processor import Compose
        >>> proc = Compose(lambda v: v[0], str.upper)
        >>> proc(['hello', 'world'])
        'HELLO'

    Each function can optionally receive a ``loader_context`` parameter. For
    those which do, this processor will pass the currently active
    :ref:`Loader context <topics-loaders-context>` through that parameter.

    The keyword arguments passed in the constructor are used as the default
    Loader context values passed to each function call. However, the final
    Loader context values passed to functions are overridden with the
    currently active Loader context accessible through the
    :attr:`ItemLoader.context` attribute.

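    For example, here is a sketch of a context-aware function composed this
    way (``to_unit`` is invented for this illustration)::

        >>> def to_unit(value, loader_context):
        ...     return '%s %s' % (value, loader_context.get('unit', 'm'))
        ...
        >>> proc = Compose(lambda v: v[0], to_unit, unit='cm')
        >>> proc(['100'])
        '100 cm'
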
.. class:: MapCompose(\*functions, \**default_loader_context)

    A processor which is constructed from the composition of the given
    functions, similar to the :class:`Compose` processor. The difference with
    this processor is the way internal results are passed among functions,
    which is as follows:

    The input value of this processor is *iterated* and each element is
    passed to the first function, and the result of that function (for each
    element) is concatenated to construct a new iterable, which is then
    passed to the second function, and so on, until the last function is
    applied to each value of the list of values collected so far. The output
    values of the last function are concatenated together to produce the
    output of this processor.

    Each particular function can return a value or a list of values, which is
    flattened with the list of values returned by the same function applied
    to the other input values. The functions can also return ``None``, in
    which case the output of that function is ignored for further processing
    over the chain.

    This processor provides a convenient way to compose functions that only
    work with single values (instead of iterables). For this reason the
    :class:`MapCompose` processor is typically used as an input processor,
    since data is often extracted using the
    :meth:`~scrapy.selector.XPathSelector.extract` method of :ref:`selectors
    <topics-selectors>`, which returns a list of unicode strings.

    The example below should clarify how it works::

        >>> def filter_world(x):
        ...     return None if x == 'world' else x
        ...
        >>> from scrapy.contrib.loader.processor import MapCompose
        >>> proc = MapCompose(filter_world, unicode.upper)
        >>> proc([u'hello', u'world', u'this', u'is', u'scrapy'])
        [u'HELLO', u'THIS', u'IS', u'SCRAPY']

    As with the Compose processor, functions can receive Loader contexts, and
    constructor keyword arguments are used as default context values. See the
    :class:`Compose` processor for more info.