.. _topics-loaders:

============
Item Loaders
============

.. module:: scrapy.contrib.loader
   :synopsis: Item Loader class

Item Loaders provide a convenient mechanism for populating scraped :ref:`Items
<topics-newitems>`. Even though Items can be populated using their own
dictionary-like API, Item Loaders provide a much more convenient API for
populating them from a scraping process, by automating some common tasks like
parsing the raw extracted data before assigning it.

In other words, :ref:`Items <topics-newitems>` provide the *container* of
scraped data, while Item Loaders provide the mechanism for *populating* that
container.

Item Loaders are designed to provide a flexible, efficient and easy mechanism
for extending and overriding different field parsing rules, either per spider
or per source format (HTML, XML, etc.), without becoming a nightmare to
maintain.

Using Item Loaders to populate items
====================================

To use an Item Loader, you must first instantiate it. You can either
instantiate it with an Item object or without one, in which case an Item is
automatically instantiated in the Item Loader constructor using the Item class
specified in the :attr:`ItemLoader.default_item_class` attribute.

Then, you start collecting values into the Item Loader, typically using
:ref:`XPath Selectors <topics-selectors>`. You can add more than one value to
the same item field; the Item Loader will know how to "join" those values
later using a proper processing function.

Here is a typical Item Loader usage in a :ref:`Spider <topics-spiders>`, using
the :ref:`Product item <topics-newitems-declaring>` declared in the :ref:`Items
chapter <topics-newitems>`::

    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.xpath import HtmlXPathSelector
    from myproject.items import Product

    def parse(self, response):
        p = XPathItemLoader(item=Product(), response=response)
        p.add_xpath('name', '//div[@class="product_name"]')
        p.add_xpath('name', '//div[@class="product_title"]')
        p.add_xpath('price', '//p[@id="price"]')
        p.add_xpath('stock', '//p[@id="stock"]')
        p.add_value('last_updated', 'today') # you can also use literal values
        return p.populate_item()

By quickly looking at that code, we can see the ``name`` field is being
extracted from two different XPath locations in the page:

1. ``//div[@class="product_name"]``
2. ``//div[@class="product_title"]``

In other words, data is being collected by extracting it from two XPath
locations, using the :meth:`~XPathItemLoader.add_xpath` method. This is the
data that will be assigned to the ``name`` field later.

Afterwards, similar calls are used for the ``price`` and ``stock`` fields, and
finally the ``last_updated`` field is populated directly with a literal value
(``today``) using a different method: :meth:`~ItemLoader.add_value`.

Finally, when all data is collected, the :meth:`ItemLoader.populate_item`
method is called, which returns the item populated with the data previously
extracted and collected through the :meth:`~XPathItemLoader.add_xpath` and
:meth:`~ItemLoader.add_value` calls.

.. _topics-loaders-processors:

Input and Output processors
===========================

An Item Loader contains one input processor and one output processor for each
(item) field. The input processor processes the extracted data as soon as it's
received (through the :meth:`~XPathItemLoader.add_xpath` or
:meth:`~ItemLoader.add_value` methods), and the result of the input processor
is collected and kept inside the Item Loader. After collecting all data, the
:meth:`ItemLoader.populate_item` method is called to populate and get the
populated :class:`~scrapy.newitem.Item` object. That's when the output
processor is called with the data previously collected (and processed using
the input processor). The result of the output processor is the final value
that gets assigned to the item.

Let's see an example that illustrates how the input and output processors are
called for a particular field (the same applies to any other field)::

    p = XPathItemLoader(Product(), some_xpath_selector)
    p.add_xpath('name', xpath1) # (1)
    p.add_xpath('name', xpath2) # (2)
    return p.populate_item() # (3)

So what happens is:

1. Data from ``xpath1`` is extracted, and passed through the *input processor*
   of the ``name`` field. The result of the input processor is collected and
   kept in the Item Loader (but not yet assigned to the item).

2. Data from ``xpath2`` is extracted, and passed through the same *input
   processor* used in (1). The result of the input processor is appended to
   the data collected in (1) (if any).

3. The data collected in (1) and (2) is passed through the *output processor*
   of the ``name`` field. The result of the output processor is the value
   assigned to the ``name`` field in the item.

It's worth noticing that processors are just callable objects, which are
called with the data to be parsed, and return a parsed value. So you can use
any function as input or output processor, provided it accepts a single
(required) positional argument.

The other thing you need to keep in mind is that the values returned by input
processors are collected internally (in lists) and then passed to output
processors to populate the fields, so output processors should expect
iterables as input.

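To make this flow concrete, here is a minimal plain-Python sketch of the
collect-then-flush behaviour described above. It is illustrative only (the
names are hypothetical, not Scrapy internals): the input processor runs once
per incoming value, and the output processor runs once over the whole
collected list.

```python
# Hypothetical sketch of the collect/output flow (not Scrapy internals).
def input_processor(value):
    # called once per incoming value, as soon as it is received
    return value.strip()

def output_processor(values):
    # called once, with the whole list of collected values
    return u' '.join(values)

collected = []
for raw in [u'  hello ', u' world ']:
    collected.append(input_processor(raw))  # values collected in a list

final_value = output_processor(collected)   # value assigned to the item field
print(final_value)  # hello world
```
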
Last, but not least, Scrapy comes with some :ref:`commonly used processors
<topics-loaders-available-processors>` built-in for convenience.

Declaring Item Loaders
======================

Item Loaders are declared like Items, by using a class definition syntax. Here
is an example::

    from scrapy.contrib.loader import ItemLoader
    from scrapy.contrib.loader.processor import TakeFirst, ApplyConcat, Join

    class ProductLoader(ItemLoader):

        default_input_processor = TakeFirst()

        name_in = ApplyConcat(unicode.title)
        name_out = Join()

        price_in = ApplyConcat(unicode.strip)
        price_out = TakeFirst()

        # ...

As you can see, input processors are declared using the ``_in`` suffix while
output processors are declared using the ``_out`` suffix. You can also declare
default input/output processors using the
:attr:`ItemLoader.default_input_processor` and
:attr:`ItemLoader.default_output_processor` attributes.

.. _topics-loaders-processors-declaring:

Declaring Input and Output Processors
=====================================

As seen in the previous section, input and output processors can be declared
in the Item Loader definition, and it's very common to declare input
processors this way. However, there is one more place where you can specify
the input and output processors to use: in the :ref:`Item Field
<topics-newitems-fields>` metadata. Here is an example::

    from scrapy.newitem import Item, Field
    from scrapy.contrib.loader.processor import ApplyConcat, Join, TakeFirst

    from scrapy.utils.markup import remove_entities
    from myproject.utils import filter_prices

    class Product(Item):
        name = Field(
            input_processor=ApplyConcat(remove_entities),
            output_processor=Join(),
        )
        price = Field(
            default=0,
            input_processor=ApplyConcat(remove_entities, filter_prices),
            output_processor=TakeFirst(),
        )

The precedence order, for both input and output processors, is as follows:

1. Item Loader field-specific attributes: ``field_in`` and ``field_out``
   (highest precedence)
2. Field metadata (``input_processor`` and ``output_processor`` keys)
3. Item Loader defaults: :attr:`ItemLoader.default_input_processor` and
   :attr:`ItemLoader.default_output_processor` (lowest precedence)

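The lookup described above can be sketched in plain Python. This is a
hypothetical helper (the dict arguments stand in for the loader attributes,
the field metadata and the loader-wide default; it is not Scrapy's actual
resolution code):

```python
# Hypothetical sketch of the processor precedence rules (not Scrapy's code).
def resolve_processor(field_name, loader_attrs, field_meta, default):
    # 1. a loader attribute named "<field>_in" wins
    if field_name + '_in' in loader_attrs:
        return loader_attrs[field_name + '_in']
    # 2. then the "input_processor" key in the field metadata
    if 'input_processor' in field_meta:
        return field_meta['input_processor']
    # 3. finally the loader-wide default
    return default

upper = lambda values: [v.upper() for v in values]
default = lambda values: values

# a loader attribute beats field metadata and the default
proc = resolve_processor('name', {'name_in': upper}, {}, default)
print(proc(['tv']))  # ['TV']
```
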
See also: :ref:`topics-loaders-extending`.

.. _topics-loaders-context:

Item Loader Context
===================

The Item Loader Context is a dict of arbitrary key/values which is shared
among all input and output processors in the Item Loader. It can be passed
when declaring, instantiating or using an Item Loader, and can be used to
modify the behaviour of the input/output processors.

For example, suppose you have a function ``parse_length`` which receives a
text value and extracts a length from it::

    def parse_length(text, loader_context):
        unit = loader_context.get('unit', 'm')
        # ... length parsing code goes here ...
        return parsed_length

By accepting a ``loader_context`` argument, the function is explicitly telling
the Item Loader that it is able to receive an Item Loader context, so the Item
Loader passes the currently active context when calling it, and the processor
function (``parse_length`` in this case) can thus use it.

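A minimal sketch of how this opt-in mechanism could work, assuming the loader
inspects the callable's signature (illustrative only; ``call_processor`` and
``to_upper`` are hypothetical names, not Scrapy internals):

```python
import inspect

# Hypothetical sketch: pass the active context only to callables that
# declare a loader_context argument (not Scrapy's actual implementation).
def call_processor(func, value, context):
    params = inspect.signature(func).parameters
    if 'loader_context' in params:
        return func(value, loader_context=context)
    return func(value)

def parse_length(text, loader_context):
    unit = loader_context.get('unit', 'm')
    return u'%s %s' % (text, unit)

def to_upper(text):      # no loader_context: called with the value only
    return text.upper()

print(call_processor(parse_length, u'100', {'unit': 'cm'}))  # 100 cm
print(call_processor(to_upper, u'abc', {'unit': 'cm'}))      # ABC
```
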
There are several ways to modify Item Loader context values:

1. By modifying the currently active Item Loader context (the
   :attr:`ItemLoader.context` attribute)::

       loader = ItemLoader(product)
       loader.context['unit'] = 'cm'

2. On Item Loader instantiation (the keyword arguments of the Item Loader
   constructor are stored in the Item Loader context)::

       p = ItemLoader(product, unit='cm')

3. On Item Loader declaration, for those input/output processors that support
   instantiating them with an Item Loader context. :class:`ApplyConcat` is
   one of them::

       class ProductLoader(ItemLoader):
           length_out = ApplyConcat(parse_length, unit='cm')

ItemLoader objects
==================

.. class:: ItemLoader([item], \**kwargs)

    Return a new Item Loader for populating the given Item. If no item is
    given, one is instantiated automatically using the class in
    :attr:`default_item_class`.

    The item and the remaining keyword arguments are assigned to the Loader
    context (accessible through the :attr:`context` attribute).

    .. method:: add_value(field_name, value)

        Add the given ``value`` for the given field.

        The value is passed through the :ref:`field input processor
        <topics-loaders-processors>` and its result appended to the data
        collected for that field. If the field already contains collected
        data, the new data is added.

        Examples::

            loader.add_value('name', u'Color TV')
            loader.add_value('colours', [u'white', u'blue'])
            loader.add_value('length', u'100', default_unit='cm')

    .. method:: replace_value(field_name, value)

        Similar to :meth:`add_value` but replaces the collected data with the
        new value instead of adding it.

    .. method:: populate_item()

        Populate the item with the data collected so far, and return it. The
        data collected is first passed through the :ref:`field output
        processors <topics-loaders-processors>` to get the final value to
        assign to each item field.

    .. method:: get_collected_values(field_name)

        Return the collected values for the given field.

    .. method:: get_output_value(field_name)

        Return the collected values parsed using the output processor, for
        the given field. This method doesn't populate or modify the item at
        all.

    .. method:: get_input_processor(field_name)

        Return the input processor for the given field.

    .. method:: get_output_processor(field_name)

        Return the output processor for the given field.

    .. attribute:: item

        The :class:`~scrapy.newitem.Item` object being parsed by this Item
        Loader.

    .. attribute:: context

        The currently active :ref:`Context <topics-loaders-context>` of this
        Item Loader.

    .. attribute:: default_item_class

        An Item class (or factory), used to instantiate items when not given
        in the constructor.

    .. attribute:: default_input_processor

        The default input processor to use for those fields which don't
        specify one.

    .. attribute:: default_output_processor

        The default output processor to use for those fields which don't
        specify one.

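For illustration, the difference between adding and replacing collected values
can be sketched with a plain dict of lists (hypothetical code, not the
``ItemLoader`` internals):

```python
# Hypothetical sketch of add vs. replace semantics on collected data
# (not the actual ItemLoader internals).
collected = {}

def add_value(field_name, value):
    # append to whatever was collected before
    collected.setdefault(field_name, []).append(value)

def replace_value(field_name, value):
    # discard previously collected data for the field
    collected[field_name] = [value]

add_value('name', u'Color TV')
add_value('name', u'Colour TV')
print(collected['name'])      # ['Color TV', 'Colour TV']

replace_value('name', u'Plasma TV')
print(collected['name'])      # ['Plasma TV']
```
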
.. class:: XPathItemLoader([item, selector, response], \**kwargs)

    The :class:`XPathItemLoader` class extends the :class:`ItemLoader` class
    providing more convenient mechanisms for extracting data from web pages
    using :ref:`XPath selectors <topics-selectors>`.

    :class:`XPathItemLoader` objects accept two additional parameters in
    their constructors:

    :param selector: The selector to extract data from, when using the
        :meth:`add_xpath` or :meth:`replace_xpath` method.
    :type selector: :class:`~scrapy.xpath.XPathSelector` object

    :param response: The response used to construct the selector using the
        :attr:`default_selector_class`, unless the selector argument is
        given, in which case this argument is ignored.
    :type response: :class:`~scrapy.http.Response` object

    .. method:: add_xpath(field_name, xpath, re=None)

        Similar to :meth:`ItemLoader.add_value` but receives an XPath instead
        of a value, which is used to extract a list of unicode strings from
        the selector associated with this :class:`XPathItemLoader`. If the
        ``re`` argument is given, it's used for extracting data from the
        selector using the :meth:`~scrapy.xpath.XPathSelector.re` method.

        :param xpath: the XPath to extract data from
        :type xpath: str

        :param re: a regular expression to use for extracting data from the
            selected XPath region
        :type re: str or compiled regex

        Examples::

            # HTML snippet: <p class="product-name">Color TV</p>
            loader.add_xpath('name', '//p[@class="product-name"]')
            # HTML snippet: <p id="price">the price is $1200</p>
            loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')

    .. method:: replace_xpath(field_name, xpath, re=None)

        Similar to :meth:`add_xpath` but replaces collected data instead of
        adding it.

    .. attribute:: default_selector_class

        The class used to construct the :attr:`selector` of this
        :class:`XPathItemLoader`, if only a response is given in the
        constructor. If a selector is given in the constructor this attribute
        is ignored. This attribute is sometimes overridden in subclasses.

    .. attribute:: selector

        The :class:`~scrapy.xpath.XPathSelector` object to extract data from.
        It's either the selector given in the constructor or one created from
        the response given in the constructor using the
        :attr:`default_selector_class`. This attribute is meant to be
        read-only.

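The selector/response handling described above can be sketched as follows
(a hypothetical helper and a dummy selector class, not the actual
:class:`XPathItemLoader` constructor code):

```python
# Hypothetical sketch of the selector/response resolution described above
# (not the actual XPathItemLoader constructor).
class DummySelector(object):
    """Stands in for a default_selector_class such as an XPath selector."""
    def __init__(self, response):
        self.response = response

def choose_selector(selector, response, default_selector_class=DummySelector):
    if selector is not None:
        return selector                          # explicit selector wins
    if response is not None:
        return default_selector_class(response)  # built from the response
    raise ValueError('either a selector or a response must be given')

sel = choose_selector(None, 'fake-response')
print(type(sel).__name__)  # DummySelector
```
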
.. _topics-loaders-extending:

Reusing and extending Item Loaders
==================================

As your project grows bigger and acquires more and more spiders, maintenance
becomes a fundamental problem, especially when you have to deal with many
different parsing rules for each spider, with a lot of exceptions, while also
wanting to reuse the common processors.

Item Loaders are designed to ease the maintenance burden of parsing rules,
without losing flexibility and, at the same time, providing a convenient
mechanism for extending and overriding them. For this reason Item Loaders
support traditional Python class inheritance for dealing with differences of
specific spiders (or groups of spiders).

Suppose, for example, that some particular site encloses their product names
in three dashes (e.g. ``---Plasma TV---``) and you don't want to end up
scraping those dashes in the final product names.

Here's how you can remove those dashes by reusing and extending the default
Product Item Loader (``ProductLoader``)::

    from scrapy.contrib.loader.processor import ApplyConcat
    from myproject.ItemLoaders import ProductLoader

    def strip_dashes(x):
        return x.strip('-')

    class SiteSpecificLoader(ProductLoader):
        name_in = ApplyConcat(ProductLoader.name_in, strip_dashes)

Another case where extending Item Loaders can be very helpful is when you have
multiple source formats, for example XML and HTML. In the XML version you may
want to remove ``CDATA`` occurrences. Here's an example of how to do it::

    from scrapy.contrib.loader.processor import ApplyConcat
    from myproject.ItemLoaders import ProductLoader
    from myproject.utils.xml import remove_cdata

    class XmlProductLoader(ProductLoader):
        name_in = ApplyConcat(remove_cdata, ProductLoader.name_in)

And that's how you typically extend input processors.

As for output processors, it is more common to declare them in the field
metadata, as they usually depend only on the field and not on each specific
site parsing rule (as input processors do). See also:
:ref:`topics-loaders-processors-declaring`.

There are many other possible ways to extend, inherit and override your Item
Loaders, and different Item Loader hierarchies may fit better for different
projects. Scrapy only provides the mechanism; it doesn't impose any specific
organization of your Loaders collection - that's up to you and your project
needs.

.. _topics-loaders-available-processors:

Available built-in processors
=============================

Even though you can use any callable function as input and output processors,
Scrapy provides some commonly used processors, which are described below. Some
of them, like :class:`ApplyConcat` (which is typically used as input
processor), compose the output of several functions executed in order, to
produce the final parsed value.

Here is a list of all built-in processors:

.. _topics-loaders-applyconcat:

ApplyConcat processor
---------------------

The ApplyConcat processor is the recommended processor to use if you want to
concatenate the processing of several functions in a pipeline.

.. module:: scrapy.contrib.loader.processor
   :synopsis: A collection of processors to use with Item Loaders

.. class:: ApplyConcat(\*functions, \**default_loader_context)

    A processor which applies the given functions consecutively, in order,
    concatenating their results before the next function call. Each function
    returns a list of values (though it could return ``None`` or a single
    value too), and the next function is called once for each of those
    values, receiving one of them as input each time. The output of each
    function call (for each input value) is concatenated, and each value of
    the concatenation is used to call the next function; the process repeats
    until there are no functions left.

    Each function can optionally receive a ``loader_context`` parameter,
    which will contain the currently active :ref:`Item Loader context
    <topics-loaders-context>`.

    The keyword arguments passed in the constructor are used as the default
    Item Loader context values passed on each function call. However, the
    final Item Loader context values passed to functions get overridden by
    the currently active Item Loader context, accessible through the
    :attr:`ItemLoader.context` attribute.

    Example::

        >>> def filter_world(x):
        ...     return None if x == 'world' else x
        ...
        >>> from scrapy.contrib.loader.processor import ApplyConcat
        >>> proc = ApplyConcat(filter_world, str.upper)
        >>> proc(['hello', 'world', 'this', 'is', 'scrapy'])
        ['HELLO', 'THIS', 'IS', 'SCRAPY']

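To clarify the flatten-then-apply behaviour, here is a minimal plain-Python
reimplementation sketch of the semantics described above (illustrative only,
not the actual ``ApplyConcat`` code):

```python
# Minimal sketch of the ApplyConcat semantics described above
# (illustrative, not the actual implementation).
def apply_concat(*functions):
    def process(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is None:                  # None results are dropped
                    continue
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)      # lists are concatenated
                else:
                    next_values.append(result)      # single values appended
            values = next_values
        return values
    return process

proc = apply_concat(lambda x: x.split('-'), str.upper)
print(proc(['plasma-tv', 'dvd']))  # ['PLASMA', 'TV', 'DVD']
```
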
.. class:: TakeFirst

    Return the first non-null/non-empty value from the values received, so
    it's typically used as an output processor for single-valued fields. It
    doesn't receive any constructor arguments, nor accept an Item Loader
    context.

    Example::

        >>> from scrapy.contrib.loader.processor import TakeFirst
        >>> proc = TakeFirst()
        >>> proc(['', 'one', 'two', 'three'])
        'one'

.. class:: Identity

    Return the original values unchanged. It doesn't receive any constructor
    arguments, nor accept an Item Loader context.

    Example::

        >>> from scrapy.contrib.loader.processor import Identity
        >>> proc = Identity()
        >>> proc(['one', 'two', 'three'])
        ['one', 'two', 'three']

.. class:: Join(separator=u' ')

    Return the values joined with the separator given in the constructor,
    which defaults to ``u' '``. It doesn't accept an Item Loader context.

    When using the default separator, this processor is equivalent to the
    function ``u' '.join``.

    Examples::

        >>> from scrapy.contrib.loader.processor import Join
        >>> proc = Join()
        >>> proc(['one', 'two', 'three'])
        u'one two three'
        >>> proc = Join('<br>')
        >>> proc(['one', 'two', 'three'])
        u'one<br>two<br>three'