Mirror of https://github.com/scrapy/scrapy.git (synced 2025-02-23 18:44:05 +00:00)
Moved Item Loader to its final location in scrapy.contrib.loader, and updated doc/tests
--HG-- rename : docs/experimental/itemparser.rst => docs/experimental/loaders.rst rename : scrapy/contrib/itemparser/__init__.py => scrapy/contrib/loader/__init__.py rename : scrapy/contrib/itemparser/common.py => scrapy/contrib/loader/common.py rename : scrapy/contrib/itemparser/parsers.py => scrapy/contrib/loader/processor.py rename : scrapy/tests/test_itemparser.py => scrapy/tests/test_contrib_loader.py
This commit is contained in: parent 7cbbc3ffb0, commit 1dc592882b
.. _topics-itemparser:

============
Item Parsers
============

.. module:: scrapy.contrib.itemparser
   :synopsis: Item Parser class

Item Parsers provide a convenient mechanism for populating scraped :ref:`Items
<topics-newitems>`. Even though Items can be populated using their own
dictionary-like API, Item Parsers provide a much more convenient API for
populating them from a scraping process, by automating some common tasks like
parsing the raw extracted data before assigning it.

In other words, :ref:`Items <topics-newitems>` provide the *container* of
scraped data, while Item Parsers provide the mechanism for *populating* that
container.

Item Parsers are designed to provide a flexible, efficient and easy mechanism
for extending and overriding different field parsing rules, either by spider
or by source format (HTML, XML, etc.), without becoming a nightmare to
maintain.

Using Item Parsers to populate items
====================================

To use an Item Parser, you must first instantiate it. You can either
instantiate it with an Item object or without one, in which case an Item is
automatically instantiated in the Item Parser constructor, using the Item
class specified in the :attr:`ItemParser.default_item_class` attribute.

Then, you start collecting values into the Item Parser, typically using
:ref:`XPath Selectors <topics-selectors>`. You can add more than one value to
the same item field; the Item Parser will know how to "join" those values
later using an appropriate parser function.

Here is a typical Item Parser usage in a :ref:`Spider <topics-spiders>`, using
the :ref:`Product item <topics-newitems-declaring>` declared in the :ref:`Items
chapter <topics-newitems>`::

    from scrapy.contrib.itemparser import XPathItemParser
    from scrapy.xpath import HtmlXPathSelector
    from myproject.items import Product

    def parse(self, response):
        p = XPathItemParser(item=Product(), response=response)
        p.add_xpath('name', '//div[@class="product_name"]')
        p.add_xpath('name', '//div[@class="product_title"]')
        p.add_xpath('price', '//p[@id="price"]')
        p.add_xpath('stock', '//p[@id="stock"]')
        p.add_value('last_updated', 'today') # you can also use literal values
        return p.populate_item()
By quickly looking at that code, we can see the ``name`` field is being
extracted from two different XPath locations in the page:

1. ``//div[@class="product_name"]``
2. ``//div[@class="product_title"]``

In other words, data is being collected by extracting it from two XPath
locations, using the :meth:`~XPathItemParser.add_xpath` method. This is the
data that will be assigned to the ``name`` field later.

Afterwards, similar calls are used for the ``price`` and ``stock`` fields, and
finally the ``last_updated`` field is populated directly with a literal value
(``today``) using a different method: :meth:`~ItemParser.add_value`.

Finally, when all data is collected, the :meth:`ItemParser.populate_item`
method is called, which actually populates and returns the item, populated
with the data previously extracted and collected through the
:meth:`~XPathItemParser.add_xpath` and :meth:`~ItemParser.add_value` calls.

.. _topics-itemparser-parsers:

Input and Output parsers
========================

An Item Parser contains one input parser and one output parser for each (item)
field. The input parser processes the extracted data as soon as it's received
(through the :meth:`~XPathItemParser.add_xpath` or
:meth:`~ItemParser.add_value` methods) and the result of the input parser is
collected and kept inside the ItemParser. After collecting all data, the
:meth:`ItemParser.populate_item` method is called to populate and get the
populated :class:`~scrapy.newitem.Item` object. That's when the output parser
is called with the data previously collected (and processed using the input
parser). The result of the output parser is the final value that gets assigned
to the item.

Let's see an example to illustrate how the input and output parsers are called
for a particular field (the same applies to any other field)::

    p = XPathItemParser(Product(), some_xpath_selector)
    p.add_xpath('name', xpath1) # (1)
    p.add_xpath('name', xpath2) # (2)
    return p.populate_item() # (3)

So what happens is:

1. Data from ``xpath1`` is extracted, and passed through the *input parser* of
   the ``name`` field. The result of the input parser is collected and kept in
   the Item Parser (but not yet assigned to the item).

2. Data from ``xpath2`` is extracted, and passed through the same *input
   parser* used in (1). The result of the input parser is appended to the data
   collected in (1) (if any).

3. The data collected in (1) and (2) is passed through the *output parser* of
   the ``name`` field. The result of the output parser is the value assigned
   to the ``name`` field in the item.
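The three steps above amount to a simple collect-then-flush pattern, which can be sketched in plain Python. This is only an illustration of the mechanism, not the actual ItemParser implementation, and the two parser functions are hypothetical:

```python
def input_parser(values):
    # hypothetical input parser: strip whitespace from each extracted string
    return [v.strip() for v in values]

def output_parser(values):
    # hypothetical output parser: join all collected values with a space
    return u' '.join(values)

collected = []                                   # data kept inside the parser
collected.extend(input_parser([' Plasma TV ']))  # (1) data from xpath1
collected.extend(input_parser([' 42 inches ']))  # (2) data from xpath2
name = output_parser(collected)                  # (3) final value for the field
print(name)  # Plasma TV 42 inches
```

Note how the input parser runs once per ``add_xpath`` call, while the output parser runs once per field, over everything collected.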
It's worth noticing that parsers are just callable objects, which are called
with the data to be parsed, and return a parsed value. So you can use any
function as an input or output parser, provided it accepts a single positional
(required) argument.

The other thing you need to keep in mind is that the values returned by input
parsers are collected internally (in lists) and then passed to output parsers
to populate the fields, so output parsers should expect iterables as input.

Last, but not least, Scrapy comes with some :ref:`commonly used parsers
<topics-itemparser-available-parsers>` built in for convenience.


Declaring Item Parsers
======================

Item Parsers are declared like Items, by using a class definition syntax. Here
is an example::

    from scrapy.contrib.itemparser import ItemParser
    from scrapy.contrib.itemparser.parsers import TakeFirst, ApplyConcat, Join

    class ProductParser(ItemParser):

        default_input_parser = TakeFirst()

        name_in = ApplyConcat(unicode.title)
        name_out = Join()

        price_in = ApplyConcat(unicode.strip)
        price_out = TakeFirst()

        # ...
As you can see, input parsers are declared using the ``_in`` suffix while
output parsers are declared using the ``_out`` suffix. You can also declare
default input/output parsers using the
:attr:`ItemParser.default_input_parser` and
:attr:`ItemParser.default_output_parser` attributes.

.. _topics-itemparser-parsers-declaring:

Declaring Input and Output Parsers
==================================

As seen in the previous section, input and output parsers can be declared in
the Item Parser definition, and it's very common to declare input parsers this
way. However, there is one more place where you can specify the input and
output parsers to use: in the :ref:`Item Field <topics-newitems-fields>`
metadata. Here is an example::

    from scrapy.newitem import Item, Field
    from scrapy.contrib.itemparser.parsers import ApplyConcat, Join, TakeFirst

    from scrapy.utils.markup import remove_entities
    from myproject.utils import filter_prices

    class Product(Item):
        name = Field(
            input_parser=ApplyConcat(remove_entities),
            output_parser=Join(),
        )
        price = Field(
            default=0,
            input_parser=ApplyConcat(remove_entities, filter_prices),
            output_parser=TakeFirst(),
        )

The precedence order, for both input and output parsers, is as follows:

1. Item Parser field-specific attributes: ``field_in`` and ``field_out``
   (highest precedence)
2. Field metadata (the ``input_parser`` and ``output_parser`` keys)
3. Item Parser defaults: :attr:`ItemParser.default_input_parser` and
   :attr:`ItemParser.default_output_parser` (lowest precedence)
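That precedence order can be sketched as a lookup in plain Python. The ``_DemoParser`` class and the lookup function below are illustrative only, not Scrapy's actual internals:

```python
def get_input_parser(parser, field_name):
    # 1. Item Parser field-specific attribute (highest precedence)
    specific = getattr(parser, field_name + '_in', None)
    if specific is not None:
        return specific
    # 2. "input_parser" key in the Field metadata
    field_meta = parser.item.fields.get(field_name, {})
    if 'input_parser' in field_meta:
        return field_meta['input_parser']
    # 3. Item Parser default (lowest precedence)
    return parser.default_input_parser

class _DemoParser(object):
    # hypothetical parser with one field-specific parser declared
    default_input_parser = 'DEFAULT'
    name_in = 'FIELD_SPECIFIC'
    class item:
        fields = {'name': {'input_parser': 'METADATA'},
                  'price': {'input_parser': 'METADATA'},
                  'stock': {}}

p = _DemoParser()
print(get_input_parser(p, 'name'))   # FIELD_SPECIFIC
print(get_input_parser(p, 'price'))  # METADATA
print(get_input_parser(p, 'stock'))  # DEFAULT
```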
See also: :ref:`topics-itemparser-extending`.

.. _topics-itemparser-context:

Item Parser Context
===================

The Item Parser Context is a dict of arbitrary key/values which is shared
among all input and output parsers in the Item Parser. It can be passed when
declaring, instantiating or using an Item Parser, and it is used to modify the
behaviour of the input/output parsers.

For example, suppose you have a function ``parse_length`` which receives a
text value and extracts a length from it::

    def parse_length(text, parser_context):
        unit = parser_context.get('unit', 'm')
        # ... length parsing code goes here ...
        return parsed_length

By accepting a ``parser_context`` argument the function is explicitly telling
the Item Parser that it is able to receive an Item Parser context, so the Item
Parser passes the currently active context when calling it, and the parser
function (``parse_length`` in this case) can thus use it.

There are several ways to modify Item Parser context values:

1. By modifying the currently active Item Parser context (the
   :attr:`ItemParser.context` attribute)::

       parser = ItemParser(product)
       parser.context['unit'] = 'cm'

2. On Item Parser instantiation (the keyword arguments of the Item Parser
   constructor are stored in the Item Parser context)::

       p = ItemParser(product, unit='cm')

3. On Item Parser declaration, for those input/output parsers that support
   instantiating them with an Item Parser context. :class:`ApplyConcat` is one
   of them::

       class ProductParser(ItemParser):
           length_out = ApplyConcat(parse_length, unit='cm')
ItemParser objects
==================

.. class:: ItemParser([item], \**kwargs)

    Return a new Item Parser for populating the given Item. If no item is
    given, one is instantiated automatically using the class in
    :attr:`default_item_class`.

    The item and the remaining keyword arguments are assigned to the Parser
    context (accessible through the :attr:`context` attribute).

    .. method:: add_value(field_name, value)

        Add the given ``value`` for the given field.

        The value is passed through the :ref:`field input parser
        <topics-itemparser-parsers>` and its result appended to the data
        collected for that field. If the field already contains collected
        data, the new data is added.

        Examples::

            parser.add_value('name', u'Color TV')
            parser.add_value('colours', [u'white', u'blue'])
            parser.add_value('length', u'100', default_unit='cm')

    .. method:: replace_value(field_name, value)

        Similar to :meth:`add_value` but replaces the collected data with the
        new value instead of adding to it.

    .. method:: populate_item()

        Populate the item with the data collected so far, and return it. The
        collected data is first passed through the :ref:`field output parsers
        <topics-itemparser-parsers>` to get the final value to assign to each
        item field.

    .. method:: get_collected_values(field_name)

        Return the collected values for the given field.

    .. method:: get_output_value(field_name)

        Return the collected values, parsed using the output parser, for the
        given field. This method doesn't populate or modify the item at all.

    .. method:: get_input_parser(field_name)

        Return the input parser for the given field.

    .. method:: get_output_parser(field_name)

        Return the output parser for the given field.

    .. attribute:: item

        The :class:`~scrapy.newitem.Item` object being parsed by this Item
        Parser.

    .. attribute:: context

        The currently active :ref:`Context <topics-itemparser-context>` of
        this Item Parser.

    .. attribute:: default_item_class

        An Item class (or factory), used to instantiate items when not given
        in the constructor.

    .. attribute:: default_input_parser

        The default input parser to use for those fields which don't specify
        one.

    .. attribute:: default_output_parser

        The default output parser to use for those fields which don't specify
        one.
.. class:: XPathItemParser([item, selector, response], \**kwargs)

    The :class:`XPathItemParser` class extends the :class:`ItemParser` class,
    providing more convenient mechanisms for extracting data from web pages
    using :ref:`XPath selectors <topics-selectors>`.

    :class:`XPathItemParser` objects accept two additional parameters in
    their constructors:

    :param selector: The selector to extract data from, when using the
        :meth:`add_xpath` or :meth:`replace_xpath` method.
    :type selector: :class:`~scrapy.xpath.XPathSelector` object

    :param response: The response used to construct the selector using the
        :attr:`default_selector_class`, unless the selector argument is
        given, in which case this argument is ignored.
    :type response: :class:`~scrapy.http.Response` object

    .. method:: add_xpath(field_name, xpath, re=None)

        Similar to :meth:`ItemParser.add_value` but receives an XPath instead
        of a value, which is used to extract a list of unicode strings from
        the selector associated with this :class:`XPathItemParser`. If the
        ``re`` argument is given, it's used for extracting data from the
        selector using the :meth:`~scrapy.xpath.XPathSelector.re` method.

        :param xpath: the XPath to extract data from
        :type xpath: str

        :param re: a regular expression to use for extracting data from the
            selected XPath region
        :type re: str or compiled regex

        Examples::

            # HTML snippet: <p class="product-name">Color TV</p>
            parser.add_xpath('name', '//p[@class="product-name"]')
            # HTML snippet: <p id="price">the price is $1200</p>
            parser.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')

    .. method:: replace_xpath(field_name, xpath, re=None)

        Similar to :meth:`add_xpath` but replaces collected data instead of
        adding to it.

    .. attribute:: default_selector_class

        The class used to construct the :attr:`selector` of this
        :class:`XPathItemParser`, if only a response is given in the
        constructor. If a selector is given in the constructor, this
        attribute is ignored. This attribute is sometimes overridden in
        subclasses.

    .. attribute:: selector

        The :class:`~scrapy.xpath.XPathSelector` object to extract data from.
        It's either the selector given in the constructor or one created from
        the response given in the constructor, using the
        :attr:`default_selector_class`. This attribute is meant to be
        read-only.
.. _topics-itemparser-extending:

Reusing and extending Item Parsers
==================================

As your project grows bigger and acquires more and more spiders, maintenance
becomes a fundamental problem, especially when you have to deal with many
different parsing rules for each spider, with a lot of exceptions, but also
want to reuse the common parsers.

Item Parsers are designed to ease the maintenance burden of parsing rules,
without losing flexibility and, at the same time, providing a convenient
mechanism for extending and overriding them. For this reason Item Parsers
support traditional Python class inheritance for dealing with differences
between specific spiders (or groups of spiders).

Suppose, for example, that some particular site encloses its product names in
three dashes (e.g. ``---Plasma TV---``) and you don't want to end up scraping
those dashes in the final product names.

Here's how you can remove those dashes by reusing and extending the default
Product Item Parser (``ProductParser``)::

    from scrapy.contrib.itemparser.parsers import ApplyConcat
    from myproject.itemparsers import ProductParser

    def strip_dashes(x):
        return x.strip('-')

    class SiteSpecificParser(ProductParser):
        name_in = ApplyConcat(ProductParser.name_in, strip_dashes)

Another case where extending Item Parsers can be very helpful is when you have
multiple source formats, for example XML and HTML. In the XML version you may
want to remove ``CDATA`` occurrences. Here's an example of how to do it::

    from scrapy.contrib.itemparser.parsers import ApplyConcat
    from myproject.itemparsers import ProductParser
    from myproject.utils.xml import remove_cdata

    class XmlProductParser(ProductParser):
        name_in = ApplyConcat(remove_cdata, ProductParser.name_in)

And that's how you typically extend input parsers.

As for output parsers, it is more common to declare them in the field
metadata, as they usually depend only on the field and not on each specific
site's parsing rules (as input parsers do). See also:
:ref:`topics-itemparser-parsers-declaring`.

There are many other possible ways to extend, inherit and override your Item
Parsers, and different Item Parser hierarchies may fit different projects
better. Scrapy only provides the mechanism; it doesn't impose any specific
organization of your Parsers collection - that's up to you and your project's
needs.
.. _topics-itemparser-available-parsers:

Available built-in parsers
==========================

Even though you can use any callable function as an input or output parser,
Scrapy provides some commonly used parsers, which are described below. Some of
them, like :class:`ApplyConcat` (which is typically used as an input parser),
compose the output of several functions executed in order, to produce the
final parsed value.

Here is a list of all built-in parsers:

.. _topics-itemparser-Applyconcat:

ApplyConcat parser
------------------

The ApplyConcat parser is the recommended parser to use if you want to
concatenate the processing of several functions in a pipeline.

.. module:: scrapy.contrib.itemparser.parsers
   :synopsis: Parser functions to use with Item Parsers

.. class:: ApplyConcat(\*functions, \**default_parser_context)

    A parser which applies the given functions consecutively, in order,
    concatenating their results before each subsequent function call. Each
    function returns a list of values (though it could also return ``None``
    or a single value), and the next function is called once for each of
    those values, receiving one of them as input each time. The output of
    each function call (for each input value) is concatenated, and each value
    of the concatenation is used to call the next function; the process
    repeats until there are no functions left.
    Each function can optionally receive a ``parser_context`` parameter,
    which will contain the currently active :ref:`Item Parser context
    <topics-itemparser-context>`.

    The keyword arguments passed in the constructor are used as the default
    Item Parser context values passed on each function call. However, the
    final Item Parser context values passed to functions get overridden by
    the currently active Item Parser context, accessible through the
    :attr:`ItemParser.context` attribute.

    Example::

        >>> def filter_world(x):
        ...     return None if x == 'world' else x
        ...
        >>> from scrapy.contrib.itemparser.parsers import ApplyConcat
        >>> parser = ApplyConcat(filter_world, str.upper)
        >>> parser(['hello', 'world', 'this', 'is', 'scrapy'])
        ['HELLO', 'THIS', 'IS', 'SCRAPY']
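The apply-and-concatenate behaviour described above can be sketched in a few lines of plain Python. This is a simplified stand-in for the real ``ApplyConcat``, ignoring parser-context handling, intended only to make the flattening rules concrete:

```python
def apply_concat(*functions):
    # Simplified stand-in for ApplyConcat: each function is called once per
    # value; list results are flattened, None results are dropped, and the
    # concatenated list feeds the next function.
    def parse(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is None:                    # None results are dropped
                    continue
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)        # lists are concatenated
                else:
                    next_values.append(result)        # single values are appended
            values = next_values
        return values
    return parse

def filter_world(x):
    return None if x == 'world' else x

parser = apply_concat(filter_world, str.upper)
print(parser(['hello', 'world', 'this', 'is', 'scrapy']))
# ['HELLO', 'THIS', 'IS', 'SCRAPY']
```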
.. class:: TakeFirst

    Return the first non-null/non-empty value from the values received, so
    it's typically used as the output parser of single-valued fields. It
    doesn't receive any constructor arguments, nor accept an Item Parser
    context.

    Example::

        >>> from scrapy.contrib.itemparser.parsers import TakeFirst
        >>> parser = TakeFirst()
        >>> parser(['', 'one', 'two', 'three'])
        'one'

.. class:: Identity

    Return the original values unchanged. It doesn't receive any constructor
    arguments nor accept an Item Parser context.

    Example::

        >>> from scrapy.contrib.itemparser.parsers import Identity
        >>> parser = Identity()
        >>> parser(['one', 'two', 'three'])
        ['one', 'two', 'three']

.. class:: Join(separator=u' ')

    Return the values joined with the separator given in the constructor,
    which defaults to ``u' '``. It doesn't accept an Item Parser context.

    When using the default separator, this parser is equivalent to the
    function ``u' '.join``.

    Examples::

        >>> from scrapy.contrib.itemparser.parsers import Join
        >>> parser = Join()
        >>> parser(['one', 'two', 'three'])
        u'one two three'
        >>> parser = Join('<br>')
        >>> parser(['one', 'two', 'three'])
        u'one<br>two<br>three'
527
docs/experimental/loaders.rst
Normal file
527
docs/experimental/loaders.rst
Normal file
@ -0,0 +1,527 @@
|
||||
.. _topics-loaders:
|
||||
|
||||
============
|
||||
Item Loaders
|
||||
============
|
||||
|
||||
.. module:: scrapy.contrib.loader
|
||||
:synopsis: Item Loader class
|
||||
|
||||
Item Loaders provide a convenient mechanism for populating scraped :ref:`Items
|
||||
<topics-newitems>`. Even though Items can be populated using their own
|
||||
dictionary-like API, the Item Loaders provide a much more convenient API for
|
||||
populating them from a scraping process, by automating some common tasks like
|
||||
parsing the raw extracted data before assigning it.
|
||||
|
||||
In other words, :ref:`Items <topics-newitems>` provide the *container* of
|
||||
scraped data, while Item Loaders provide the mechanism for *populating* that
|
||||
container.
|
||||
|
||||
Item Loaders are designed to provide a flexible, efficient and easy mechanism
|
||||
for extending and overriding different field parsing rules, either by spider,
|
||||
or by source format (HTML, XML, etc) without becoming a nightmare to maintain.
|
||||
|
||||
Using Item Loaders to populate items
|
||||
====================================
|
||||
|
||||
To use an Item Loader, you must first instantiate it. You can either
|
||||
instantiate it with an Item object or without one, in which case an Item is
|
||||
automatically instantiated in the Item Loader constructor using the Item class
|
||||
specified in the :attr:`ItemLoader.default_item_class` attribute.
|
||||
|
||||
Then, you start collecting values into the Item Loader, typically using using
|
||||
:ref:`XPath Selectors <topics-selectors>`. You can add more than one value to
|
||||
the same item field, the Item Loader will know how to "join" those values later
|
||||
using a proper processing function.
|
||||
|
||||
Here is a typical Item Loader usage in a :ref:`Spider <topics-spiders>`, using
|
||||
the :ref:`Product item <topics-newitems-declaring>` declared in the :ref:`Items
|
||||
chapter <topics-newitems>`::
|
||||
|
||||
from scrapy.contrib.loader import XPathItemLoader
|
||||
from scrapy.xpath import HtmlXPathSelector
|
||||
from myproject.items import Product
|
||||
|
||||
def parse(self, response):
|
||||
p = XPathItemLoader(item=Product(), response=response)
|
||||
p.add_xpath('name', '//div[@class="product_name"]')
|
||||
p.add_xpath('name', '//div[@class="product_title"]')
|
||||
p.add_xpath('price', '//p[@id="price"]')
|
||||
p.add_xpath('stock', '//p[@id="stock"]')
|
||||
p.add_value('last_updated', 'today') # you can also use literal values
|
||||
return p.populate_item()
|
||||
|
||||
By quickly looking at that code we can see the ``name`` field is being
|
||||
extracted from two different XPath locations in the page:
|
||||
|
||||
1. ``//div[@class="product_name"]``
|
||||
2. ``//div[@class="product_title"]``
|
||||
|
||||
In other words, data is being collected by extracting it from two XPath
|
||||
locations, using the :meth:`~XPathItemLoader.add_xpath` method. This is the data
|
||||
that will be assigned to the ``name`` field later.
|
||||
|
||||
Afterwards, similar calls are used for ``price`` and ``stock`` fields, and
|
||||
finally the ``last_update`` field is populated directly with a literal value
|
||||
(``today``) using a different method: :meth:`~ItemLoader.add_value`.
|
||||
|
||||
Finally, when all data is collected, the :meth:`ItemLoader.populate_item`
|
||||
method is called which actually populates and returns the item populated with
|
||||
the data previously extracted and collected with the
|
||||
:meth:`~XPathItemLoader.add_xpath` and :meth:`~ItemLoader.add_value` calls.
|
||||
|
||||
.. _topics-loaders-processors:
|
||||
|
||||
Input and Output processors
|
||||
===========================
|
||||
|
||||
An Item Loader contains one input processor and one output processor for each
|
||||
(item) field. The input processor processes the extracted data as soon as it's
|
||||
received (through the :meth:`~XPathItemLoader.add_xpath` or
|
||||
:meth:`~ItemLoader.add_value` methods) and the result of the input processor is
|
||||
collected and kept inside the ItemLoader. After collecting all data, the
|
||||
:meth:`ItemLoader.populate_item` method is called to populate and get the
|
||||
populated :class:`~scrapy.newitem.Item` object. That's when the output processor
|
||||
is called with the data previously collected (and processed using the input
|
||||
processor). The result of the output processor is the final value that gets assigned
|
||||
to the item.
|
||||
|
||||
Let's see an example to illustrate how this input and output processors are
|
||||
called for a particular field (the same applies for any other field)::
|
||||
|
||||
p = XPathItemLoader(Product(), some_xpath_selector)
|
||||
p.add_xpath('name', xpath1) # (1)
|
||||
p.add_xpath('name', xpath2) # (2)
|
||||
return p.populate_item() # (3)
|
||||
|
||||
So what happens is:
|
||||
|
||||
1. Data from ``xpath1`` is extracted, and passed through the *input processor* of
|
||||
the ``name`` field. The result of the input processor is collected and kept in
|
||||
the Item Loader (but not yet assigned to the item).
|
||||
|
||||
2. Data from ``xpath2`` is extracted, and passed through the same *input
|
||||
processor* used in (1). The result of the input processor is appended to the
|
||||
data collected in (1) (if any).
|
||||
|
||||
3. The data collected in (1) and (2) is passed through the *output processor* of
|
||||
the ``name`` field. The result of the output processor is the value assigned to
|
||||
the ``name`` field in the item.
|
||||
|
||||
It's worth noticing that processors are just callable objects, which are called
|
||||
with the data to be parsed, and return a parsed value. So you can use any
|
||||
function as input or output processor, provided they can receive only one
|
||||
positional (required) argument.
|
||||
|
||||
The other thing you need to keep in mind is that the values returned by input
|
||||
processors are collected internally (in lists) and then passed to output
|
||||
processors to populate the fields, so output processors should expect iterables as
|
||||
input.
|
||||
|
||||
Last, but not least, Scrapy comes with some :ref:`commonly used processors
|
||||
<topics-loaders-available-processors>` built-in for convenience.
|
||||
|
||||
|
||||
Declaring Item Loaders
|
||||
======================
|
||||
|
||||
Item Loaders are declared like Items, by using a class definition syntax. Here
|
||||
is an example::
|
||||
|
||||
from scrapy.contrib.loader import ItemLoader
|
||||
from scrapy.contrib.loader.processor import TakeFirst, ApplyConcat, Join
|
||||
|
||||
class ProductLoader(ItemLoader):
|
||||
|
||||
default_input_processor = TakeFirst()
|
||||
|
||||
name_in = ApplyConcat(unicode.title)
|
||||
name_out = Join()
|
||||
|
||||
price_in = ApplyConcat(unicode.strip)
|
||||
price_out = TakeFirst()
|
||||
|
||||
# ...
|
||||
|
||||
As you can see, input processors are declared using the ``_in`` suffix while
|
||||
output processors are declared using the ``_out`` suffix. And you can also
|
||||
declare a default input/output processors using the
|
||||
:attr:`ItemLoader.default_input_processor` and
|
||||
:attr:`ItemLoader.default_output_processor` attributes.
|
||||
|
||||
.. _topics-loaders-processors-declaring:
|
||||
|
||||
Declaring Input and Output Processors
|
||||
=====================================
|
||||
|
||||
As seen in the previous section, input and output processors can be declared in
|
||||
the Item Loader definition, and it's very common to declare input processors
|
||||
this way. However, there is one more place where you can specify the input and
|
||||
output processors to use: in the :ref:`Item Field <topics-newitems-fields>`
|
||||
metadata. Here is an example::
|
||||
|
||||
from scrapy.newitem import Item, Field
|
||||
from scrapy.contrib.loader.processor import ApplyConcat, Join, TakeFirst
|
||||
|
||||
from scrapy.utils.markup import remove_entities
|
||||
from myproject.utils import filter_prices
|
||||
|
||||
class Product(Item):
|
||||
name = Field(
|
||||
input_processor=ApplyConcat(remove_entities),
|
||||
output_processor=Join(),
|
||||
)
|
||||
price = Field(
|
||||
default=0,
|
||||
input_processor=ApplyConcat(remove_entities, filter_prices),
|
||||
output_processor=TakeFirst(),
|
||||
)
|
||||
|
||||
The precedence order, for both input and output processors, is as follows:

1. Item Loader field-specific attributes: ``field_in`` and ``field_out``
   (highest precedence)
2. Field metadata (``input_processor`` and ``output_processor`` keys)
3. Item Loader defaults: :attr:`ItemLoader.default_input_processor` and
   :attr:`ItemLoader.default_output_processor` (lowest precedence)

See also: :ref:`topics-loaders-extending`.
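
The lookup order above can be sketched in plain Python. This is a standalone
illustration with made-up names (``DemoLoader``, ``item_fields``, the
``*_proc`` functions), not Scrapy's actual code:

```python
# Resolve a field's input processor using the precedence rules above:
# 1. a `<field>_in` attribute on the loader, 2. field metadata,
# 3. the loader-wide default.
def default_proc(values):       # rule 3: loader default
    return values

def strip_proc(values):         # rule 2: declared in field metadata
    return [v.strip() for v in values]

def title_proc(values):         # rule 1: declared as a loader attribute
    return [v.title() for v in values]

def get_input_processor(loader, field_name):
    proc = getattr(loader, '%s_in' % field_name, None)      # rule 1
    if proc is None:
        proc = loader.item_fields.get(field_name, {}).get(
            'input_processor',                              # rule 2
            loader.default_input_processor)                 # rule 3
    return proc

class DemoLoader:
    default_input_processor = staticmethod(default_proc)
    item_fields = {'name': {'input_processor': strip_proc},
                   'price': {'input_processor': strip_proc}}
    name_in = staticmethod(title_proc)   # overrides the field metadata

loader = DemoLoader()
print(get_input_processor(loader, 'name') is title_proc)    # True (rule 1)
print(get_input_processor(loader, 'price') is strip_proc)   # True (rule 2)
print(get_input_processor(loader, 'stock') is default_proc) # True (rule 3)
```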

.. _topics-loaders-context:

Item Loader Context
===================

The Item Loader Context is a dict of arbitrary key/values which is shared among
all input and output processors in the Item Loader. It can be passed when
declaring, instantiating or using an Item Loader, and is used to modify the
behaviour of the input/output processors.

For example, suppose you have a function ``parse_length`` which receives a text
value and extracts a length from it::

    def parse_length(text, loader_context):
        unit = loader_context.get('unit', 'm')
        # ... length parsing code goes here ...
        return parsed_length

By accepting a ``loader_context`` argument the function is explicitly telling
the Item Loader that it is able to receive an Item Loader context, so the Item
Loader passes the currently active context when calling it, and the processor
function (``parse_length`` in this case) can thus use it.
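
The mechanism can be sketched in standalone Python using ``inspect`` and
``functools.partial``; this is an illustration of the idea described above,
not Scrapy's exact implementation, and ``parse_length`` here is a toy version:

```python
import inspect
from functools import partial

def wrap_loader_context(function, context):
    # If `function` declares a `loader_context` parameter, pre-load it with
    # the active context; otherwise hand the function back unchanged.
    if 'loader_context' in inspect.signature(function).parameters:
        return partial(function, loader_context=context)
    return function

def parse_length(text, loader_context=None):
    # toy "length parsing": strip the text and append the configured unit
    unit = (loader_context or {}).get('unit', 'm')
    return '%s %s' % (text.strip(), unit)

def plain_upper(text):
    return text.upper()

wrapped = wrap_loader_context(parse_length, {'unit': 'cm'})
print(wrapped(' 100 '))        # the context's unit is applied: '100 cm'
print(wrap_loader_context(plain_upper, {}) is plain_upper)  # True: unchanged
```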

There are several ways to modify Item Loader context values:

1. By modifying the currently active Item Loader context (the
   :attr:`ItemLoader.context` attribute)::

       loader = ItemLoader(product)
       loader.context['unit'] = 'cm'

2. On Item Loader instantiation (the keyword arguments of the Item Loader
   constructor are stored in the Item Loader context)::

       loader = ItemLoader(product, unit='cm')

3. On Item Loader declaration, for those input/output processors that support
   instantiating them with an Item Loader context. :class:`ApplyConcat` is
   one of them::

       class ProductLoader(ItemLoader):
           length_out = ApplyConcat(parse_length, unit='cm')

ItemLoader objects
==================

.. class:: ItemLoader([item], \**kwargs)

    Return a new Item Loader for populating the given Item. If no item is
    given, one is instantiated automatically using the class in
    :attr:`default_item_class`.

    The item and the remaining keyword arguments are assigned to the Loader
    context (accessible through the :attr:`context` attribute).

    .. method:: add_value(field_name, value)

        Add the given ``value`` for the given field.

        The value is passed through the :ref:`field input processor
        <topics-loaders-processors>` and its result appended to the data
        collected for that field. If the field already contains collected
        data, the new data is added.

        Examples::

            loader.add_value('name', u'Color TV')
            loader.add_value('colours', [u'white', u'blue'])
            loader.add_value('length', u'100', default_unit='cm')

    .. method:: replace_value(field_name, value)

        Similar to :meth:`add_value` but replaces the collected data with the
        new value instead of adding it.

    .. method:: populate_item()

        Populate the item with the data collected so far, and return it. The
        data collected is first passed through the :ref:`field output
        processors <topics-loaders-processors>` to get the final value to
        assign to each item field.
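
The collect/populate flow can be sketched with a minimal standalone class
(``MiniLoader`` and its processor dicts are illustrative inventions, not
Scrapy's code): ``add_value`` runs the input processor and appends to the
collected data, and ``populate_item`` runs the output processor per field and
assigns the result:

```python
class MiniLoader:
    def __init__(self, item=None):
        self.item = item if item is not None else {}
        self._values = {}
        # hypothetical per-field processors for the demo
        self.input_processors = {'name': lambda vs: [v.title() for v in vs]}
        self.output_processors = {'name': ' '.join}

    def add_value(self, field_name, value):
        # input processor runs on add, result is appended to collected data
        values = value if isinstance(value, list) else [value]
        proc = self.input_processors.get(field_name, lambda vs: vs)
        self._values.setdefault(field_name, []).extend(proc(values))

    def populate_item(self):
        # output processor runs once per field when the item is populated
        for field_name, values in self._values.items():
            proc = self.output_processors.get(field_name, lambda vs: vs)
            self.item[field_name] = proc(values)
        return self.item

loader = MiniLoader()
loader.add_value('name', 'plasma')
loader.add_value('name', 'tv')       # appended, not replaced
print(loader.populate_item())        # -> {'name': 'Plasma Tv'}
```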

    .. method:: get_collected_values(field_name)

        Return the collected values for the given field.

    .. method:: get_output_value(field_name)

        Return the collected values parsed using the output processor, for
        the given field. This method doesn't populate or modify the item at
        all.

    .. method:: get_input_processor(field_name)

        Return the input processor for the given field.

    .. method:: get_output_processor(field_name)

        Return the output processor for the given field.

    .. attribute:: item

        The :class:`~scrapy.newitem.Item` object being parsed by this Item
        Loader.

    .. attribute:: context

        The currently active :ref:`Context <topics-loaders-context>` of this
        Item Loader.

    .. attribute:: default_item_class

        An Item class (or factory), used to instantiate items when not given
        in the constructor.

    .. attribute:: default_input_processor

        The default input processor to use for those fields which don't
        specify one.

    .. attribute:: default_output_processor

        The default output processor to use for those fields which don't
        specify one.

.. class:: XPathItemLoader([item, selector, response], \**kwargs)

    The :class:`XPathItemLoader` class extends the :class:`ItemLoader` class
    providing more convenient mechanisms for extracting data from web pages
    using :ref:`XPath selectors <topics-selectors>`.

    :class:`XPathItemLoader` objects accept two additional parameters in
    their constructors:

    :param selector: The selector to extract data from, when using the
        :meth:`add_xpath` or :meth:`replace_xpath` method.
    :type selector: :class:`~scrapy.xpath.XPathSelector` object

    :param response: The response used to construct the selector using the
        :attr:`default_selector_class`, unless the selector argument is
        given, in which case this argument is ignored.
    :type response: :class:`~scrapy.http.Response` object

    .. method:: add_xpath(field_name, xpath, re=None)

        Similar to :meth:`ItemLoader.add_value` but receives an XPath
        instead of a value, which is used to extract a list of unicode
        strings from the selector associated with this
        :class:`XPathItemLoader`. If the ``re`` argument is given, it's used
        for extracting data from the selector using the
        :meth:`~scrapy.xpath.XPathSelector.re` method.

        :param xpath: the XPath to extract data from
        :type xpath: str

        :param re: a regular expression to use for extracting data from the
            selected XPath region
        :type re: str or compiled regex

        Examples::

            # HTML snippet: <p class="product-name">Color TV</p>
            loader.add_xpath('name', '//p[@class="product-name"]')
            # HTML snippet: <p id="price">the price is $1200</p>
            loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')

    .. method:: replace_xpath(field_name, xpath, re=None)

        Similar to :meth:`add_xpath` but replaces collected data instead of
        adding it.

    .. attribute:: default_selector_class

        The class used to construct the :attr:`selector` of this
        :class:`XPathItemLoader`, if only a response is given in the
        constructor. If a selector is given in the constructor this
        attribute is ignored. This attribute is sometimes overridden in
        subclasses.

    .. attribute:: selector

        The :class:`~scrapy.xpath.XPathSelector` object to extract data
        from. It's either the selector given in the constructor or one
        created from the response given in the constructor using the
        :attr:`default_selector_class`. This attribute is meant to be
        read-only.

.. _topics-loaders-extending:

Reusing and extending Item Loaders
==================================

As your project grows bigger and acquires more and more spiders, maintenance
becomes a fundamental problem, especially when you have to deal with many
different parsing rules for each spider, with a lot of exceptions, but also
want to reuse the common processors.

Item Loaders are designed to ease the maintenance burden of parsing rules,
without losing flexibility and, at the same time, providing a convenient
mechanism for extending and overriding them. For this reason Item Loaders
support traditional Python class inheritance for dealing with differences of
specific spiders (or groups of spiders).

Suppose, for example, that some particular site encloses their product names
in three dashes (e.g. ``---Plasma TV---``) and you don't want to end up
scraping those dashes in the final product names.

Here's how you can remove those dashes by reusing and extending the default
Product Item Loader (``ProductLoader``)::

    from scrapy.contrib.loader.processor import ApplyConcat
    from myproject.ItemLoaders import ProductLoader

    def strip_dashes(x):
        return x.strip('-')

    class SiteSpecificLoader(ProductLoader):
        name_in = ApplyConcat(ProductLoader.name_in, strip_dashes)

Another case where extending Item Loaders can be very helpful is when you have
multiple source formats, for example XML and HTML. In the XML version you may
want to remove ``CDATA`` occurrences. Here's an example of how to do it::

    from scrapy.contrib.loader.processor import ApplyConcat
    from myproject.ItemLoaders import ProductLoader
    from myproject.utils.xml import remove_cdata

    class XmlProductLoader(ProductLoader):
        name_in = ApplyConcat(remove_cdata, ProductLoader.name_in)

And that's how you typically extend input processors.

As for output processors, it is more common to declare them in the field
metadata, as they usually depend only on the field and not on each specific
site parsing rule (as input processors do). See also:
:ref:`topics-loaders-processors-declaring`.

There are many other possible ways to extend, inherit and override your Item
Loaders, and different Item Loader hierarchies may fit better for different
projects. Scrapy only provides the mechanism; it doesn't impose any specific
organization of your Loaders collection - that's up to you and your project
needs.
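
The extension pattern can be sketched in standalone Python. This is a
simplified illustration of composing a parent loader's processor with an
extra cleanup function (``compose``, ``title_case`` and the loader classes
here are inventions for the demo; real code would subclass Scrapy's
``ItemLoader``):

```python
# Build one processor out of several per-value functions, applied in order.
def compose(*functions):
    def composed(values):
        for func in functions:
            values = [func(v) for v in values]
        return values
    return composed

def title_case(x):
    return x.title()

def strip_dashes(x):
    return x.strip('-')

class ProductLoader:
    name_in = staticmethod(compose(title_case))

class SiteSpecificLoader(ProductLoader):
    # reuse the parent's processing, then strip the site's '---' wrapping
    name_in = staticmethod(compose(title_case, strip_dashes))

print(SiteSpecificLoader.name_in(['---plasma tv---']))  # -> ['Plasma Tv']
```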
.. _topics-loaders-available-processors:

Available built-in processors
=============================

Even though you can use any callable function as input and output processors,
Scrapy provides some commonly used processors, which are described below.
Some of them, like :class:`ApplyConcat` (which is typically used as an input
processor), compose the output of several functions executed in order to
produce the final parsed value.

Here is a list of all built-in processors:

.. _topics-loaders-applyconcat:

ApplyConcat processor
---------------------

The ApplyConcat processor is the recommended processor to use if you want to
concatenate the processing of several functions in a pipeline.

.. module:: scrapy.contrib.loader.processor
   :synopsis: A collection of processors to use with Item Loaders

.. class:: ApplyConcat(\*functions, \**default_loader_context)

    A processor which applies the given functions consecutively, in order,
    concatenating their results before the next function call. Each function
    returns a list of values (though it could also return ``None`` or a
    single value), and the next function is called once for each of those
    values, receiving one of them as input each time. The outputs of each
    function call are concatenated, and each value of the concatenation is
    used to call the next function; the process repeats until there are no
    functions left.
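
The flatten-and-apply behaviour just described can be sketched as a plain
function (an illustration, not Scrapy's exact implementation):

```python
# Apply each function to every current value; drop None results, splice
# list results, and feed the concatenated output to the next function.
def apply_concat(functions, values):
    for func in functions:
        next_values = []
        for v in values:
            result = func(v)
            if result is None:
                continue                      # None results are dropped
            if isinstance(result, (list, tuple)):
                next_values.extend(result)    # lists are concatenated
            else:
                next_values.append(result)    # single values are appended
        values = next_values
    return values

def filter_world(x):
    return None if x == 'world' else x

print(apply_concat([filter_world, str.upper],
                   ['hello', 'world', 'scrapy']))   # -> ['HELLO', 'SCRAPY']
```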

    Each function can optionally receive a ``loader_context`` parameter,
    which will contain the currently active :ref:`Item Loader context
    <topics-loaders-context>`.

    The keyword arguments passed in the constructor are used as the default
    Item Loader context values passed on each function call. However, the
    final Item Loader context values passed to functions get overridden by
    the currently active Item Loader context, accessible through the
    :attr:`ItemLoader.context` attribute.

    Example::

        >>> def filter_world(x):
        ...     return None if x == 'world' else x
        ...
        >>> from scrapy.contrib.loader.processor import ApplyConcat
        >>> proc = ApplyConcat(filter_world, str.upper)
        >>> proc(['hello', 'world', 'this', 'is', 'scrapy'])
        ['HELLO', 'THIS', 'IS', 'SCRAPY']

.. class:: TakeFirst

    Return the first non-null/non-empty value from the values received, so
    it's typically used as an output processor of single-valued fields. It
    doesn't receive any constructor arguments, nor accept an Item Loader
    context.

    Example::

        >>> from scrapy.contrib.loader.processor import TakeFirst
        >>> proc = TakeFirst()
        >>> proc(['', 'one', 'two', 'three'])
        'one'

.. class:: Identity

    Return the original values unchanged. It doesn't receive any constructor
    arguments nor accept an Item Loader context.

    Example::

        >>> from scrapy.contrib.loader.processor import Identity
        >>> proc = Identity()
        >>> proc(['one', 'two', 'three'])
        ['one', 'two', 'three']

.. class:: Join(separator=u' ')

    Return the values joined with the separator given in the constructor,
    which defaults to ``u' '``. It doesn't accept an Item Loader context.

    When using the default separator, this processor is equivalent to the
    function: ``u' '.join``

    Examples::

        >>> from scrapy.contrib.loader.processor import Join
        >>> proc = Join()
        >>> proc(['one', 'two', 'three'])
        u'one two three'
        >>> proc = Join('<br>')
        >>> proc(['one', 'two', 'three'])
        u'one<br>two<br>three'
@@ -1,13 +0,0 @@
-"""Common functions used in Item Parsers code"""
-
-from functools import partial
-from scrapy.utils.python import get_func_args
-
-def wrap_parser_context(function, context):
-    """Wrap functions that receive parser_context to contain those parser
-    arguments pre-loaded and expose a interface that receives only one argument
-    """
-    if 'parser_context' in get_func_args(function):
-        return partial(function, parser_context=context)
-    else:
-        return function
@@ -1,7 +1,7 @@
 """
-Item Parser
+Item Loader

-See documentation in docs/topics/itemparser.rst
+See documentation in docs/topics/loaders.rst
 """

 from collections import defaultdict
@@ -9,14 +9,14 @@ from collections import defaultdict
 from scrapy.newitem import Item
 from scrapy.xpath import HtmlXPathSelector
 from scrapy.utils.misc import arg_to_iter
-from .common import wrap_parser_context
-from .parsers import Identity
+from .common import wrap_loader_context
+from .processor import Identity

-class ItemParser(object):
+class ItemLoader(object):

     default_item_class = Item
-    default_input_parser = Identity()
-    default_output_parser = Identity()
+    default_input_processor = Identity()
+    default_output_processor = Identity()

     def __init__(self, item=None, **context):
         if item is None:
@@ -40,34 +40,34 @@
         return item

     def get_output_value(self, field_name):
-        parser = self.get_output_parser(field_name)
-        parser = wrap_parser_context(parser, self.context)
-        return parser(self._values[field_name])
+        proc = self.get_output_processor(field_name)
+        proc = wrap_loader_context(proc, self.context)
+        return proc(self._values[field_name])

     def get_collected_values(self, field_name):
         return self._values[field_name]

-    def get_input_parser(self, field_name):
-        parser = getattr(self, '%s_in' % field_name, None)
-        if not parser:
-            parser = self.item.fields[field_name].get('input_parser', \
-                self.default_input_parser)
-        return parser
+    def get_input_processor(self, field_name):
+        proc = getattr(self, '%s_in' % field_name, None)
+        if not proc:
+            proc = self.item.fields[field_name].get('input_processor', \
+                self.default_input_processor)
+        return proc

-    def get_output_parser(self, field_name):
-        parser = getattr(self, '%s_out' % field_name, None)
-        if not parser:
-            parser = self.item.fields[field_name].get('output_parser', \
-                self.default_output_parser)
-        return parser
+    def get_output_processor(self, field_name):
+        proc = getattr(self, '%s_out' % field_name, None)
+        if not proc:
+            proc = self.item.fields[field_name].get('output_processor', \
+                self.default_output_processor)
+        return proc

     def _parse_input_value(self, field_name, value):
-        parser = self.get_input_parser(field_name)
-        parser = wrap_parser_context(parser, self.context)
-        return parser(value)
+        proc = self.get_input_processor(field_name)
+        proc = wrap_loader_context(proc, self.context)
+        return proc(value)


-class XPathItemParser(ItemParser):
+class XPathItemLoader(ItemLoader):

     default_selector_class = HtmlXPathSelector

@@ -79,7 +79,7 @@ class XPathItemParser(ItemParser):
             selector = self.default_selector_class(response)
         self.selector = selector
         context.update(selector=selector, response=response)
-        super(XPathItemParser, self).__init__(item, **context)
+        super(XPathItemLoader, self).__init__(item, **context)

     def add_xpath(self, field_name, xpath, re=None):
         self.add_value(field_name, self._get_values(field_name, xpath, re))
13  scrapy/contrib/loader/common.py  Normal file
@@ -0,0 +1,13 @@
+"""Common functions used in Item Loaders code"""
+
+from functools import partial
+from scrapy.utils.python import get_func_args
+
+def wrap_loader_context(function, context):
+    """Wrap functions that receive loader_context to contain the context
+    "pre-loaded" and expose a interface that receives only one argument
+    """
+    if 'loader_context' in get_func_args(function):
+        return partial(function, loader_context=context)
+    else:
+        return function
@@ -1,26 +1,26 @@
 """
-This module provides some commonly used parser functions for Item Parsers.
+This module provides some commonly used processors for Item Loaders.

-See documentation in docs/topics/itemparser.rst
+See documentation in docs/topics/loaders.rst
 """

 from scrapy.utils.misc import arg_to_iter
 from scrapy.utils.datatypes import MergeDict
-from .common import wrap_parser_context
+from .common import wrap_loader_context

 class ApplyConcat(object):

-    def __init__(self, *functions, **default_parser_context):
+    def __init__(self, *functions, **default_loader_context):
         self.functions = functions
-        self.default_parser_context = default_parser_context
+        self.default_loader_context = default_loader_context

-    def __call__(self, value, parser_context=None):
+    def __call__(self, value, loader_context=None):
         values = arg_to_iter(value)
-        if parser_context:
-            context = MergeDict(parser_context, self.default_parser_context)
+        if loader_context:
+            context = MergeDict(loader_context, self.default_loader_context)
         else:
-            context = self.default_parser_context
-        wrapped_funcs = [wrap_parser_context(f, context) for f in self.functions]
+            context = self.default_loader_context
+        wrapped_funcs = [wrap_loader_context(f, context) for f in self.functions]
         for func in wrapped_funcs:
             next_values = []
             for v in values:
@@ -1,7 +1,7 @@
 import unittest

-from scrapy.contrib.itemparser import ItemParser, XPathItemParser
-from scrapy.contrib.itemparser.parsers import ApplyConcat, Join, Identity
+from scrapy.contrib.loader import ItemLoader, XPathItemLoader
+from scrapy.contrib.loader.processor import ApplyConcat, Join, Identity
 from scrapy.newitem import Item, Field
 from scrapy.xpath import HtmlXPathSelector
 from scrapy.http import HtmlResponse
@@ -15,30 +15,30 @@ class TestItem(NameItem):
     url = Field()
     summary = Field()

-# test item parsers
+# test item loaders

-class NameItemParser(ItemParser):
+class NameItemLoader(ItemLoader):
     default_item_class = TestItem

-class TestItemParser(NameItemParser):
+class TestItemLoader(NameItemLoader):
     name_in = ApplyConcat(lambda v: v.title())

-class DefaultedItemParser(NameItemParser):
-    default_input_parser = ApplyConcat(lambda v: v[:-1])
+class DefaultedItemLoader(NameItemLoader):
+    default_input_processor = ApplyConcat(lambda v: v[:-1])

-# test parsers
+# test processors

-def parser_with_args(value, other=None, parser_context=None):
-    if 'key' in parser_context:
-        return parser_context['key']
+def processor_with_args(value, other=None, loader_context=None):
+    if 'key' in loader_context:
+        return loader_context['key']
     return value

-class ItemParserTest(unittest.TestCase):
+class ItemLoaderTest(unittest.TestCase):

     def test_populate_item_using_default_loader(self):
         i = TestItem()
         i['summary'] = u'lala'
-        ip = ItemParser(item=i)
+        ip = ItemLoader(item=i)
         ip.add_value('name', u'marta')
         item = ip.populate_item()
         assert item is i
@@ -46,13 +46,13 @@ class ItemParserTest(unittest.TestCase):
         self.assertEqual(item['name'], [u'marta'])

     def test_populate_item_using_custom_loader(self):
-        ip = TestItemParser()
+        ip = TestItemLoader()
         ip.add_value('name', u'marta')
         item = ip.populate_item()
         self.assertEqual(item['name'], [u'Marta'])

     def test_add_value(self):
-        ip = TestItemParser()
+        ip = TestItemLoader()
         ip.add_value('name', u'marta')
         self.assertEqual(ip.get_collected_values('name'), [u'Marta'])
         self.assertEqual(ip.get_output_value('name'), [u'Marta'])
@@ -61,7 +61,7 @@
         self.assertEqual(ip.get_output_value('name'), [u'Marta', u'Pepe'])

     def test_replace_value(self):
-        ip = TestItemParser()
+        ip = TestItemLoader()
         ip.replace_value('name', u'marta')
         self.assertEqual(ip.get_collected_values('name'), [u'Marta'])
         self.assertEqual(ip.get_output_value('name'), [u'Marta'])
@@ -69,208 +69,208 @@
         self.assertEqual(ip.get_collected_values('name'), [u'Pepe'])
         self.assertEqual(ip.get_output_value('name'), [u'Pepe'])

-    def test_map_concat_filter(self):
+    def test_apply_concat_filter(self):
         def filter_world(x):
             return None if x == 'world' else x

-        parser = ApplyConcat(filter_world, str.upper)
-        self.assertEqual(parser(['hello', 'world', 'this', 'is', 'scrapy']),
+        proc = ApplyConcat(filter_world, str.upper)
+        self.assertEqual(proc(['hello', 'world', 'this', 'is', 'scrapy']),
             ['HELLO', 'THIS', 'IS', 'SCRAPY'])

     def test_map_concat_filter_multiple_functions(self):
-        class TestItemParser(NameItemParser):
+        class TestItemLoader(NameItemLoader):
             name_in = ApplyConcat(lambda v: v.title(), lambda v: v[:-1])

-        ip = TestItemParser()
+        ip = TestItemLoader()
         ip.add_value('name', u'marta')
         self.assertEqual(ip.get_output_value('name'), [u'Mart'])
         item = ip.populate_item()
         self.assertEqual(item['name'], [u'Mart'])

-    def test_default_input_parser(self):
-        ip = DefaultedItemParser()
+    def test_default_input_processor(self):
+        ip = DefaultedItemLoader()
         ip.add_value('name', u'marta')
         self.assertEqual(ip.get_output_value('name'), [u'mart'])

-    def test_inherited_default_input_parser(self):
-        class InheritDefaultedItemParser(DefaultedItemParser):
+    def test_inherited_default_input_processor(self):
+        class InheritDefaultedItemLoader(DefaultedItemLoader):
             pass

-        ip = InheritDefaultedItemParser()
+        ip = InheritDefaultedItemLoader()
         ip.add_value('name', u'marta')
         self.assertEqual(ip.get_output_value('name'), [u'mart'])

-    def test_input_parser_inheritance(self):
-        class ChildItemParser(TestItemParser):
+    def test_input_processor_inheritance(self):
+        class ChildItemLoader(TestItemLoader):
             url_in = ApplyConcat(lambda v: v.lower())

-        ip = ChildItemParser()
+        ip = ChildItemLoader()
         ip.add_value('url', u'HTTP://scrapy.ORG')
         self.assertEqual(ip.get_output_value('url'), [u'http://scrapy.org'])
         ip.add_value('name', u'marta')
         self.assertEqual(ip.get_output_value('name'), [u'Marta'])

-        class ChildChildItemParser(ChildItemParser):
+        class ChildChildItemLoader(ChildItemLoader):
             url_in = ApplyConcat(lambda v: v.upper())
             summary_in = ApplyConcat(lambda v: v)

-        ip = ChildChildItemParser()
+        ip = ChildChildItemLoader()
         ip.add_value('url', u'http://scrapy.org')
         self.assertEqual(ip.get_output_value('url'), [u'HTTP://SCRAPY.ORG'])
         ip.add_value('name', u'marta')
         self.assertEqual(ip.get_output_value('name'), [u'Marta'])

     def test_empty_map_concat(self):
-        class IdentityDefaultedItemParser(DefaultedItemParser):
+        class IdentityDefaultedItemLoader(DefaultedItemLoader):
             name_in = ApplyConcat()

-        ip = IdentityDefaultedItemParser()
+        ip = IdentityDefaultedItemLoader()
         ip.add_value('name', u'marta')
         self.assertEqual(ip.get_output_value('name'), [u'marta'])

-    def test_identity_input_parser(self):
-        class IdentityDefaultedItemParser(DefaultedItemParser):
+    def test_identity_input_processor(self):
+        class IdentityDefaultedItemLoader(DefaultedItemLoader):
             name_in = Identity()

-        ip = IdentityDefaultedItemParser()
+        ip = IdentityDefaultedItemLoader()
         ip.add_value('name', u'marta')
         self.assertEqual(ip.get_output_value('name'), [u'marta'])

-    def test_extend_custom_input_parsers(self):
-        class ChildItemParser(TestItemParser):
-            name_in = ApplyConcat(TestItemParser.name_in, unicode.swapcase)
+    def test_extend_custom_input_processors(self):
+        class ChildItemLoader(TestItemLoader):
+            name_in = ApplyConcat(TestItemLoader.name_in, unicode.swapcase)

-        ip = ChildItemParser()
+        ip = ChildItemLoader()
         ip.add_value('name', u'marta')
         self.assertEqual(ip.get_output_value('name'), [u'mARTA'])

-    def test_extend_default_input_parsers(self):
-        class ChildDefaultedItemParser(DefaultedItemParser):
-            name_in = ApplyConcat(DefaultedItemParser.default_input_parser, unicode.swapcase)
+    def test_extend_default_input_processors(self):
+        class ChildDefaultedItemLoader(DefaultedItemLoader):
+            name_in = ApplyConcat(DefaultedItemLoader.default_input_processor, unicode.swapcase)

-        ip = ChildDefaultedItemParser()
+        ip = ChildDefaultedItemLoader()
         ip.add_value('name', u'marta')
         self.assertEqual(ip.get_output_value('name'), [u'MART'])

-    def test_output_parser_using_function(self):
-        ip = TestItemParser()
+    def test_output_processor_using_function(self):
+        ip = TestItemLoader()
         ip.add_value('name', [u'mar', u'ta'])
         self.assertEqual(ip.get_output_value('name'), [u'Mar', u'Ta'])

-        class TakeFirstItemParser(TestItemParser):
+        class TakeFirstItemLoader(TestItemLoader):
             name_out = u" ".join

-        ip = TakeFirstItemParser()
+        ip = TakeFirstItemLoader()
         ip.add_value('name', [u'mar', u'ta'])
         self.assertEqual(ip.get_output_value('name'), u'Mar Ta')

-    def test_output_parser_using_classes(self):
-        ip = TestItemParser()
+    def test_output_processor_using_classes(self):
+        ip = TestItemLoader()
         ip.add_value('name', [u'mar', u'ta'])
         self.assertEqual(ip.get_output_value('name'), [u'Mar', u'Ta'])

-        class TakeFirstItemParser(TestItemParser):
+        class TakeFirstItemLoader(TestItemLoader):
             name_out = Join()

-        ip = TakeFirstItemParser()
+        ip = TakeFirstItemLoader()
         ip.add_value('name', [u'mar', u'ta'])
         self.assertEqual(ip.get_output_value('name'), u'Mar Ta')

-        class TakeFirstItemParser(TestItemParser):
+        class TakeFirstItemLoader(TestItemLoader):
             name_out = Join("<br>")

-        ip = TakeFirstItemParser()
+        ip = TakeFirstItemLoader()
         ip.add_value('name', [u'mar', u'ta'])
         self.assertEqual(ip.get_output_value('name'), u'Mar<br>Ta')

-    def test_default_output_parser(self):
-        ip = TestItemParser()
+    def test_default_output_processor(self):
+        ip = TestItemLoader()
         ip.add_value('name', [u'mar', u'ta'])
         self.assertEqual(ip.get_output_value('name'), [u'Mar', u'Ta'])

-        class LalaItemParser(TestItemParser):
-            default_output_parser = Identity()
+        class LalaItemLoader(TestItemLoader):
+            default_output_processor = Identity()

-        ip = LalaItemParser()
+        ip = LalaItemLoader()
         ip.add_value('name', [u'mar', u'ta'])
         self.assertEqual(ip.get_output_value('name'), [u'Mar', u'Ta'])

-    def test_parser_context_on_declaration(self):
-        class ChildItemParser(TestItemParser):
-            url_in = ApplyConcat(parser_with_args, key=u'val')
+    def test_loader_context_on_declaration(self):
+        class ChildItemLoader(TestItemLoader):
+            url_in = ApplyConcat(processor_with_args, key=u'val')

-        ip = ChildItemParser()
+        ip = ChildItemLoader()
         ip.add_value('url', u'text')
         self.assertEqual(ip.get_output_value('url'), ['val'])
         ip.replace_value('url', u'text2')
         self.assertEqual(ip.get_output_value('url'), ['val'])

-    def test_parser_context_on_instantiation(self):
-        class ChildItemParser(TestItemParser):
-            url_in = ApplyConcat(parser_with_args)
+    def test_loader_context_on_instantiation(self):
+        class ChildItemLoader(TestItemLoader):
+            url_in = ApplyConcat(processor_with_args)

-        ip = ChildItemParser(key=u'val')
+        ip = ChildItemLoader(key=u'val')
         ip.add_value('url', u'text')
         self.assertEqual(ip.get_output_value('url'), ['val'])
         ip.replace_value('url', u'text2')
         self.assertEqual(ip.get_output_value('url'), ['val'])

-    def test_parser_context_on_assign(self):
-        class ChildItemParser(TestItemParser):
-            url_in = ApplyConcat(parser_with_args)
+    def test_loader_context_on_assign(self):
+        class ChildItemLoader(TestItemLoader):
+            url_in = ApplyConcat(processor_with_args)

-        ip = ChildItemParser()
+        ip = ChildItemLoader()
         ip.context['key'] = u'val'
         ip.add_value('url', u'text')
         self.assertEqual(ip.get_output_value('url'), ['val'])
         ip.replace_value('url', u'text2')
         self.assertEqual(ip.get_output_value('url'), ['val'])

-    def test_item_passed_to_input_parser_functions(self):
-        def parser(value, parser_context):
-            return parser_context['item']['name']
+    def test_item_passed_to_input_processor_functions(self):
+        def processor(value, loader_context):
+            return loader_context['item']['name']

-        class ChildItemParser(TestItemParser):
-            url_in = ApplyConcat(parser)
+        class ChildItemLoader(TestItemLoader):
+            url_in = ApplyConcat(processor)

         it = TestItem(name='marta')
-        ip = ChildItemParser(item=it)
+        ip = ChildItemLoader(item=it)
         ip.add_value('url', u'text')
         self.assertEqual(ip.get_output_value('url'), ['marta'])
         ip.replace_value('url', u'text2')
         self.assertEqual(ip.get_output_value('url'), ['marta'])

     def test_add_value_on_unknown_field(self):
-        ip = TestItemParser()
+        ip = TestItemLoader()
         self.assertRaises(KeyError, ip.add_value, 'wrong_field', [u'lala', u'lolo'])


-class TestXPathItemParser(XPathItemParser):
+class TestXPathItemLoader(XPathItemLoader):
     default_item_class = TestItem
     name_in = ApplyConcat(lambda v: v.title())

-class XPathItemParserTest(unittest.TestCase):
+class XPathItemLoaderTest(unittest.TestCase):

     def test_constructor_errors(self):
-        self.assertRaises(RuntimeError, XPathItemParser)
+        self.assertRaises(RuntimeError, XPathItemLoader)

     def test_constructor_with_selector(self):
         sel = HtmlXPathSelector(text=u"<html><body><div>marta</div></body></html>")
-        l = TestXPathItemParser(selector=sel)
+        l = TestXPathItemLoader(selector=sel)
|
||||
self.assert_(l.selector is sel)
|
||||
l.add_xpath('name', '//div/text()')
|
||||
self.assertEqual(l.get_output_value('name'), [u'Marta'])
|
||||
|
||||
def test_constructor_with_response(self):
|
||||
response = HtmlResponse(url="", body="<html><body><div>marta</div></body></html>")
|
||||
l = TestXPathItemParser(response=response)
|
||||
l = TestXPathItemLoader(response=response)
|
||||
self.assert_(l.selector)
|
||||
l.add_xpath('name', '//div/text()')
|
||||
self.assertEqual(l.get_output_value('name'), [u'Marta'])
|
||||
|
||||
def test_add_xpath_re(self):
|
||||
response = HtmlResponse(url="", body="<html><body><div>marta</div></body></html>")
|
||||
l = TestXPathItemParser(response=response)
|
||||
l = TestXPathItemLoader(response=response)
|
||||
l.add_xpath('name', '//div/text()', re='ma')
|
||||
self.assertEqual(l.get_output_value('name'), [u'Ma'])
|
||||
|