mirror of https://github.com/scrapy/scrapy.git synced 2025-02-23 18:44:05 +00:00

Moved Item Loader to its final location in scrapy.contrib.loader, and updated doc/tests

--HG--
rename : docs/experimental/itemparser.rst => docs/experimental/loaders.rst
rename : scrapy/contrib/itemparser/__init__.py => scrapy/contrib/loader/__init__.py
rename : scrapy/contrib/itemparser/common.py => scrapy/contrib/loader/common.py
rename : scrapy/contrib/itemparser/parsers.py => scrapy/contrib/loader/processor.py
rename : scrapy/tests/test_itemparser.py => scrapy/tests/test_contrib_loader.py
Pablo Hoffman 2009-08-12 16:49:07 -03:00
parent 7cbbc3ffb0
commit 1dc592882b
7 changed files with 661 additions and 660 deletions

docs/experimental/itemparser.rst

@@ -1,526 +0,0 @@
.. _topics-itemparser:
============
Item Parsers
============
.. module:: scrapy.contrib.itemparser
:synopsis: Item Parser class
Item Parsers provide a convenient mechanism for populating scraped :ref:`Items
<topics-newitems>`. Even though Items can be populated using their own
dictionary-like API, the Item Parsers provide a much more convenient API for
populating them from a scraping process, by automating some common tasks like
parsing the raw extracted data before assigning it.
In other words, :ref:`Items <topics-newitems>` provide the *container* of
scraped data, while Item Parsers provide the mechanism for *populating* that
container.
Item Parsers are designed to provide a flexible, efficient and easy mechanism
for extending and overriding different field parsing rules, either by spider,
or by source format (HTML, XML, etc) without becoming a nightmare to maintain.
Using Item Parsers to populate items
====================================
To use an Item Parser, you must first instantiate it. You can either
instantiate it with an Item object or without one, in which case an Item is
automatically instantiated in the Item Parser constructor using the Item class
specified in the :attr:`ItemParser.default_item_class` attribute.
Then, you start collecting values into the Item Parser, typically using
:ref:`XPath Selectors <topics-selectors>`. You can add more than one value to
the same item field, the Item Parser will know how to "join" those values later
using a proper parser function.
Here is a typical Item Parser usage in a :ref:`Spider <topics-spiders>`, using
the :ref:`Product item <topics-newitems-declaring>` declared in the :ref:`Items
chapter <topics-newitems>`::
from scrapy.contrib.itemparser import XPathItemParser
from scrapy.xpath import HtmlXPathSelector
from myproject.items import Product
def parse(self, response):
p = XPathItemParser(item=Product(), response=response)
p.add_xpath('name', '//div[@class="product_name"]')
p.add_xpath('name', '//div[@class="product_title"]')
p.add_xpath('price', '//p[@id="price"]')
p.add_xpath('stock', '//p[@id="stock"]')
p.add_value('last_updated', 'today') # you can also use literal values
return p.populate_item()
By quickly looking at that code we can see the ``name`` field is being
extracted from two different XPath locations in the page:
1. ``//div[@class="product_name"]``
2. ``//div[@class="product_title"]``
In other words, data is being collected by extracting it from two XPath
locations, using the :meth:`~XPathItemParser.add_xpath` method. This is the data
that will be assigned to the ``name`` field later.
Afterwards, similar calls are used for ``price`` and ``stock`` fields, and
finally the ``last_updated`` field is populated directly with a literal value
(``today``) using a different method: :meth:`~ItemParser.add_value`.
Finally, when all data is collected, the :meth:`ItemParser.populate_item`
method is called, which returns the item populated with the data
previously extracted and collected through the
:meth:`~XPathItemParser.add_xpath` and :meth:`~ItemParser.add_value` calls.
.. _topics-itemparser-parsers:
Input and Output parsers
========================
An Item Parser contains one input parser and one output parser for each (item)
field. The input parser processes the extracted data as soon as it's received
(through the :meth:`~XPathItemParser.add_xpath` or
:meth:`~ItemParser.add_value` methods) and the result of the input parser is
collected and kept inside the ItemParser. After collecting all data, the
:meth:`ItemParser.populate_item` method is called to populate and get the
populated :class:`~scrapy.newitem.Item` object. That's when the output parser
is called with the data previously collected (and processed using the input
parser). The result of the output parser is the final value that gets assigned
to the item.
Let's see an example to illustrate how the input and output parsers are
called for a particular field (the same applies to any other field)::
p = XPathItemParser(Product(), some_xpath_selector)
p.add_xpath('name', xpath1) # (1)
p.add_xpath('name', xpath2) # (2)
return p.populate_item() # (3)
So what happens is:
1. Data from ``xpath1`` is extracted, and passed through the *input parser* of
the ``name`` field. The result of the input parser is collected and kept in
the Item Parser (but not yet assigned to the item).
2. Data from ``xpath2`` is extracted, and passed through the same *input
parser* used in (1). The result of the input parser is appended to the data
collected in (1) (if any).
3. The data collected in (1) and (2) is passed through the *output parser* of
the ``name`` field. The result of the output parser is the value assigned to
the ``name`` field in the item.
It's worth noticing that parsers are just callable objects, which are called
with the data to be parsed and return a parsed value. So you can use any
function as an input or output parser, provided it accepts a single
positional (required) argument.
The other thing you need to keep in mind is that the values returned by input
parsers are collected internally (in lists) and then passed to output parsers
to populate the fields, so output parsers should expect iterables as input.
Last, but not least, Scrapy comes with some :ref:`commonly used parsers
<topics-itemparser-available-parsers>` built-in for convenience.
Declaring Item Parsers
======================
Item Parsers are declared like Items, by using a class definition syntax. Here
is an example::
from scrapy.contrib.itemparser import ItemParser
from scrapy.contrib.itemparser.parsers import TakeFirst, ApplyConcat, Join
class ProductParser(ItemParser):
default_input_parser = TakeFirst()
name_in = ApplyConcat(unicode.title)
name_out = Join()
price_in = ApplyConcat(unicode.strip)
price_out = TakeFirst()
# ...
As you can see, input parsers are declared using the ``_in`` suffix while
output parsers are declared using the ``_out`` suffix. You can also declare
default input/output parsers using the
:attr:`ItemParser.default_input_parser` and
:attr:`ItemParser.default_output_parser` attributes.
.. _topics-itemparser-parsers-declaring:
Declaring Input and Output Parsers
==================================
As seen in the previous section, input and output parsers can be declared in
the Item Parser definition, and it's very common to declare input parsers this
way. However, there is one more place where you can specify the input and
output parsers to use: in the :ref:`Item Field <topics-newitems-fields>`
metadata. Here is an example::
from scrapy.newitem import Item, Field
from scrapy.contrib.itemparser.parsers import ApplyConcat, Join, TakeFirst
from scrapy.utils.markup import remove_entities
from myproject.utils import filter_prices
class Product(Item):
name = Field(
input_parser=ApplyConcat(remove_entities),
output_parser=Join(),
)
price = Field(
default=0,
input_parser=ApplyConcat(remove_entities, filter_prices),
output_parser=TakeFirst(),
)
The precedence order, for both input and output parsers, is as follows:
1. Item Parser field-specific attributes: ``field_in`` and ``field_out``
(highest precedence)
2. Field metadata (``input_parser`` and ``output_parser`` key)
3. Item Parser defaults: :attr:`ItemParser.default_input_parser` and
:attr:`ItemParser.default_output_parser` (lowest precedence)
See also: :ref:`topics-itemparser-extending`.
.. _topics-itemparser-context:
Item Parser Context
===================
The Item Parser Context is a dict of arbitrary key/values which is shared among
all input and output parsers in the Item Parser. It can be passed when
declaring, instantiating or using an Item Parser, and it is used to modify the
behaviour of the input/output parsers.
For example, suppose you have a function ``parse_length`` which receives a text
value and extracts a length from it::
def parse_length(text, parser_context):
unit = parser_context.get('unit', 'm')
# ... length parsing code goes here ...
return parsed_length
By accepting a ``parser_context`` argument, the function is explicitly telling
the Item Parser that it is able to receive an Item Parser context, so the Item
Parser passes the currently active context when calling it, and the parser
function (``parse_length`` in this case) can thus use it.
There are several ways to modify Item Parser context values:
1. By modifying the currently active Item Parser context
(:attr:`ItemParser.context` attribute)::
parser = ItemParser(product)
parser.context['unit'] = 'cm'
2. On Item Parser instantiation (the keyword arguments of the Item Parser
constructor are stored in the Item Parser context)::
p = ItemParser(product, unit='cm')
3. On Item Parser declaration, for those input/output parsers that support
being instantiated with an Item Parser context. :class:`ApplyConcat` is one of
them::
class ProductParser(ItemParser):
length_out = ApplyConcat(parse_length, unit='cm')
ItemParser objects
==================
.. class:: ItemParser([item], \**kwargs)
Return a new Item Parser for populating the given Item. If no item is
given, one is instantiated automatically using the class in
:attr:`default_item_class`.
The item and the remaining keyword arguments are assigned to the Parser
context (accessible through the :attr:`context` attribute).
.. method:: add_value(field_name, value)
Add the given ``value`` for the given field.
The value is passed through the :ref:`field input parser
<topics-itemparser-parsers>` and its result is appended to the data
collected for that field. If the field already contains collected data,
the new data is added.
Examples::
parser.add_value('name', u'Color TV')
parser.add_value('colours', [u'white', u'blue'])
parser.add_value('length', u'100', default_unit='cm')
.. method:: replace_value(field_name, value)
Similar to :meth:`add_value` but replaces the collected data with the
new value instead of adding it.
.. method:: populate_item()
Populate the item with the data collected so far, and return it. The
data collected is first passed through the :ref:`field output parsers
<topics-itemparser-parsers>` to get the final value to assign to each
item field.
.. method:: get_collected_values(field_name)
Return the collected values for the given field.
.. method:: get_output_value(field_name)
Return the collected values parsed using the output parser, for the
given field. This method doesn't populate or modify the item at all.
.. method:: get_input_parser(field_name)
Return the input parser for the given field.
.. method:: get_output_parser(field_name)
Return the output parser for the given field.
.. attribute:: item
The :class:`~scrapy.newitem.Item` object being parsed by this Item
Parser.
.. attribute:: context
The currently active :ref:`Context <topics-itemparser-context>` of this
Item Parser.
.. attribute:: default_item_class
An Item class (or factory), used to instantiate items when not given in
the constructor.
.. attribute:: default_input_parser
The default input parser to use for those fields which don't specify
one.
.. attribute:: default_output_parser
The default output parser to use for those fields which don't specify
one.
.. class:: XPathItemParser([item, selector, response], \**kwargs)
The :class:`XPathItemParser` class extends the :class:`ItemParser` class
providing more convenient mechanisms for extracting data from web pages
using :ref:`XPath selectors <topics-selectors>`.
:class:`XPathItemParser` objects accept two additional parameters in
their constructor:
:param selector: The selector to extract data from, when using the
:meth:`add_xpath` or :meth:`replace_xpath` method.
:type selector: :class:`~scrapy.xpath.XPathSelector` object
:param response: The response used to construct the selector using the
:attr:`default_selector_class`, unless the selector argument is given,
in which case this argument is ignored.
:type response: :class:`~scrapy.http.Response` object
.. method:: add_xpath(field_name, xpath, re=None)
Similar to :meth:`ItemParser.add_value` but receives an XPath instead of a
value, which is used to extract a list of unicode strings from the
selector associated with this :class:`XPathItemParser`. If the ``re``
argument is given, it's used for extracting data from the selector using
the :meth:`~scrapy.xpath.XPathSelector.re` method.
:param xpath: the XPath to extract data from
:type xpath: str
:param re: a regular expression to use for extracting data from the
selected XPath region
:type re: str or compiled regex
Examples::
# HTML snippet: <p class="product-name">Color TV</p>
parser.add_xpath('name', '//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
parser.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')
.. method:: replace_xpath(field_name, xpath, re=None)
Similar to :meth:`add_xpath` but replaces collected data instead of
adding it.
.. attribute:: default_selector_class
The class used to construct the :attr:`selector` of this
:class:`XPathItemParser`, if only a response is given in the constructor.
If a selector is given in the constructor this attribute is ignored.
This attribute is sometimes overridden in subclasses.
.. attribute:: selector
The :class:`~scrapy.xpath.XPathSelector` object to extract data from.
It's either the selector given in the constructor or one created from
the response given in the constructor using the
:attr:`default_selector_class`. This attribute is meant to be
read-only.
.. _topics-itemparser-extending:
Reusing and extending Item Parsers
==================================
As your project grows bigger and acquires more and more spiders, maintenance
becomes a fundamental problem, especially when you have to deal with many
different parsing rules for each spider, with a lot of exceptions, while also
wanting to reuse the common parsers.
Item Parsers are designed to ease the maintenance burden of parsing rules,
without losing flexibility and, at the same time, providing a convenient
mechanism for extending and overriding them. For this reason Item Parsers
support traditional Python class inheritance for dealing with differences of
specific spiders (or groups of spiders).
Suppose, for example, that some particular site encloses its product names in
three dashes (e.g. ``---Plasma TV---``) and you don't want to end up scraping
those dashes in the final product names.
Here's how you can remove those dashes by reusing and extending the default
Product Item Parser (``ProductParser``)::
from scrapy.contrib.itemparser.parsers import ApplyConcat
from myproject.itemparsers import ProductParser
def strip_dashes(x):
return x.strip('-')
class SiteSpecificParser(ProductParser):
name_in = ApplyConcat(ProductParser.name_in, strip_dashes)
Another case where extending Item Parsers can be very helpful is when you have
multiple source formats, for example XML and HTML. In the XML version you may
want to remove ``CDATA`` occurrences. Here's an example of how to do it::
from scrapy.contrib.itemparser.parsers import ApplyConcat
from myproject.itemparsers import ProductParser
from myproject.utils.xml import remove_cdata
class XmlProductParser(ProductParser):
name_in = ApplyConcat(remove_cdata, ProductParser.name_in)
And that's how you typically extend input parsers.
As for output parsers, it is more common to declare them in the field metadata,
as they usually depend only on the field and not on each specific site parsing
rule (as input parsers do). See also:
:ref:`topics-itemparser-parsers-declaring`.
There are many other possible ways to extend, inherit and override your Item
Parsers, and different Item Parser hierarchies may fit better for different
projects. Scrapy only provides the mechanism; it doesn't impose any specific
organization of your Parsers collection - that's up to you and your project
needs.
.. _topics-itemparser-available-parsers:
Available built-in parsers
==========================
Even though you can use any callable function as input and output parsers,
Scrapy provides some commonly used parsers, which are described below. Some of
them, like :class:`ApplyConcat` (which is typically used as an input parser),
compose the output of several functions executed in order, to produce the
final parsed value.
Here is a list of all built-in parsers:
.. _topics-itemparser-Applyconcat:
ApplyConcat parser
------------------
The ApplyConcat parser is the recommended parser to use when you want to
chain the processing of several functions in a pipeline.
.. module:: scrapy.contrib.itemparser.parsers
:synopsis: Parser functions to use with Item Parsers
.. class:: ApplyConcat(\*functions, \**default_parser_context)
A parser which applies the given functions consecutively, in order,
concatenating their results before each next function call. Each function
returns a list of values (though it may also return ``None`` or a single
value), and the next function is called once for each of those values,
receiving one of them as input each time. The output of each
function call (for each input value) is concatenated, and each value of the
concatenation is used to call the next function; the process repeats
until there are no functions left.
Each function can optionally receive a ``parser_context`` parameter, which
will contain the currently active :ref:`Item Parser context
<topics-itemparser-context>`.
The keyword arguments passed in the constructor are used as the default
Item Parser context values passed on each function call. However, the final
Item Parser context values passed to functions are overridden with the
currently active Item Parser context, accessible through the
:attr:`ItemParser.context` attribute.
Example::
>>> def filter_world(x):
... return None if x == 'world' else x
...
>>> from scrapy.contrib.itemparser.parsers import ApplyConcat
>>> parser = ApplyConcat(filter_world, str.upper)
>>> parser(['hello', 'world', 'this', 'is', 'scrapy'])
['HELLO', 'THIS', 'IS', 'SCRAPY']
.. class:: TakeFirst
Return the first non-null, non-empty value from the values received, so it's
typically used as an output parser for single-valued fields. It doesn't receive
any constructor arguments, nor does it accept an Item Parser context.
Example::
>>> from scrapy.contrib.itemparser.parsers import TakeFirst
>>> parser = TakeFirst()
>>> parser(['', 'one', 'two', 'three'])
'one'
.. class:: Identity
Return the original values unchanged. It doesn't receive any constructor
arguments, nor does it accept an Item Parser context.
Example::
>>> from scrapy.contrib.itemparser.parsers import Identity
>>> parser = Identity()
>>> parser(['one', 'two', 'three'])
['one', 'two', 'three']
.. class:: Join(separator=u' ')
Return the values joined with the separator given in the constructor, which
defaults to ``u' '``. It doesn't accept an Item Parser context.
When using the default separator, this parser is equivalent to the
function: ``u' '.join``
Examples::
>>> from scrapy.contrib.itemparser.parsers import Join
>>> parser = Join()
>>> parser(['one', 'two', 'three'])
u'one two three'
>>> parser = Join('<br>')
>>> parser(['one', 'two', 'three'])
u'one<br>two<br>three'

docs/experimental/loaders.rst

@@ -0,0 +1,527 @@
.. _topics-loaders:
============
Item Loaders
============
.. module:: scrapy.contrib.loader
:synopsis: Item Loader class
Item Loaders provide a convenient mechanism for populating scraped :ref:`Items
<topics-newitems>`. Even though Items can be populated using their own
dictionary-like API, the Item Loaders provide a much more convenient API for
populating them from a scraping process, by automating some common tasks like
parsing the raw extracted data before assigning it.
In other words, :ref:`Items <topics-newitems>` provide the *container* of
scraped data, while Item Loaders provide the mechanism for *populating* that
container.
Item Loaders are designed to provide a flexible, efficient and easy mechanism
for extending and overriding different field parsing rules, either by spider,
or by source format (HTML, XML, etc) without becoming a nightmare to maintain.
Using Item Loaders to populate items
====================================
To use an Item Loader, you must first instantiate it. You can either
instantiate it with an Item object or without one, in which case an Item is
automatically instantiated in the Item Loader constructor using the Item class
specified in the :attr:`ItemLoader.default_item_class` attribute.
Then, you start collecting values into the Item Loader, typically using
:ref:`XPath Selectors <topics-selectors>`. You can add more than one value to
the same item field, the Item Loader will know how to "join" those values later
using a proper processing function.
Here is a typical Item Loader usage in a :ref:`Spider <topics-spiders>`, using
the :ref:`Product item <topics-newitems-declaring>` declared in the :ref:`Items
chapter <topics-newitems>`::
from scrapy.contrib.loader import XPathItemLoader
from scrapy.xpath import HtmlXPathSelector
from myproject.items import Product
def parse(self, response):
p = XPathItemLoader(item=Product(), response=response)
p.add_xpath('name', '//div[@class="product_name"]')
p.add_xpath('name', '//div[@class="product_title"]')
p.add_xpath('price', '//p[@id="price"]')
p.add_xpath('stock', '//p[@id="stock"]')
p.add_value('last_updated', 'today') # you can also use literal values
return p.populate_item()
By quickly looking at that code we can see the ``name`` field is being
extracted from two different XPath locations in the page:
1. ``//div[@class="product_name"]``
2. ``//div[@class="product_title"]``
In other words, data is being collected by extracting it from two XPath
locations, using the :meth:`~XPathItemLoader.add_xpath` method. This is the data
that will be assigned to the ``name`` field later.
Afterwards, similar calls are used for ``price`` and ``stock`` fields, and
finally the ``last_updated`` field is populated directly with a literal value
(``today``) using a different method: :meth:`~ItemLoader.add_value`.
Finally, when all data is collected, the :meth:`ItemLoader.populate_item`
method is called, which returns the item populated with the data
previously extracted and collected through the
:meth:`~XPathItemLoader.add_xpath` and :meth:`~ItemLoader.add_value` calls.
.. _topics-loaders-processors:
Input and Output processors
===========================
An Item Loader contains one input processor and one output processor for each
(item) field. The input processor processes the extracted data as soon as it's
received (through the :meth:`~XPathItemLoader.add_xpath` or
:meth:`~ItemLoader.add_value` methods) and the result of the input processor is
collected and kept inside the ItemLoader. After collecting all data, the
:meth:`ItemLoader.populate_item` method is called to populate and get the
populated :class:`~scrapy.newitem.Item` object. That's when the output processor
is called with the data previously collected (and processed using the input
processor). The result of the output processor is the final value that gets assigned
to the item.
Let's see an example to illustrate how the input and output processors are
called for a particular field (the same applies to any other field)::
p = XPathItemLoader(Product(), some_xpath_selector)
p.add_xpath('name', xpath1) # (1)
p.add_xpath('name', xpath2) # (2)
return p.populate_item() # (3)
So what happens is:
1. Data from ``xpath1`` is extracted, and passed through the *input processor* of
the ``name`` field. The result of the input processor is collected and kept in
the Item Loader (but not yet assigned to the item).
2. Data from ``xpath2`` is extracted, and passed through the same *input
processor* used in (1). The result of the input processor is appended to the
data collected in (1) (if any).
3. The data collected in (1) and (2) is passed through the *output processor* of
the ``name`` field. The result of the output processor is the value assigned to
the ``name`` field in the item.
It's worth noticing that processors are just callable objects, which are called
with the data to be parsed and return a parsed value. So you can use any
function as an input or output processor, provided it accepts a single
positional (required) argument.
The other thing you need to keep in mind is that the values returned by input
processors are collected internally (in lists) and then passed to output
processors to populate the fields, so output processors should expect iterables as
input.
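For instance, since output processors receive the whole list of collected
values, a plain function can serve as one. A minimal sketch (``TitleItem``,
``first_title`` and ``TitleLoader`` are hypothetical names, not part of
Scrapy)::

    from scrapy.newitem import Item, Field
    from scrapy.contrib.loader import ItemLoader

    class TitleItem(Item):
        title = Field()

    def first_title(values):
        # called with the list of collected (input-processed) values
        return values[0].strip() if values else None

    class TitleLoader(ItemLoader):
        default_item_class = TitleItem
        title_out = first_title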
Last, but not least, Scrapy comes with some :ref:`commonly used processors
<topics-loaders-available-processors>` built-in for convenience.
Declaring Item Loaders
======================
Item Loaders are declared like Items, by using a class definition syntax. Here
is an example::
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst, ApplyConcat, Join
class ProductLoader(ItemLoader):
default_input_processor = TakeFirst()
name_in = ApplyConcat(unicode.title)
name_out = Join()
price_in = ApplyConcat(unicode.strip)
price_out = TakeFirst()
# ...
As you can see, input processors are declared using the ``_in`` suffix while
output processors are declared using the ``_out`` suffix. You can also
declare default input/output processors using the
:attr:`ItemLoader.default_input_processor` and
:attr:`ItemLoader.default_output_processor` attributes.
.. _topics-loaders-processors-declaring:
Declaring Input and Output Processors
=====================================
As seen in the previous section, input and output processors can be declared in
the Item Loader definition, and it's very common to declare input processors
this way. However, there is one more place where you can specify the input and
output processors to use: in the :ref:`Item Field <topics-newitems-fields>`
metadata. Here is an example::
from scrapy.newitem import Item, Field
from scrapy.contrib.loader.processor import ApplyConcat, Join, TakeFirst
from scrapy.utils.markup import remove_entities
from myproject.utils import filter_prices
class Product(Item):
name = Field(
input_processor=ApplyConcat(remove_entities),
output_processor=Join(),
)
price = Field(
default=0,
input_processor=ApplyConcat(remove_entities, filter_prices),
output_processor=TakeFirst(),
)
The precedence order, for both input and output processors, is as follows (a
short sketch follows the list):
1. Item Loader field-specific attributes: ``field_in`` and ``field_out``
(highest precedence)
2. Field metadata (``input_processor`` and ``output_processor`` key)
3. Item Loader defaults: :attr:`ItemLoader.default_input_processor` and
:attr:`ItemLoader.default_output_processor` (lowest precedence)
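To make the precedence concrete, a hedged sketch: if the ``Product`` item
above declares ``input_processor`` metadata for ``name`` and a loader
subclass also declares ``name_in``, the loader attribute wins::

    class StrictProductLoader(ItemLoader):
        # rule 1 beats rule 2: overrides Product.name's input_processor
        name_in = ApplyConcat(unicode.strip)

Fields with neither a ``field_in``/``field_out`` attribute nor metadata keys
fall back to the loader defaults (rule 3).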
See also: :ref:`topics-loaders-extending`.
.. _topics-loaders-context:
Item Loader Context
===================
The Item Loader Context is a dict of arbitrary key/values which is shared among
all input and output processors in the Item Loader. It can be passed when
declaring, instantiating or using an Item Loader, and it is used to modify the
behaviour of the input/output processors.
For example, suppose you have a function ``parse_length`` which receives a text
value and extracts a length from it::
def parse_length(text, loader_context):
unit = loader_context.get('unit', 'm')
# ... length parsing code goes here ...
return parsed_length
By accepting a ``loader_context`` argument, the function is explicitly telling
the Item Loader that it is able to receive an Item Loader context, so the Item
Loader passes the currently active context when calling it, and the processor
function (``parse_length`` in this case) can thus use it.
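A complete version might look like this minimal sketch (the unit table is
illustrative, not part of Scrapy)::

    def parse_length(text, loader_context):
        unit = loader_context.get('unit', 'm')
        factors = {'mm': 0.001, 'cm': 0.01, 'm': 1.0}  # assumed units
        # e.g. u'100' with unit='cm' -> 1.0 (metres)
        return float(text) * factors[unit]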
There are several ways to modify Item Loader context values:
1. By modifying the currently active Item Loader context
(:attr:`ItemLoader.context` attribute)::
loader = ItemLoader(product)
loader.context['unit'] = 'cm'
2. On Item Loader instantiation (the keyword arguments of the Item Loader
constructor are stored in the Item Loader context)::
p = ItemLoader(product, unit='cm')
3. On Item Loader declaration, for those input/output processors that support
being instantiated with an Item Loader context. :class:`ApplyConcat` is one of
them::
class ProductLoader(ItemLoader):
length_out = ApplyConcat(parse_length, unit='cm')
ItemLoader objects
==================
.. class:: ItemLoader([item], \**kwargs)
Return a new Item Loader for populating the given Item. If no item is
given, one is instantiated automatically using the class in
:attr:`default_item_class`.
The item and the remaining keyword arguments are assigned to the Loader
context (accessible through the :attr:`context` attribute).
.. method:: add_value(field_name, value)
Add the given ``value`` for the given field.
The value is passed through the :ref:`field input processor
<topics-loaders-processors>` and its result is appended to the data
collected for that field. If the field already contains collected data,
the new data is added.
Examples::
loader.add_value('name', u'Color TV')
loader.add_value('colours', [u'white', u'blue'])
loader.add_value('length', u'100', default_unit='cm')
.. method:: replace_value(field_name, value)
Similar to :meth:`add_value` but replaces the collected data with the
new value instead of adding it.
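For example::

    loader.add_value('name', u'Color TV')
    loader.replace_value('name', u'LCD TV')
    # the data collected for 'name' now comes from u'LCD TV' only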
.. method:: populate_item()
Populate the item with the data collected so far, and return it. The
data collected is first passed through the :ref:`field output processors
<topics-loaders-processors>` to get the final value to assign to each
item field.
.. method:: get_collected_values(field_name)
Return the collected values for the given field.
.. method:: get_output_value(field_name)
Return the collected values parsed using the output processor, for the
given field. This method doesn't populate or modify the item at all.
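For instance, assuming a loader whose ``name`` input processor is the
identity and whose output processor is ``Join()``, a short sketch::

    loader.add_value('name', u'Color')
    loader.add_value('name', u'TV')
    loader.get_collected_values('name')  # [u'Color', u'TV']
    loader.get_output_value('name')      # u'Color TV'; the item is untouched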
.. method:: get_input_processor(field_name)
Return the input processor for the given field.
.. method:: get_output_processor(field_name)
Return the output processor for the given field.
.. attribute:: item
The :class:`~scrapy.newitem.Item` object being parsed by this Item
Loader.
.. attribute:: context
The currently active :ref:`Context <topics-loaders-context>` of this
Item Loader.
.. attribute:: default_item_class
An Item class (or factory), used to instantiate items when not given in
the constructor.
.. attribute:: default_input_processor
The default input processor to use for those fields which don't specify
one.
.. attribute:: default_output_processor
The default output processor to use for those fields which don't specify
one.
.. class:: XPathItemLoader([item, selector, response], \**kwargs)
The :class:`XPathItemLoader` class extends the :class:`ItemLoader` class
providing more convenient mechanisms for extracting data from web pages
using :ref:`XPath selectors <topics-selectors>`.
:class:`XPathItemLoader` objects accept two additional parameters in
their constructor:
:param selector: The selector to extract data from, when using the
:meth:`add_xpath` or :meth:`replace_xpath` method.
:type selector: :class:`~scrapy.xpath.XPathSelector` object
:param response: The response used to construct the selector using the
:attr:`default_selector_class`, unless the selector argument is given,
in which case this argument is ignored.
:type response: :class:`~scrapy.http.Response` object
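Both construction modes, sketched (``response`` is assumed to be an HTML
:class:`~scrapy.http.Response`)::

    from scrapy.xpath import HtmlXPathSelector
    from scrapy.contrib.loader import XPathItemLoader

    loader = XPathItemLoader(item=Product(), response=response)
    # or, equivalently, with a pre-built selector:
    loader = XPathItemLoader(item=Product(),
                             selector=HtmlXPathSelector(response))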
.. method:: add_xpath(field_name, xpath, re=None)
Similar to :meth:`ItemLoader.add_value` but receives an XPath instead of a
value, which is used to extract a list of unicode strings from the
selector associated with this :class:`XPathItemLoader`. If the ``re``
argument is given, it's used for extracting data from the selector using
the :meth:`~scrapy.xpath.XPathSelector.re` method.
:param xpath: the XPath to extract data from
:type xpath: str
:param re: a regular expression to use for extracting data from the
selected XPath region
:type re: str or compiled regex
Examples::
# HTML snippet: <p class="product-name">Color TV</p>
loader.add_xpath('name', '//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')
.. method:: replace_xpath(field_name, xpath, re=None)
Similar to :meth:`add_xpath` but replaces collected data instead of
adding it.
.. attribute:: default_selector_class
The class used to construct the :attr:`selector` of this
:class:`XPathItemLoader`, if only a response is given in the constructor.
If a selector is given in the constructor this attribute is ignored.
This attribute is sometimes overridden in subclasses.
.. attribute:: selector
The :class:`~scrapy.xpath.XPathSelector` object to extract data from.
It's either the selector given in the constructor or one created from
the response given in the constructor using the
:attr:`default_selector_class`. This attribute is meant to be
read-only.
.. _topics-loaders-extending:
Reusing and extending Item Loaders
==================================
As your project grows bigger and acquires more and more spiders, maintenance
becomes a fundamental problem, especially when you have to deal with many
different parsing rules for each spider, with a lot of exceptions, while also
wanting to reuse the common processors.
Item Loaders are designed to ease the maintenance burden of parsing rules,
without losing flexibility and, at the same time, providing a convenient
mechanism for extending and overriding them. For this reason Item Loaders
support traditional Python class inheritance for dealing with differences of
specific spiders (or groups of spiders).
Suppose, for example, that some particular site encloses its product names in
three dashes (e.g. ``---Plasma TV---``) and you don't want to end up scraping
those dashes in the final product names.
Here's how you can remove those dashes by reusing and extending the default
Product Item Loader (``ProductLoader``)::
from scrapy.contrib.loader.processor import ApplyConcat
from myproject.loaders import ProductLoader
def strip_dashes(x):
return x.strip('-')
class SiteSpecificLoader(ProductLoader):
name_in = ApplyConcat(ProductLoader.name_in, strip_dashes)
Another case where extending Item Loaders can be very helpful is when you have
multiple source formats, for example XML and HTML. In the XML version you may
want to remove ``CDATA`` occurrences. Here's an example of how to do it::
from scrapy.contrib.loader.processor import ApplyConcat
from myproject.loaders import ProductLoader
from myproject.utils.xml import remove_cdata
class XmlProductLoader(ProductLoader):
name_in = ApplyConcat(remove_cdata, ProductLoader.name_in)
And that's how you typically extend input processors.
As for output processors, it is more common to declare them in the field metadata,
as they usually depend only on the field and not on each specific site parsing
rule (as input processors do). See also:
:ref:`topics-loaders-processors-declaring`.
There are many other possible ways to extend, inherit and override your Item
Loaders, and different Item Loader hierarchies may fit better for different
projects. Scrapy only provides the mechanism; it doesn't impose any specific
organization of your Loaders collection - that's up to you and your project
needs.
.. _topics-loaders-available-processors:
Available built-in processors
=============================
Even though you can use any callable function as input and output processors,
Scrapy provides some commonly used processors, which are described below. Some
of them, like :class:`ApplyConcat` (which is typically used as an input
processor), compose the output of several functions executed in order, to
produce the final parsed value.
Here is a list of all built-in processors:
.. _topics-loaders-applyconcat:
ApplyConcat processor
---------------------
The ApplyConcat processor is the recommended processor to use when you want to
chain the processing of several functions in a pipeline.
.. module:: scrapy.contrib.loader.processor
:synopsis: A collection of processors to use with Item Loaders
.. class:: ApplyConcat(\*functions, \**default_loader_context)
A processor which applies the given functions consecutively, in order,
concatenating their results before each next function call. Each function
returns a list of values (though it may also return ``None`` or a single
value), and the next function is called once for each of those values,
receiving one of them as input each time. The output of each
function call (for each input value) is concatenated, and each value of the
concatenation is used to call the next function; the process repeats
until there are no functions left.
Each function can optionally receive a ``loader_context`` parameter, which
will contain the currently active :ref:`Item Loader context
<topics-loaders-context>`.
The keyword arguments passed in the constructor are used as the default
Item Loader context values passed on each function call. However, the final
Item Loader context values passed to functions are overridden with the
currently active Item Loader context, accessible through the
:attr:`ItemLoader.context` attribute.
Example::
>>> def filter_world(x):
... return None if x == 'world' else x
...
>>> from scrapy.contrib.loader.processor import ApplyConcat
>>> proc = ApplyConcat(filter_world, str.upper)
>>> proc(['hello', 'world', 'this', 'is', 'scrapy'])
['HELLO', 'THIS', 'IS', 'SCRAPY']
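Step by step: ``filter_world`` is called once per input value, and its
``None`` result for ``'world'`` is dropped from the concatenation;
``str.upper`` is then called once per surviving value, producing the final
list.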
.. class:: TakeFirst
Return the first non-null, non-empty value from the values received, so it's
typically used as an output processor for single-valued fields. It doesn't
receive any constructor arguments, nor does it accept an Item Loader context.
Example::
>>> from scrapy.contrib.loader.processor import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'one', 'two', 'three'])
'one'
.. class:: Identity
Return the original values unchanged. It doesn't receive any constructor
arguments, nor does it accept an Item Loader context.
Example::
>>> from scrapy.contrib.loader.processor import Identity
>>> proc = Identity()
>>> proc(['one', 'two', 'three'])
['one', 'two', 'three']
.. class:: Join(separator=u' ')
Return the values joined with the separator given in the constructor, which
defaults to ``u' '``. It doesn't accept an Item Loader context.
When using the default separator, this processor is equivalent to the
function: ``u' '.join``
Examples::
>>> from scrapy.contrib.loader.processor import Join
>>> proc = Join()
>>> proc(['one', 'two', 'three'])
u'one two three'
>>> proc = Join('<br>')
>>> proc(['one', 'two', 'three'])
u'one<br>two<br>three'

scrapy/contrib/itemparser/common.py

@@ -1,13 +0,0 @@
"""Common functions used in Item Parsers code"""
from functools import partial
from scrapy.utils.python import get_func_args
def wrap_parser_context(function, context):
"""Wrap functions that receive parser_context to contain those parser
arguments pre-loaded and expose an interface that receives only one argument
"""
if 'parser_context' in get_func_args(function):
return partial(function, parser_context=context)
else:
return function

scrapy/contrib/loader/__init__.py

@@ -1,7 +1,7 @@
"""
Item Parser
Item Loader
See documentation in docs/topics/itemparser.rst
See documentation in docs/topics/loaders.rst
"""
from collections import defaultdict
@@ -9,14 +9,14 @@ from collections import defaultdict
from scrapy.newitem import Item
from scrapy.xpath import HtmlXPathSelector
from scrapy.utils.misc import arg_to_iter
from .common import wrap_parser_context
from .parsers import Identity
from .common import wrap_loader_context
from .processor import Identity
class ItemParser(object):
class ItemLoader(object):
default_item_class = Item
default_input_parser = Identity()
default_output_parser = Identity()
default_input_processor = Identity()
default_output_processor = Identity()
def __init__(self, item=None, **context):
if item is None:
@@ -40,34 +40,34 @@ class ItemParser(object):
return item
def get_output_value(self, field_name):
parser = self.get_output_parser(field_name)
parser = wrap_parser_context(parser, self.context)
return parser(self._values[field_name])
proc = self.get_output_processor(field_name)
proc = wrap_loader_context(proc, self.context)
return proc(self._values[field_name])
def get_collected_values(self, field_name):
return self._values[field_name]
def get_input_parser(self, field_name):
parser = getattr(self, '%s_in' % field_name, None)
if not parser:
parser = self.item.fields[field_name].get('input_parser', \
self.default_input_parser)
return parser
def get_input_processor(self, field_name):
proc = getattr(self, '%s_in' % field_name, None)
if not proc:
proc = self.item.fields[field_name].get('input_processor', \
self.default_input_processor)
return proc
def get_output_parser(self, field_name):
parser = getattr(self, '%s_out' % field_name, None)
if not parser:
parser = self.item.fields[field_name].get('output_parser', \
self.default_output_parser)
return parser
def get_output_processor(self, field_name):
proc = getattr(self, '%s_out' % field_name, None)
if not proc:
proc = self.item.fields[field_name].get('output_processor', \
self.default_output_processor)
return proc
def _parse_input_value(self, field_name, value):
parser = self.get_input_parser(field_name)
parser = wrap_parser_context(parser, self.context)
return parser(value)
proc = self.get_input_processor(field_name)
proc = wrap_loader_context(proc, self.context)
return proc(value)
class XPathItemParser(ItemParser):
class XPathItemLoader(ItemLoader):
default_selector_class = HtmlXPathSelector
@@ -79,7 +79,7 @@ class XPathItemParser(ItemParser):
selector = self.default_selector_class(response)
self.selector = selector
context.update(selector=selector, response=response)
super(XPathItemParser, self).__init__(item, **context)
super(XPathItemLoader, self).__init__(item, **context)
def add_xpath(self, field_name, xpath, re=None):
self.add_value(field_name, self._get_values(field_name, xpath, re))

scrapy/contrib/loader/common.py

@@ -0,0 +1,13 @@
"""Common functions used in Item Loaders code"""
from functools import partial
from scrapy.utils.python import get_func_args
def wrap_loader_context(function, context):
"""Wrap functions that receive loader_context to contain the context
"pre-loaded" and expose a interface that receives only one argument
"""
if 'loader_context' in get_func_args(function):
return partial(function, loader_context=context)
else:
return function

scrapy/contrib/loader/processor.py

@@ -1,26 +1,26 @@
"""
This module provides some commonly used parser functions for Item Parsers.
This module provides some commonly used processors for Item Loaders.
See documentation in docs/topics/itemparser.rst
See documentation in docs/topics/loaders.rst
"""
from scrapy.utils.misc import arg_to_iter
from scrapy.utils.datatypes import MergeDict
from .common import wrap_parser_context
from .common import wrap_loader_context
class ApplyConcat(object):
def __init__(self, *functions, **default_parser_context):
def __init__(self, *functions, **default_loader_context):
self.functions = functions
self.default_parser_context = default_parser_context
self.default_loader_context = default_loader_context
def __call__(self, value, parser_context=None):
def __call__(self, value, loader_context=None):
values = arg_to_iter(value)
if parser_context:
context = MergeDict(parser_context, self.default_parser_context)
if loader_context:
context = MergeDict(loader_context, self.default_loader_context)
else:
context = self.default_parser_context
wrapped_funcs = [wrap_parser_context(f, context) for f in self.functions]
context = self.default_loader_context
wrapped_funcs = [wrap_loader_context(f, context) for f in self.functions]
for func in wrapped_funcs:
next_values = []
for v in values:

scrapy/tests/test_contrib_loader.py

@@ -1,7 +1,7 @@
import unittest
from scrapy.contrib.itemparser import ItemParser, XPathItemParser
from scrapy.contrib.itemparser.parsers import ApplyConcat, Join, Identity
from scrapy.contrib.loader import ItemLoader, XPathItemLoader
from scrapy.contrib.loader.processor import ApplyConcat, Join, Identity
from scrapy.newitem import Item, Field
from scrapy.xpath import HtmlXPathSelector
from scrapy.http import HtmlResponse
@@ -15,30 +15,30 @@ class TestItem(NameItem):
url = Field()
summary = Field()
# test item parsers
# test item loaders
class NameItemParser(ItemParser):
class NameItemLoader(ItemLoader):
default_item_class = TestItem
class TestItemParser(NameItemParser):
class TestItemLoader(NameItemLoader):
name_in = ApplyConcat(lambda v: v.title())
class DefaultedItemParser(NameItemParser):
default_input_parser = ApplyConcat(lambda v: v[:-1])
class DefaultedItemLoader(NameItemLoader):
default_input_processor = ApplyConcat(lambda v: v[:-1])
# test parsers
# test processors
def parser_with_args(value, other=None, parser_context=None):
if 'key' in parser_context:
return parser_context['key']
def processor_with_args(value, other=None, loader_context=None):
if 'key' in loader_context:
return loader_context['key']
return value
class ItemParserTest(unittest.TestCase):
class ItemLoaderTest(unittest.TestCase):
def test_populate_item_using_default_loader(self):
i = TestItem()
i['summary'] = u'lala'
ip = ItemParser(item=i)
ip = ItemLoader(item=i)
ip.add_value('name', u'marta')
item = ip.populate_item()
assert item is i
@@ -46,13 +46,13 @@ class ItemParserTest(unittest.TestCase):
self.assertEqual(item['name'], [u'marta'])
def test_populate_item_using_custom_loader(self):
ip = TestItemParser()
ip = TestItemLoader()
ip.add_value('name', u'marta')
item = ip.populate_item()
self.assertEqual(item['name'], [u'Marta'])
def test_add_value(self):
ip = TestItemParser()
ip = TestItemLoader()
ip.add_value('name', u'marta')
self.assertEqual(ip.get_collected_values('name'), [u'Marta'])
self.assertEqual(ip.get_output_value('name'), [u'Marta'])
@@ -61,7 +61,7 @@ class ItemParserTest(unittest.TestCase):
self.assertEqual(ip.get_output_value('name'), [u'Marta', u'Pepe'])
def test_replace_value(self):
ip = TestItemParser()
ip = TestItemLoader()
ip.replace_value('name', u'marta')
self.assertEqual(ip.get_collected_values('name'), [u'Marta'])
self.assertEqual(ip.get_output_value('name'), [u'Marta'])
@@ -69,208 +69,208 @@ class ItemParserTest(unittest.TestCase):
self.assertEqual(ip.get_collected_values('name'), [u'Pepe'])
self.assertEqual(ip.get_output_value('name'), [u'Pepe'])
def test_map_concat_filter(self):
def test_apply_concat_filter(self):
def filter_world(x):
return None if x == 'world' else x
parser = ApplyConcat(filter_world, str.upper)
self.assertEqual(parser(['hello', 'world', 'this', 'is', 'scrapy']),
proc = ApplyConcat(filter_world, str.upper)
self.assertEqual(proc(['hello', 'world', 'this', 'is', 'scrapy']),
['HELLO', 'THIS', 'IS', 'SCRAPY'])
def test_map_concat_filter_multiple_functions(self):
class TestItemParser(NameItemParser):
class TestItemLoader(NameItemLoader):
name_in = ApplyConcat(lambda v: v.title(), lambda v: v[:-1])
ip = TestItemParser()
ip = TestItemLoader()
ip.add_value('name', u'marta')
self.assertEqual(ip.get_output_value('name'), [u'Mart'])
item = ip.populate_item()
self.assertEqual(item['name'], [u'Mart'])
def test_default_input_parser(self):
ip = DefaultedItemParser()
def test_default_input_processor(self):
ip = DefaultedItemLoader()
ip.add_value('name', u'marta')
self.assertEqual(ip.get_output_value('name'), [u'mart'])
def test_inherited_default_input_parser(self):
class InheritDefaultedItemParser(DefaultedItemParser):
def test_inherited_default_input_processor(self):
class InheritDefaultedItemLoader(DefaultedItemLoader):
pass
ip = InheritDefaultedItemParser()
ip = InheritDefaultedItemLoader()
ip.add_value('name', u'marta')
self.assertEqual(ip.get_output_value('name'), [u'mart'])
def test_input_parser_inheritance(self):
class ChildItemParser(TestItemParser):
def test_input_processor_inheritance(self):
class ChildItemLoader(TestItemLoader):
url_in = ApplyConcat(lambda v: v.lower())
ip = ChildItemParser()
ip = ChildItemLoader()
ip.add_value('url', u'HTTP://scrapy.ORG')
self.assertEqual(ip.get_output_value('url'), [u'http://scrapy.org'])
ip.add_value('name', u'marta')
self.assertEqual(ip.get_output_value('name'), [u'Marta'])
class ChildChildItemParser(ChildItemParser):
class ChildChildItemLoader(ChildItemLoader):
url_in = ApplyConcat(lambda v: v.upper())
summary_in = ApplyConcat(lambda v: v)
ip = ChildChildItemParser()
ip = ChildChildItemLoader()
ip.add_value('url', u'http://scrapy.org')
self.assertEqual(ip.get_output_value('url'), [u'HTTP://SCRAPY.ORG'])
ip.add_value('name', u'marta')
self.assertEqual(ip.get_output_value('name'), [u'Marta'])
def test_empty_map_concat(self):
class IdentityDefaultedItemParser(DefaultedItemParser):
class IdentityDefaultedItemLoader(DefaultedItemLoader):
name_in = ApplyConcat()
ip = IdentityDefaultedItemParser()
ip = IdentityDefaultedItemLoader()
ip.add_value('name', u'marta')
self.assertEqual(ip.get_output_value('name'), [u'marta'])
def test_identity_input_parser(self):
class IdentityDefaultedItemParser(DefaultedItemParser):
def test_identity_input_processor(self):
class IdentityDefaultedItemLoader(DefaultedItemLoader):
name_in = Identity()
ip = IdentityDefaultedItemParser()
ip = IdentityDefaultedItemLoader()
ip.add_value('name', u'marta')
self.assertEqual(ip.get_output_value('name'), [u'marta'])
def test_extend_custom_input_parsers(self):
class ChildItemParser(TestItemParser):
name_in = ApplyConcat(TestItemParser.name_in, unicode.swapcase)
def test_extend_custom_input_processors(self):
class ChildItemLoader(TestItemLoader):
name_in = ApplyConcat(TestItemLoader.name_in, unicode.swapcase)
ip = ChildItemParser()
ip = ChildItemLoader()
ip.add_value('name', u'marta')
self.assertEqual(ip.get_output_value('name'), [u'mARTA'])
def test_extend_default_input_parsers(self):
class ChildDefaultedItemParser(DefaultedItemParser):
name_in = ApplyConcat(DefaultedItemParser.default_input_parser, unicode.swapcase)
def test_extend_default_input_processors(self):
class ChildDefaultedItemLoader(DefaultedItemLoader):
name_in = ApplyConcat(DefaultedItemLoader.default_input_processor, unicode.swapcase)
ip = ChildDefaultedItemParser()
ip = ChildDefaultedItemLoader()
ip.add_value('name', u'marta')
self.assertEqual(ip.get_output_value('name'), [u'MART'])
def test_output_parser_using_function(self):
ip = TestItemParser()
def test_output_processor_using_function(self):
ip = TestItemLoader()
ip.add_value('name', [u'mar', u'ta'])
self.assertEqual(ip.get_output_value('name'), [u'Mar', u'Ta'])
class TakeFirstItemParser(TestItemParser):
class TakeFirstItemLoader(TestItemLoader):
name_out = u" ".join
ip = TakeFirstItemParser()
ip = TakeFirstItemLoader()
ip.add_value('name', [u'mar', u'ta'])
self.assertEqual(ip.get_output_value('name'), u'Mar Ta')
def test_output_parser_using_classes(self):
ip = TestItemParser()
def test_output_processor_using_classes(self):
ip = TestItemLoader()
ip.add_value('name', [u'mar', u'ta'])
self.assertEqual(ip.get_output_value('name'), [u'Mar', u'Ta'])
class TakeFirstItemParser(TestItemParser):
class TakeFirstItemLoader(TestItemLoader):
name_out = Join()
ip = TakeFirstItemParser()
ip = TakeFirstItemLoader()
ip.add_value('name', [u'mar', u'ta'])
self.assertEqual(ip.get_output_value('name'), u'Mar Ta')
class TakeFirstItemParser(TestItemParser):
class TakeFirstItemLoader(TestItemLoader):
name_out = Join("<br>")
ip = TakeFirstItemParser()
ip = TakeFirstItemLoader()
ip.add_value('name', [u'mar', u'ta'])
self.assertEqual(ip.get_output_value('name'), u'Mar<br>Ta')
def test_default_output_parser(self):
ip = TestItemParser()
def test_default_output_processor(self):
ip = TestItemLoader()
ip.add_value('name', [u'mar', u'ta'])
self.assertEqual(ip.get_output_value('name'), [u'Mar', u'Ta'])
class LalaItemParser(TestItemParser):
default_output_parser = Identity()
class LalaItemLoader(TestItemLoader):
default_output_processor = Identity()
ip = LalaItemParser()
ip = LalaItemLoader()
ip.add_value('name', [u'mar', u'ta'])
self.assertEqual(ip.get_output_value('name'), [u'Mar', u'Ta'])
def test_parser_context_on_declaration(self):
class ChildItemParser(TestItemParser):
url_in = ApplyConcat(parser_with_args, key=u'val')
def test_loader_context_on_declaration(self):
class ChildItemLoader(TestItemLoader):
url_in = ApplyConcat(processor_with_args, key=u'val')
ip = ChildItemParser()
ip = ChildItemLoader()
ip.add_value('url', u'text')
self.assertEqual(ip.get_output_value('url'), ['val'])
ip.replace_value('url', u'text2')
self.assertEqual(ip.get_output_value('url'), ['val'])
def test_parser_context_on_instantiation(self):
class ChildItemParser(TestItemParser):
url_in = ApplyConcat(parser_with_args)
def test_loader_context_on_instantiation(self):
class ChildItemLoader(TestItemLoader):
url_in = ApplyConcat(processor_with_args)
ip = ChildItemParser(key=u'val')
ip = ChildItemLoader(key=u'val')
ip.add_value('url', u'text')
self.assertEqual(ip.get_output_value('url'), ['val'])
ip.replace_value('url', u'text2')
self.assertEqual(ip.get_output_value('url'), ['val'])
def test_parser_context_on_assign(self):
class ChildItemParser(TestItemParser):
url_in = ApplyConcat(parser_with_args)
def test_loader_context_on_assign(self):
class ChildItemLoader(TestItemLoader):
url_in = ApplyConcat(processor_with_args)
ip = ChildItemParser()
ip = ChildItemLoader()
ip.context['key'] = u'val'
ip.add_value('url', u'text')
self.assertEqual(ip.get_output_value('url'), ['val'])
ip.replace_value('url', u'text2')
self.assertEqual(ip.get_output_value('url'), ['val'])
def test_item_passed_to_input_parser_functions(self):
def parser(value, parser_context):
return parser_context['item']['name']
def test_item_passed_to_input_processor_functions(self):
def processor(value, loader_context):
return loader_context['item']['name']
class ChildItemParser(TestItemParser):
url_in = ApplyConcat(parser)
class ChildItemLoader(TestItemLoader):
url_in = ApplyConcat(processor)
it = TestItem(name='marta')
ip = ChildItemParser(item=it)
ip = ChildItemLoader(item=it)
ip.add_value('url', u'text')
self.assertEqual(ip.get_output_value('url'), ['marta'])
ip.replace_value('url', u'text2')
self.assertEqual(ip.get_output_value('url'), ['marta'])
def test_add_value_on_unknown_field(self):
ip = TestItemParser()
ip = TestItemLoader()
self.assertRaises(KeyError, ip.add_value, 'wrong_field', [u'lala', u'lolo'])
class TestXPathItemParser(XPathItemParser):
class TestXPathItemLoader(XPathItemLoader):
default_item_class = TestItem
name_in = ApplyConcat(lambda v: v.title())
class XPathItemParserTest(unittest.TestCase):
class XPathItemLoaderTest(unittest.TestCase):
def test_constructor_errors(self):
self.assertRaises(RuntimeError, XPathItemParser)
self.assertRaises(RuntimeError, XPathItemLoader)
def test_constructor_with_selector(self):
sel = HtmlXPathSelector(text=u"<html><body><div>marta</div></body></html>")
l = TestXPathItemParser(selector=sel)
l = TestXPathItemLoader(selector=sel)
self.assert_(l.selector is sel)
l.add_xpath('name', '//div/text()')
self.assertEqual(l.get_output_value('name'), [u'Marta'])
def test_constructor_with_response(self):
response = HtmlResponse(url="", body="<html><body><div>marta</div></body></html>")
l = TestXPathItemParser(response=response)
l = TestXPathItemLoader(response=response)
self.assert_(l.selector)
l.add_xpath('name', '//div/text()')
self.assertEqual(l.get_output_value('name'), [u'Marta'])
def test_add_xpath_re(self):
response = HtmlResponse(url="", body="<html><body><div>marta</div></body></html>")
l = TestXPathItemParser(response=response)
l = TestXPathItemLoader(response=response)
l.add_xpath('name', '//div/text()', re='ma')
self.assertEqual(l.get_output_value('name'), [u'Ma'])