mirror of
https://github.com/scrapy/scrapy.git
synced 2025-02-25 14:43:46 +00:00
Several corrections made to items and adaptors documentation
--HG-- extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40804
This commit is contained in:
parent
d9a90a1f3f
commit
648753190c
@ -12,73 +12,93 @@ Adaptors
|
||||
Quick overview
|
||||
==============
|
||||
|
||||
Scrapy's adaptors are a nice feature attached to RobustScrapedItems that allow you
|
||||
to easily modify (adapt to your needs) any kind of information you want to put in your items at assignation time.
|
||||
Scrapy's adaptors are a feature of :class:`RobustScrapedItem` that lets you
modify (adapt to your needs) any information you put in your items at the
moment it is assigned.
|
||||
|
||||
The following diagram shows the data flow from the moment you call the `attribute` method until the attribute is
|
||||
actually set.
|
||||
The following diagram shows the data flow from the moment you call the
|
||||
``attribute`` method until the attribute is actually set.
|
||||
|
||||
.. image:: _images/adaptors_diagram.png
|
||||
|
||||
As you can see, adaptor pipelines are executed in tree form; which means that, for each of the values you pass to
|
||||
the `attribute` method, the first adaptor will be applied. Then, for each of the resulting values of the first adaptor,
|
||||
the second adaptor will be called, and so on.
|
||||
This process will end up with a list of adapted values, which may contain zero, one, or many values.
|
||||
As you can see, adaptor pipelines are executed in tree form: for each of the
values you pass to the ``attribute`` method, the first adaptor is applied.
Then, for each value returned by the first adaptor, the second adaptor is
called, and so on. The process ends with a list of adapted values, which may
contain zero, one, or many values.
|
||||
|
||||
In case the attribute is a single-valued (this is defined in the item's ATTRIBUTES dictionary), the first element of this
|
||||
list will be set, unless you call the `attribute` method with the add parameter as True, in which case the item's method
|
||||
`_add_single_attributes` will be called with the attribute's name, type, and the list of attributes to join as parameters.
|
||||
By default, this method raises NotImplementedError, so you should override it in your items in order to join any kind of objects.
|
||||
If the attribute is single-valued (this is defined in the item's
``ATTRIBUTES`` dictionary), the first element of this list is set, unless you
call the ``attribute`` method with the ``add`` parameter set to ``True``. In
that case the item's ``_add_single_attributes`` method is called with the
attribute's name, its type, and the list of values to join. By default this
method raises ``NotImplementedError``, so you should override it in your items
to join whatever kind of objects you store.
|
||||
|
||||
If the attribute is a multivalued, the resulting list will be set to the item as is, unless you use -again- add=True,
|
||||
in which case the list of already-existing values (if any) will be extended with the new one.
|
||||
If the attribute is multivalued, the resulting list is set on the item as is,
unless you again pass ``add=True``, in which case the list of already-existing
values (if any) is extended with the new ones.
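
The tree-form execution described above can be sketched in a few lines of
plain Python. This is only an illustrative sketch, not Scrapy's implementation:
``run_pipeline``, ``split_words`` and ``drop_short`` are hypothetical names,
and it assumes that an adaptor may return a single value, a list of values, or
``None`` (treated here as "no value").

```python
def run_pipeline(adaptors, values):
    """Apply each adaptor to every value produced by the previous adaptor."""
    current = list(values)
    for adaptor in adaptors:
        next_values = []
        for value in current:
            result = adaptor(value)
            if result is None:
                continue  # assumption: None means "drop this value"
            if isinstance(result, (list, tuple)):
                next_values.extend(result)  # one input may fan out into many
            else:
                next_values.append(result)
        current = next_values
    return current


def split_words(text):
    # Hypothetical adaptor: one string in, several strings out.
    return text.split()


def drop_short(word):
    # Hypothetical adaptor: filters out words of two letters or fewer.
    return word if len(word) > 2 else None


adapted = run_pipeline([split_words, drop_short, str.upper],
                       ["a big cat", "of mice"])
print(adapted)
```

The final list may be empty, hold one value, or hold many, which is why the
single-valued and multivalued cases have to be handled separately.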
|
||||
|
||||
Adaptor Pipelines
|
||||
=================
|
||||
|
||||
.. class:: AdaptorPipe(adaptors=None)
|
||||
|
||||
An instance of this class represents an adaptor pipeline to be set for adapting a certain
|
||||
item's attribute.
|
||||
It provides some useful methods for adding/removing adaptors, and takes care of executing them properly.
|
||||
Usually this class is not used directly, since the items already provide ways to handle adaptors without
|
||||
having to manage AdaptorPipes.
|
||||
An instance of this class represents an adaptor pipeline to be set for
|
||||
adapting a certain item's attribute. It provides some useful methods for
|
||||
adding/removing adaptors, and takes care of executing them properly.
|
||||
Usually this class is not used directly, since the items already provide
|
||||
ways to manage adaptors without having to handle AdaptorPipes.
|
||||
|
||||
:param adaptors: A list of callables to be added as adaptors at instancing time.
|
||||
:param adaptors: A list of callables to be added as adaptors at
|
||||
instancing time.
|
||||
|
||||
Methods:
|
||||
|
||||
.. method:: add_adaptor(adaptor, position=None)
|
||||
|
||||
This method is used for adding adaptors to the pipeline given a certain position.
|
||||
This method is used for adding adaptors to the pipeline given
|
||||
a certain position.
|
||||
|
||||
:param adaptor: Any callable that works as an adaptor
|
||||
:param position: An integer meaning the position in which the adaptor will be inserted. If it's None
|
||||
the adaptor will be appended at the end of the pipeline.
|
||||
:param position: An integer meaning the position in which the adaptor
|
||||
will be inserted. If it's None the adaptor will be appended at
|
||||
the end of the pipeline.
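
As a rough illustration of the ``position`` semantics documented above, here
is a minimal stand-in class. ``SketchAdaptorPipe`` is a hypothetical name and
its body is an assumption; only the ``add_adaptor(adaptor, position=None)``
signature and the "append when ``None``" behaviour come from this reference.

```python
class SketchAdaptorPipe:
    """Hypothetical stand-in for AdaptorPipe; not Scrapy's actual class."""

    def __init__(self, adaptors=None):
        # Copy the given list so the caller's list is never mutated.
        self.adaptors = list(adaptors or [])

    def add_adaptor(self, adaptor, position=None):
        if not callable(adaptor):
            raise TypeError("adaptors must be callable")
        if position is None:
            self.adaptors.append(adaptor)            # end of the pipeline
        else:
            self.adaptors.insert(position, adaptor)  # explicit position


pipe = SketchAdaptorPipe([str.strip])
pipe.add_adaptor(str.upper)              # appended at the end
pipe.add_adaptor(str.lower, position=0)  # inserted at the front
names = [adaptor.__name__ for adaptor in pipe.adaptors]
print(names)
```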
|
||||
|
||||
Usage
|
||||
=====
|
||||
|
||||
As it was previously said, in order to use adaptor pipelines you must inherit your items from the RobustScrapedItem class.
|
||||
If you don't know anything about these items, read the :ref:`topics-items` reference first.
|
||||
As mentioned above, in order to use adaptor pipelines your items must inherit
from the :class:`RobustScrapedItem` class. If you are not familiar with these
items, read the :ref:`topics-items` reference first.
|
||||
|
||||
Once you've created your own item class (inherited from RobustScrapedItem) with the attributes you're going to use,
|
||||
you have to add adaptor pipelines to each attribute you'd like to adapt data for.
|
||||
For doing so, RobustScrapedItems provide some useful methods (`set_adaptors`, `set_attrib_adaptors`, and more), which are
|
||||
also described in its reference.
|
||||
Once you've created your own item class (inherited from
:class:`RobustScrapedItem`) with the attributes you're going to use, you have
to add an adaptor pipeline to each attribute you'd like to adapt data for. To
do so, RobustScrapedItems provide methods such as ``set_adaptors`` and
``set_attrib_adaptors`` (described in the :class:`RobustScrapedItem`
reference), so that you don't need to work with :class:`AdaptorPipe` objects
directly.
|
||||
|
||||
But let's now talk a bit about adaptors (singularly), what are them, and how should they be implemented?
|
||||
Adaptors are basically, any callable that receives a value, modifies it, and returns a new value (or more) so that the next
|
||||
adaptor goes on with another adapting task (or not).
|
||||
This is done this way to make the process of modifying information very customizable, and also to make adaptors reusable,
|
||||
since they are intended to be small functions designed for simple purposes that can be applied in many different cases.
|
||||
For example, you could make an adaptor for removing any <b> tags in a text, like this::
|
||||
Adaptors
|
||||
--------
|
||||
|
||||
Let's now talk about individual adaptors: what they are, and how they should
be implemented.
|
||||
|
||||
An adaptor is basically any callable that receives a value, modifies it, and
returns a new value (or several), so that the next adaptor in the pipeline can
carry on with another adapting task. This design makes the process of
modifying information highly customizable and makes adaptors reusable, since
they are intended to be small functions with a single simple purpose that can
be applied in many different cases. For example, you could write an adaptor
that removes any ``<b>`` tags from a text, like this::
|
||||
|
||||
>>> import re
>>> B_TAG_RE = re.compile(r'</?b\s*>')
>>> def remove_b_tags(text):
...     return B_TAG_RE.sub('', text)
|
||||
|
||||
Then you could easily add this adaptor to a certain attribute's pipeline like this::
|
||||
Then you could easily add this adaptor to a certain attribute's pipeline like
|
||||
this::
|
||||
|
||||
>>> item = MyItem()
|
||||
>>> item.add_adaptor('text', remove_b_tags)
|
||||
@ -86,15 +106,18 @@ Then you could easily add this adaptor to a certain attribute's pipeline like th
|
||||
>>> item.text
|
||||
u'some random text in bold and some random text in normal font'
|
||||
|
||||
As you can see, this would make any value that you set to the item through the `attribute` method first pass through the
|
||||
`remove_b_tags` adaptor, which would also replace any matching tag with an empty string.
|
||||
As you can see, any value that you set on the item through the ``attribute``
method first passes through the ``remove_b_tags`` adaptor, which replaces any
matching tag with an empty string.
|
||||
|
||||
----
|
||||
|
||||
But anyway, let's now think of a bit more complicated (and useless) example: let's say you want to scrape a text, split it into single
|
||||
letters, strip the vowels, turn the rest to capital letters, and join them again.
|
||||
In this case, we could use three simple adaptors to process our data, plus a customized RobustScrapedItem for joining single
|
||||
text attributes; let's see an example::
|
||||
Now let's think of a slightly more complicated (and rather contrived) example:
say you want to scrape a text, split it into single letters, strip the vowels,
turn the rest into capital letters, and join them again. In this case, we could
use three simple adaptors to process our data, plus a customized
:class:`RobustScrapedItem` for joining single text attributes; let's see an
example::
|
||||
|
||||
>>> # First of all, we define the item class we're going to use
|
||||
>>> from string import ascii_letters
|
||||
@ -132,15 +155,51 @@ Let's now try with an example text to see what happens::
|
||||
>>> item.text
|
||||
'PWND'
|
||||
|
||||
More complex adaptors
|
||||
---------------------
|
||||
|
||||
After using adaptors for a while, you may find yourself in situations where
an adaptor needs to receive other parameters from the ``attribute`` method
apart from the value to adapt.
|
||||
|
||||
For example, imagine you have an adaptor that removes certain characters from strings
|
||||
you provide. Would you make an adaptor for each combination of characters you'd like
|
||||
to strip? Of course not!
|
||||
|
||||
The way to handle these cases is to write an adaptor that, apart from
receiving a value like any other adaptor, receives a parameter called
``adaptor_args``. It's important that the parameter has exactly this name,
since Scrapy finds out whether an adaptor can receive extra parameters by
introspection, looking for a parameter with this name in the adaptor's
parameter list.
|
||||
|
||||
This parameter receives the dictionary of keyword arguments that you passed
to the ``attribute`` method when calling it.
|
||||
|
||||
Getting back to the characters example, how would we implement it? Much like
any other adaptor, let's see::
|
||||
|
||||
    def strip_chars(value, adaptor_args):
        chars = adaptor_args.get('strip_chars', [])
        for char in chars:
            value = value.replace(char, '')
        return value
|
||||
|
||||
Then, after creating an item and adding the adaptor to one of its pipelines, we could do::
|
||||
|
||||
>>> item.attribute('text', 'Hi, my name is John', strip_chars=['a', 'i', 'm'])
|
||||
>>> item.text
|
||||
'H, y ne s John'
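
The introspection step described in this section could look roughly like the
following. This is a sketch under stated assumptions, not Scrapy's code:
``accepts_adaptor_args`` and ``call_adaptor`` are hypothetical helpers, and
``inspect.signature`` is a modern standard-library choice made for this
illustration.

```python
import inspect


def accepts_adaptor_args(adaptor):
    """Return True if the callable declares a parameter named adaptor_args."""
    try:
        parameters = inspect.signature(adaptor).parameters
    except (TypeError, ValueError):
        # Some builtins expose no signature; treat them as plain adaptors.
        return False
    return "adaptor_args" in parameters


def call_adaptor(adaptor, value, **kwargs):
    """Pass the extra keyword arguments only to adaptors that ask for them."""
    if accepts_adaptor_args(adaptor):
        return adaptor(value, kwargs)
    return adaptor(value)


def strip_chars(value, adaptor_args):
    # Same adaptor as in the example above.
    for char in adaptor_args.get("strip_chars", []):
        value = value.replace(char, "")
    return value


result = call_adaptor(strip_chars, "Hi, my name is John",
                      strip_chars=["a", "i", "m"])
plain = call_adaptor(str.upper, "abc", strip_chars=["a"])
print(result, plain)
```

Adaptors that do not declare ``adaptor_args`` are called with the value alone,
so existing one-argument adaptors keep working unchanged.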
|
||||
|
||||
Debugging
|
||||
=========
|
||||
|
||||
While you're coding spiders and adaptors, you usually need to know exactly what does Scrapy
|
||||
do under the hood with the values you provide.
|
||||
There's a setting called :setting:`ADAPTORS_DEBUG` for this purpose that makes Scrapy print
|
||||
debugging messages each time an adaptors pipeline is run, specifying which attribute is being
|
||||
adapted data for, the input/output values of each adaptor in the pipeline, and the input/output
|
||||
of `_add_single_attributes` (in some cases).
|
||||
While you're coding spiders and adaptors, you often need to know exactly what
Scrapy does under the hood with the values you provide. For this purpose there
is a setting called :setting:`ADAPTORS_DEBUG`, which makes Scrapy print
debugging messages each time an adaptor pipeline is run. The messages specify
which attribute data is being adapted for, the input/output values of each
adaptor in the pipeline, and (in some cases) the input/output of
``_add_single_attributes``.
|
||||
|
||||
You can enable this setting as any other, either by adding it to your settings file, or by enabling
|
||||
the environment variable `SCRAPY_ADAPTORS_DEBUG`.
|
||||
You can enable this setting like any other, either by adding it to your
settings file or by setting the environment variable
``SCRAPY_ADAPTORS_DEBUG``.
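
For instance, assuming your project keeps its settings in a Python module (the
file name ``settings.py`` is an assumption here), enabling the setting would
be a one-line change:

```python
# settings.py (hypothetical file name for your project's settings module)

# Print debugging messages every time an adaptor pipeline runs.
ADAPTORS_DEBUG = True
```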
|
||||
|
@ -7,8 +7,9 @@ Items
|
||||
Quick overview
|
||||
==============
|
||||
|
||||
In Scrapy, items are the placeholder to use for the scraped data.
|
||||
They are represented by a :class:`ScrapedItem` object, or any descendant class instance, and store the information in class attributes.
|
||||
In Scrapy, items are the placeholder to use for the scraped data. They are
|
||||
represented by a :class:`ScrapedItem` object, or any descendant class instance,
|
||||
and store the information in class attributes.
|
||||
|
||||
ScrapedItems
|
||||
============
|
||||
@ -23,11 +24,13 @@ Methods
|
||||
|
||||
.. method:: ScrapedItem.__init__(data=None)
|
||||
|
||||
:param data: A dictionary containing attributes and values to be set after instancing the item.
|
||||
:param data: A dictionary containing attributes and values to be set
|
||||
after instancing the item.
|
||||
|
||||
Instanciates a ``ScrapedItem`` object and sets an attribute and its value for each key in the given ``data``
|
||||
dict (if any).
|
||||
These items are the most basic items available, and the common interface from which any items should inherit.
|
||||
Instantiates a ``ScrapedItem`` object and sets an attribute and its value
for each key in the given ``data`` dict (if any). These items are the most
basic items available, and provide the common interface from which all other
items should inherit.
|
||||
|
||||
Examples
|
||||
--------
|
||||
@ -56,62 +59,88 @@ RobustScrapedItems
|
||||
|
||||
.. class:: RobustScrapedItem
|
||||
|
||||
RobustScrapedItems are more complex items (compared to ScrapedItems) and have a few more features available, which
|
||||
RobustScrapedItems are more complex items (compared to
|
||||
:class:`ScrapedItem`) and have a few more features available, which
|
||||
include:
|
||||
|
||||
* Attributes dictionary: items that inherit from RobustScrapedItem are defined with a dictionary of attributes in the class.
|
||||
This allows the item to have much more logic at the moment of handling and setting attributes. The next features are
|
||||
built on top of this one.
|
||||
* Attributes dictionary: items that inherit from RobustScrapedItem are
|
||||
defined with a dictionary of attributes in the class. This allows the
|
||||
item to have more logic at the moment of handling and setting attributes
|
||||
than the :class:`ScrapedItem`.
|
||||
|
||||
* Adaptors: maybe the most important of the features that these items provide. The adaptors are a system designed for
|
||||
filtering/modifying data before setting it to the item, that makes cleansing tasks *a lot* easier.
|
||||
* Adaptors: perhaps the most important of the features these items provide.
  Adaptors are a system for filtering/modifying data before it is set on the
  item, which makes cleansing tasks a lot easier.
|
||||
|
||||
* Type checking: RobustScrapedItems come with a built-in type checking which assures you that no data of the wrong type will
|
||||
get into the items without raising a warning.
|
||||
* Type checking: RobustScrapedItems come with built-in type checking, which
  ensures that no data of the wrong type gets into the items without raising
  a warning.
|
||||
|
||||
* Versioning: These items also provide versioning by making a unique hash for each item based on its attributes values.
|
||||
* Versioning: these items also provide versioning by computing a unique hash
  for each item based on its attribute values.
|
||||
|
||||
* ItemDeltas: You can subtract two RobustScrapedItems, which allows you to know the difference between a pair of items.
|
||||
This difference is represented by a RobustItemDelta object.
|
||||
* ItemDeltas: You can subtract two RobustScrapedItems, which allows you to
|
||||
know the difference between a pair of items. This difference is
|
||||
represented by a RobustItemDelta object.
|
||||
|
||||
Attributes
|
||||
----------
|
||||
|
||||
.. attribute:: RobustScrapedItem.ATTRIBUTES
|
||||
|
||||
This attribute **must** be specified when writing your items. It is a
dictionary in which the keys are the names of the attributes your item will
have, and the values are the types of those attributes. For multivalued
attributes, write the type of the values inside a list, e.g.
``'numbers': [int]``.
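
To make the single-valued versus multivalued convention concrete, here is a
small self-contained sketch of how such a dictionary could drive type
checking. It does not use Scrapy itself: ``check_type`` is a hypothetical
helper, the attribute names are made up, and the warning message is an
assumption.

```python
import warnings

# Same shape as an item's ATTRIBUTES: a bare type for single-valued
# attributes, a one-element list holding the type for multivalued ones.
ATTRIBUTES = {
    "name": str,
    "numbers": [int],
}


def check_type(attrname, value):
    """Warn (and return False) when value does not match the declared type."""
    declared = ATTRIBUTES[attrname]
    if isinstance(declared, list):
        expected = declared[0]
        ok = isinstance(value, list) and all(
            isinstance(item, expected) for item in value
        )
    else:
        ok = isinstance(value, declared)
    if not ok:
        warnings.warn("%r is not a valid value for %r" % (value, attrname))
    return ok


print(check_type("name", "widget"))
print(check_type("numbers", [1, 2, 3]))
```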
|
||||
|
||||
Methods
|
||||
-------
|
||||
|
||||
.. method:: RobustScrapedItem.__init__(data=None, adaptor_args=None)
|
||||
|
||||
:param data: Idem as in ScrapedItems
|
||||
:param adaptor_args: A dictionary of the like "attribute -> list of adaptors" for defining adaptors automatically after
|
||||
instancing the item.
|
||||
:param data: Same as for :class:`ScrapedItem`.
:param adaptor_args: A dictionary of the form
    ``'attribute': [list_of_adaptors]``, used for defining adaptors
    automatically after instantiating the item.
|
||||
|
||||
Constructor of RobustScrapedItem objects.
|
||||
|
||||
.. method:: RobustScrapedItem.attribute(self, attrname, value, override=False, add=False, ***kwargs)
|
||||
.. method:: RobustScrapedItem.attribute(attrname, value, override=False, add=False, **kwargs)
|
||||
|
||||
Sets the item's ``attrname`` attribute with the given ``value`` filtering it through the given attribute's adaptor
|
||||
pipeline (if any).
|
||||
Sets the item's ``attrname`` attribute with the given ``value`` filtering
|
||||
it through the given attribute's adaptor pipeline (if any).
|
||||
|
||||
:param attrname: a string containing the name of the attribute you want to set.
|
||||
:param attrname: a string containing the name of the attribute you want
|
||||
to set.
|
||||
|
||||
:param value: the value you want to assign, which will be adapted by the corresponding adaptors for the given attribute (if any).
|
||||
:param value: the value you want to assign, which will be adapted by
|
||||
the corresponding adaptors for the given attribute (if any).
|
||||
|
||||
:param override: if True, makes this method avoid checking if there was a previous value and sets ``value`` no matter what.
|
||||
:param override: if True, makes this method avoid checking if there
|
||||
was a previous value and sets ``value`` no matter what.
|
||||
|
||||
:param add: if True, tries to concatenate the given ``value`` with the one already set in the item.
|
||||
For multivalued attributes, this will extend the list of already-set values, with the new ones.
|
||||
For single valued attributes, the method _add_single_attributes (which is explained below) will be called.
|
||||
:param add: if True, tries to combine the given ``value`` with the one
    already set in the item. For multivalued attributes, this extends the
    list of already-set values with the new ones. For single-valued
    attributes, the ``_add_single_attributes`` method (explained below) is
    called.
|
||||
|
||||
:param kwargs: any extra parameters will be passed in a dictionary to any adaptor that receives a parameter called 'adaptor_args'.
|
||||
:param kwargs: any extra parameters will be passed in a dictionary to any
|
||||
adaptor that receives a parameter called ``adaptor_args``.
|
||||
Check the :ref:`topics-adaptors` topic for more information.
|
||||
|
||||
.. method:: RobustScrapedItem.set_adaptors(self, adaptors_dict)
|
||||
.. method:: RobustScrapedItem.set_adaptors(adaptors_dict)
|
||||
|
||||
Receives a dict containing a list of adaptors for each desired attribute (key) and sets each of them as their adaptor pipeline.
|
||||
Receives a dict containing a list of adaptors for each desired attribute
|
||||
(key) and sets each of them as their adaptor pipeline.
|
||||
|
||||
.. method:: RobustScrapedItem.set_attrib_adaptors(self, attrib, pipe)
|
||||
.. method:: RobustScrapedItem.set_attrib_adaptors(attrib, pipe)
|
||||
|
||||
Sets the provided iterable (``pipe``) as the adaptor pipeline for the given attribute (``attrib``)
|
||||
Sets the provided iterable (``pipe``) as the adaptor pipeline for the
|
||||
given attribute (``attrib``)
|
||||
|
||||
.. method:: RobustScrapedItem.add_adaptor(self, attrib, adaptor, position=None)
|
||||
.. method:: RobustScrapedItem.add_adaptor(attrib, adaptor, position=None)
|
||||
|
||||
Adds an adaptor to an already existing (or not) pipeline.
|
||||
|
||||
@ -120,7 +149,22 @@ Methods
|
||||
:param adaptor: a callable to be added to the pipeline.
|
||||
|
||||
:param position: an integer representing the place where to add the adaptor.
|
||||
If it's `None`, the adaptor will be appended at the end of the pipeline.
|
||||
If it's ``None``, the adaptor will be appended at the end of the pipeline.
|
||||
|
||||
.. method:: RobustScrapedItem._add_single_attributes(attrname, attrtype, attributes)
|
||||
|
||||
This method is called whenever the values of a single-valued attribute have
to be joined before being stored in an item. That is, every time you have
multiple results at the end of your adaptor pipeline and you called the
``attribute`` method with ``add=True``.

This method is intended to be overridden by you, since by default it raises
``NotImplementedError``.
|
||||
|
||||
:param attrname: the name of the attribute you're setting
|
||||
:param attrtype: the type of the attribute you're setting
|
||||
:param attributes: the list of resulting values after the adaptors pipeline
|
||||
(the one you have to join somehow)
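
An override might look like the sketch below. To keep it runnable without
Scrapy, ``SketchRobustItem`` is a hypothetical stand-in that only reproduces
the default "raise ``NotImplementedError``" behaviour. The sketch also assumes
that the method returns the joined value and uses ``str`` as the text type.

```python
class SketchRobustItem:
    """Hypothetical stand-in for the joining hook of RobustScrapedItem."""

    def _add_single_attributes(self, attrname, attrtype, attributes):
        raise NotImplementedError(
            "don't know how to join values for %r" % attrname
        )


class TextItem(SketchRobustItem):
    def _add_single_attributes(self, attrname, attrtype, attributes):
        # Join text values; defer to the default for any other type.
        if issubclass(attrtype, str):
            return "".join(attributes)
        return super()._add_single_attributes(attrname, attrtype, attributes)


joined = TextItem()._add_single_attributes("text", str, ["P", "W", "N", "D"])
print(joined)
```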
|
||||
|
||||
Examples
|
||||
--------
|
||||
@ -136,6 +180,9 @@ Creating a pretty basic item with a few attributes::
|
||||
'colours': [basestring],
|
||||
}
|
||||
|
||||
Setting some adaptors::
|
||||
|
||||
|
||||
.. note::
|
||||
|
||||
More RobustScrapedItem examples will be added soon. In the meantime, check the :ref:`topics-adaptors` topic to see a few of them.
|
||||
|