mirror of https://github.com/scrapy/scrapy.git synced 2025-02-25 14:43:46 +00:00

Several corrections made to items and adaptors documentation

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40804
elpolilla 2009-01-30 13:05:52 +00:00
parent d9a90a1f3f
commit 648753190c
2 changed files with 192 additions and 86 deletions


@ -12,73 +12,93 @@ Adaptors
Quick overview
==============
Scrapy's adaptors are a nice feature attached to RobustScrapedItems that allow you
to easily modify (adapt to your needs) any kind of information you want to put in your items at assignation time.
Scrapy's adaptors are a nice feature attached to :class:`RobustScrapedItem`
that allow you to easily modify (adapt to your needs) any kind of information
you want to put in your items at assignment time.
The following diagram shows the data flow from the moment you call the `attribute` method until the attribute is
actually set.
The following diagram shows the data flow from the moment you call the
``attribute`` method until the attribute is actually set.
.. image:: _images/adaptors_diagram.png
As you can see, adaptor pipelines are executed in tree form; which means that, for each of the values you pass to
the `attribute` method, the first adaptor will be applied. Then, for each of the resulting values of the first adaptor,
the second adaptor will be called, and so on.
This process will end up with a list of adapted values, which may contain zero, one, or many values.
As you can see, adaptor pipelines are executed in tree form, which means that
the first adaptor is applied to each of the values you pass to the
``attribute`` method. Then the second adaptor is called for each of the values
returned by the first adaptor, and so on. This process ends with a list of
adapted values, which may contain zero, one, or many values.
In case the attribute is a single-valued (this is defined in the item's ATTRIBUTES dictionary), the first element of this
list will be set, unless you call the `attribute` method with the add parameter as True, in which case the item's method
`_add_single_attributes` will be called with the attribute's name, type, and the list of attributes to join as parameters.
By default, this method raises NotImplementedError, so you should override it in your items in order to join any kind of objects.
If the attribute is single-valued (this is defined in the item's
``ATTRIBUTES`` dictionary), the first element of this list will be set, unless
you call the ``attribute`` method with the ``add`` parameter set to ``True``,
in which case the item's ``_add_single_attributes`` method will be called with
the attribute's name, its type, and the list of values to join as parameters.
By default, this method raises ``NotImplementedError``, so you should override
it in your items in order to join the kind of objects you work with.
If the attribute is a multivalued, the resulting list will be set to the item as is, unless you use -again- add=True,
in which case the list of already-existing values (if any) will be extended with the new one.
If the attribute is multivalued, the resulting list will be set to the item
as is, unless you use ``add=True`` again, in which case the list of
already-existing values (if any) will be extended with the new ones.
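The data flow described in this overview can be sketched in plain Python. This
is only an illustration written for modern Python, not Scrapy's implementation:
``run_pipeline`` and ``store`` are hypothetical names, and the sketch assumes
that an adaptor returns either a single value or a list/tuple of values, and
that an empty result leaves a single-valued attribute as ``None``.

```python
def run_pipeline(adaptors, values):
    """Apply each adaptor to every value produced by the previous one."""
    current = list(values)
    for adaptor in adaptors:
        produced = []
        for value in current:
            result = adaptor(value)
            # Assumption: an adaptor returns one value or a list/tuple of values.
            if isinstance(result, (list, tuple)):
                produced.extend(result)
            else:
                produced.append(result)
        current = produced
    return current


def store(existing, adapted, multivalued, add=False, join=None):
    """Pick what ends up in the item, following the rules described above."""
    if multivalued:
        # add=True extends the already-existing values (if any).
        return (existing or []) + adapted if add else adapted
    if add:
        # Stand-in for _add_single_attributes; which values it receives
        # exactly is an assumption of this sketch.
        if join is None:
            raise NotImplementedError("no way to join single values")
        return join(adapted)
    # Single-valued attribute: keep the first adapted value (assumed None if empty).
    return adapted[0] if adapted else None


# Split each string into letters, then upper-case every letter.
letters = run_pipeline([list, str.upper], ["ab", "cd"])
```

Here ``letters`` holds one adapted value per input letter; ``store`` would keep
only the first one for a single-valued attribute, or the whole list for a
multivalued one.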
Adaptor Pipelines
=================
.. class:: AdaptorPipe(adaptors=None)
An instance of this class represents an adaptor pipeline to be set for adapting a certain
item's attribute.
It provides some useful methods for adding/removing adaptors, and takes care of executing them properly.
Usually this class is not used directly, since the items already provide ways to handle adaptors without
having to manage AdaptorPipes.
An instance of this class represents an adaptor pipeline to be set for
adapting a certain item's attribute. It provides some useful methods for
adding/removing adaptors, and takes care of executing them properly.
Usually this class is not used directly, since the items already provide
ways to manage adaptors without having to handle AdaptorPipes.
:param adaptors: A list of callables to be added as adaptors at instancing time.
:param adaptors: A list of callables to be added as adaptors at
instancing time.
Methods:
.. method:: add_adaptor(adaptor, position=None)
This method is used for adding adaptors to the pipeline given a certain position.
This method is used for adding adaptors to the pipeline given
a certain position.
:param adaptor: Any callable that works as an adaptor
:param position: An integer meaning the position in which the adaptor will be inserted. If it's None
the adaptor will be appended at the end of the pipeline.
:param position: An integer meaning the position in which the adaptor
will be inserted. If it's None the adaptor will be appended at
the end of the pipeline.
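The ``position`` semantics described for ``add_adaptor`` can be illustrated
with a small stand-in class. ``SketchPipe`` is a hypothetical name used only
for this sketch; it is not the real :class:`AdaptorPipe` and only mirrors the
documented behaviour (append when ``position`` is ``None``, insert otherwise).

```python
class SketchPipe:
    """Hypothetical stand-in for an adaptor pipeline (not Scrapy's AdaptorPipe)."""

    def __init__(self, adaptors=None):
        # Copy the given callables so the caller's list is not modified.
        self.adaptors = list(adaptors or [])

    def add_adaptor(self, adaptor, position=None):
        if not callable(adaptor):
            raise TypeError("adaptors must be callable")
        if position is None:
            # No position given: append at the end of the pipeline.
            self.adaptors.append(adaptor)
        else:
            # Otherwise insert at the requested index.
            self.adaptors.insert(position, adaptor)


pipe = SketchPipe([str.strip])
pipe.add_adaptor(str.upper)              # appended at the end
pipe.add_adaptor(str.lower, position=0)  # inserted at the front
names = [adaptor.__name__ for adaptor in pipe.adaptors]
```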
Usage
=====
As it was previously said, in order to use adaptor pipelines you must inherit your items from the RobustScrapedItem class.
If you don't know anything about these items, read the :ref:`topics-items` reference first.
As mentioned above, in order to use adaptor pipelines your items must inherit
from the :class:`RobustScrapedItem` class. If you don't know
anything about these items, read the :ref:`topics-items` reference first.
Once you've created your own item class (inherited from RobustScrapedItem) with the attributes you're going to use,
you have to add adaptor pipelines to each attribute you'd like to adapt data for.
For doing so, RobustScrapedItems provide some useful methods (`set_adaptors`, `set_attrib_adaptors`, and more), which are
also described in its reference.
Once you've created your own item class (inherited from
:class:`RobustScrapedItem`) with the attributes you're going to use, you have
to add adaptor pipelines to each attribute you'd like to adapt data for. To do
so, :class:`RobustScrapedItem` provides some useful methods such as
``set_adaptors`` and ``set_attrib_adaptors`` (described in its reference),
so that you don't need to work with :class:`AdaptorPipe` objects directly.
But let's now talk a bit about adaptors (singularly), what are them, and how should they be implemented?
Adaptors are basically, any callable that receives a value, modifies it, and returns a new value (or more) so that the next
adaptor goes on with another adapting task (or not).
This is done this way to make the process of modifying information very customizable, and also to make adaptors reusable,
since they are intended to be small functions designed for simple purposes that can be applied in many different cases.
For example, you could make an adaptor for removing any <b> tags in a text, like this::
Adaptors
--------
Let's now talk a bit about individual adaptors: what are they, and how
should they be implemented?
An adaptor is basically any callable that receives
a value, modifies it, and returns a new value (or several) so that the next
adaptor can go on with another adapting task (or not). This is done to
make the process of modifying information very customizable, and also to make
adaptors reusable, since they are intended to be small functions designed for
simple purposes that can be applied in many different cases. For example, you
could make an adaptor for removing any <b> tags in a text, like this::
>>> import re
>>> B_TAG_RE = re.compile(r'</?b\s*>')
>>> def remove_b_tags(text):
...     return B_TAG_RE.sub('', text)
Then you could easily add this adaptor to a certain attribute's pipeline like this::
Then you could easily add this adaptor to a certain attribute's pipeline like
this::
>>> item = MyItem()
>>> item.add_adaptor('text', remove_b_tags)
@ -86,15 +106,18 @@ Then you could easily add this adaptor to a certain attribute's pipeline like th
>>> item.text
u'some random text in bold and some random text in normal font'
As you can see, this would make any value that you set to the item through the `attribute` method first pass through the
`remove_b_tags` adaptor, which would also replace any matching tag with an empty string.
As you can see, this would make any value that you set to the item through the
``attribute`` method first pass through the ``remove_b_tags`` adaptor, which
would also replace any matching tag with an empty string.
----
But anyway, let's now think of a bit more complicated (and useless) example: let's say you want to scrape a text, split it into single
letters, strip the vowels, turn the rest to capital letters, and join them again.
In this case, we could use three simple adaptors to process our data, plus a customized RobustScrapedItem for joining single
text attributes; let's see an example::
Let's now think of a slightly more complicated (and contrived) example:
let's say you want to scrape a text, split it into single letters, strip the
vowels, turn the rest to capital letters, and join them again. In this case,
we could use three simple adaptors to process our data, plus a customized
:class:`RobustScrapedItem` for joining single text attributes; let's see an
example::
>>> # First of all, we define the item class we're going to use
>>> from string import ascii_letters
@ -132,15 +155,51 @@ Let's now try with an example text to see what happens::
>>> item.text
'PWND'
More complex adaptors
---------------------
Now, after using adaptors a bit, you may find yourself in situations where you need
to use adaptors that receive other parameters from the ``attribute`` method
apart from the value to adapt.
For example, imagine you have an adaptor that removes certain characters from strings
you provide. Would you make an adaptor for each combination of characters you'd like
to strip? Of course not!
The way to handle these cases is to make an adaptor that, apart from receiving a value
as any other adaptor does, receives a parameter called ``adaptor_args``.
It's important that the parameter has exactly this name, since Scrapy finds out whether
an adaptor is able to receive extra parameters by using introspection
and looking for a parameter with this name in the adaptor's parameter list.
This parameter receives the same dictionary
of keyword arguments that you pass to the ``attribute`` method when calling it.
Getting back to the characters example, how would we implement this?
Quite similarly to any other adaptor, let's see::
def strip_chars(value, adaptor_args):
    chars = adaptor_args.get('strip_chars', [])
    for char in chars:
        value = value.replace(char, '')
    return value
Then, after creating an item and adding the adaptor to one of its pipelines, we could do::
>>> item.attribute('text', 'Hi, my name is John', strip_chars=['a', 'i', 'm'])
>>> item.text
'H, y ne s John'
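The introspection step mentioned above can be sketched with the standard
``inspect`` module. This is a guess at the idea written for modern Python, not
Scrapy's actual code: ``accepts_adaptor_args`` and ``call_adaptor`` are
hypothetical helpers, and passing the dictionary positionally is an assumption.

```python
import inspect


def accepts_adaptor_args(adaptor):
    """Return True if the callable declares a parameter named adaptor_args."""
    try:
        params = inspect.signature(adaptor).parameters
    except (TypeError, ValueError):
        # Some builtins expose no signature; treat them as plain adaptors.
        return False
    return "adaptor_args" in params


def call_adaptor(adaptor, value, adaptor_args):
    """Pass the extra keyword arguments only to adaptors that ask for them."""
    if accepts_adaptor_args(adaptor):
        return adaptor(value, adaptor_args)
    return adaptor(value)


def strip_chars(value, adaptor_args):
    # Same adaptor as in the example above.
    for char in adaptor_args.get("strip_chars", []):
        value = value.replace(char, "")
    return value


extra = {"strip_chars": ["a", "i", "m"]}
result = call_adaptor(strip_chars, "Hi, my name is John", extra)
shouted = call_adaptor(str.upper, "hi", extra)  # extra arguments are not passed
```

With this approach, simple one-argument adaptors and parameterized ones can
share the same pipeline without any registration step.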
Debugging
=========
While you're coding spiders and adaptors, you usually need to know exactly what does Scrapy
do under the hood with the values you provide.
There's a setting called :setting:`ADAPTORS_DEBUG` for this purpose that makes Scrapy print
debugging messages each time an adaptors pipeline is run, specifying which attribute is being
adapted data for, the input/output values of each adaptor in the pipeline, and the input/output
of `_add_single_attributes` (in some cases).
While you're coding spiders and adaptors, you usually need to know exactly what
Scrapy does under the hood with the values you provide. There's a setting
called :setting:`ADAPTORS_DEBUG` for this purpose that makes Scrapy print
debugging messages each time an adaptor pipeline is run, specifying which
attribute the data is being adapted for, the input/output values of each adaptor in
the pipeline, and the input/output of ``_add_single_attributes`` (in some
cases).
You can enable this setting as any other, either by adding it to your settings file, or by enabling
the environment variable `SCRAPY_ADAPTORS_DEBUG`.
You can enable this setting like any other, either by adding it to your settings
file, or by setting the environment variable ``SCRAPY_ADAPTORS_DEBUG``.


@ -7,8 +7,9 @@ Items
Quick overview
==============
| In Scrapy, items are the placeholder to use for the scraped data.
They are represented by a :class:`ScrapedItem` object, or any descendant class instance, and store the information in class attributes.
In Scrapy, items are the placeholder to use for the scraped data. They are
represented by a :class:`ScrapedItem` object, or any descendant class instance,
and store the information in class attributes.
ScrapedItems
============
@ -23,11 +24,13 @@ Methods
.. method:: ScrapedItem.__init__(data=None)
:param data: A dictionary containing attributes and values to be set after instancing the item.
:param data: A dictionary containing attributes and values to be set
after instancing the item.
Instanciates a ``ScrapedItem`` object and sets an attribute and its value for each key in the given ``data``
dict (if any).
These items are the most basic items available, and the common interface from which any items should inherit.
Instantiates a ``ScrapedItem`` object and sets an attribute and its value
for each key in the given ``data`` dict (if any). These items are the most
basic items available, and the common interface from which all items should
inherit.
Examples
--------
@ -56,62 +59,88 @@ RobustScrapedItems
.. class:: RobustScrapedItem
RobustScrapedItems are more complex items (compared to ScrapedItems) and have a few more features available, which
RobustScrapedItems are more complex items (compared to
:class:`ScrapedItem`) and have a few more features available, which
include:
* Attributes dictionary: items that inherit from RobustScrapedItem are defined with a dictionary of attributes in the class.
This allows the item to have much more logic at the moment of handling and setting attributes. The next features are
built on top of this one.
* Attributes dictionary: items that inherit from RobustScrapedItem are
defined with a dictionary of attributes in the class. This allows the
item to have more logic at the moment of handling and setting attributes
than the :class:`ScrapedItem`.
* Adaptors: maybe the most important of the features that these items provide. The adaptors are a system designed for
filtering/modifying data before setting it to the item, that makes cleansing tasks *a lot* easier.
* Adaptors: perhaps the most important of the features these items provide.
The adaptors are a system designed for filtering/modifying data before
setting it to the item, that makes cleansing tasks a lot easier.
* Type checking: RobustScrapedItems come with a built-in type checking which assures you that no data of the wrong type will
get into the items without raising a warning.
* Type checking: RobustScrapedItems come with built-in type checking,
which assures you that no data of the wrong type will get into the items
without raising a warning.
* Versioning: These items also provide versioning by making a unique hash for each item based on its attributes values.
* Versioning: These items also provide versioning by making a unique hash
for each item based on its attributes values.
* ItemDeltas: You can subtract two RobustScrapedItems, which allows you to know the difference between a pair of items.
This difference is represented by a RobustItemDelta object.
* ItemDeltas: You can subtract two RobustScrapedItems, which allows you to
know the difference between a pair of items. This difference is
represented by a RobustItemDelta object.
Attributes
----------
.. attribute:: RobustScrapedItem.ATTRIBUTES
This attribute **must** be specified when writing your items. It is a
dictionary in which the keys are the names of the attributes your item will
have, and the values are the types of those attributes. For multivalued
attributes, you should write the type of the values inside a list, e.g.
``'numbers': [int]``
Methods
-------
.. method:: RobustScrapedItem.__init__(data=None, adaptor_args=None)
:param data: Idem as in ScrapedItems
:param adaptor_args: A dictionary of the like "attribute -> list of adaptors" for defining adaptors automatically after
instancing the item.
:param data: Same as for :class:`ScrapedItem`
:param adaptor_args: A dictionary of the kind
``'attribute': [list_of_adaptors]`` for defining adaptors automatically
after instancing the item.
Constructor of RobustScrapedItem objects.
.. method:: RobustScrapedItem.attribute(self, attrname, value, override=False, add=False, ***kwargs)
.. method:: RobustScrapedItem.attribute(attrname, value, override=False, add=False, **kwargs)
Sets the item's ``attrname`` attribute with the given ``value`` filtering it through the given attribute's adaptor
pipeline (if any).
Sets the item's ``attrname`` attribute with the given ``value`` filtering
it through the given attribute's adaptor pipeline (if any).
:param attrname: a string containing the name of the attribute you want to set.
:param attrname: a string containing the name of the attribute you want
to set.
:param value: the value you want to assign, which will be adapted by the corresponding adaptors for the given attribute (if any).
:param value: the value you want to assign, which will be adapted by
the corresponding adaptors for the given attribute (if any).
:param override: if True, makes this method avoid checking if there was a previous value and sets ``value`` no matter what.
:param override: if True, makes this method avoid checking if there
was a previous value and sets ``value`` no matter what.
:param add: if True, tries to concatenate the given ``value`` with the one already set in the item.
For multivalued attributes, this will extend the list of already-set values, with the new ones.
For single valued attributes, the method _add_single_attributes (which is explained below) will be called.
:param add: if True, tries to concatenate the given ``value`` with the one
already set in the item. For multivalued attributes, this will extend
the list of already-set values with the new ones.
For single-valued attributes, the method ``_add_single_attributes`` (which
is explained below) will be called.
:param kwargs: any extra parameters will be passed in a dictionary to any adaptor that receives a parameter called 'adaptor_args'.
:param kwargs: any extra parameters will be passed in a dictionary to any
adaptor that receives a parameter called ``adaptor_args``.
Check the :ref:`topics-adaptors` topic for more information.
.. method:: RobustScrapedItem.set_adaptors(self, adaptors_dict)
.. method:: RobustScrapedItem.set_adaptors(adaptors_dict)
Receives a dict containing a list of adaptors for each desired attribute (key) and sets each of them as their adaptor pipeline.
Receives a dict containing a list of adaptors for each desired attribute
(key) and sets each of them as their adaptor pipeline.
.. method:: RobustScrapedItem.set_attrib_adaptors(self, attrib, pipe)
.. method:: RobustScrapedItem.set_attrib_adaptors(attrib, pipe)
Sets the provided iterable (``pipe``) as the adaptor pipeline for the given attribute (``attrib``)
Sets the provided iterable (``pipe``) as the adaptor pipeline for the
given attribute (``attrib``).
.. method:: RobustScrapedItem.add_adaptor(self, attrib, adaptor, position=None)
.. method:: RobustScrapedItem.add_adaptor(attrib, adaptor, position=None)
Adds an adaptor to an already existing (or not) pipeline.
@ -120,7 +149,22 @@ Methods
:param adaptor: a callable to be added to the pipeline.
:param position: an integer representing the place where to add the adaptor.
If it's `None`, the adaptor will be appended at the end of the pipeline.
If it's ``None``, the adaptor will be appended at the end of the pipeline.
.. method:: RobustScrapedItem._add_single_attributes(attrname, attrtype, attributes)
This method is called whenever the values of a single-valued attribute have
to be joined before being stored in an item. That is,
every time you have multiple results at the end of your adaptor pipeline
and you called the ``attribute`` method with the parameter ``add=True``.
This method is intended to be overridden by you, since by default it
raises an exception.
:param attrname: the name of the attribute you're setting
:param attrtype: the type of the attribute you're setting
:param attributes: the list of resulting values after the adaptors pipeline
(the one you have to join somehow)
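As an illustration, an override that joins text values could look like the
sketch below. ``ItemStandIn`` and ``TextItem`` are hypothetical classes used so
that the example runs on its own in modern Python; in real code the base class
would be :class:`RobustScrapedItem`. That the returned value is what gets
stored in the item is an assumption of this sketch.

```python
class ItemStandIn:
    """Hypothetical stand-in for RobustScrapedItem, reduced to the hook."""

    def _add_single_attributes(self, attrname, attrtype, attributes):
        # Default behaviour described above: joining is not implemented.
        raise NotImplementedError


class TextItem(ItemStandIn):
    def _add_single_attributes(self, attrname, attrtype, attributes):
        # Join text values into one string; refuse any other attribute type.
        if issubclass(attrtype, str):
            return "".join(attributes)
        raise NotImplementedError(
            "cannot join %s values for %r" % (attrtype.__name__, attrname)
        )


joined = TextItem()._add_single_attributes("text", str, ["P", "W", "N", "D"])
```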
Examples
--------
@ -136,6 +180,9 @@ Creating a pretty basic item with a few attributes::
'colours': [basestring],
}
Setting some adaptors::
.. note::
More RobustScrapedItem examples are about to come. In the meantime, check the :ref:`topics-adaptors` topic to see a few of them.