mirror of
https://github.com/scrapy/scrapy.git
synced 2025-02-25 09:43:46 +00:00
206 lines
8.2 KiB
ReStructuredText
206 lines
8.2 KiB
ReStructuredText
.. _topics-adaptors:
|
|
|
|
==============================
|
|
Old Item Adaptors (deprecated)
|
|
==============================
|
|
|
|
.. warning::
|
|
|
|
This documentation is deprecated, and will be superseeded by the new item
|
|
adaptors (not yet documented).
|
|
|
|
Quick overview
|
|
==============
|
|
|
|
Scrapy's adaptors are a nice feature attached to :class:`RobustScrapedItem`
|
|
that allow you to easily modify (adapt to your needs) any kind of information
|
|
you want to put in your items at assignation time.
|
|
|
|
The following diagram shows the data flow from the moment you call the
|
|
``attribute`` method until the attribute is actually set.
|
|
|
|
.. image:: _images/adaptors_diagram.png
|
|
|
|
As you can see, adaptor pipelines are executed in tree form; which means that,
|
|
for each of the values you pass to the ``attribute`` method, the first adaptor
|
|
will be applied. Then, for each of the resulting values of the first adaptor,
|
|
the second adaptor will be called, and so on. This process will end up with a
|
|
list of adapted values, which may contain zero, one, or many values.
|
|
|
|
In case the attribute is a single-valued (this is defined in the item's
|
|
``ATTRIBUTES`` dictionary), the first element of this list will be set, unless
|
|
you call the ``attribute`` method with the add parameter as True, in which case
|
|
the item's method ``_add_single_attributes`` will be called with the
|
|
attribute's name, type, and the list of attributes to join as parameters. By
|
|
default, this method raises NotImplementedError, so you should override it in
|
|
your items in order to join any kind of objects.
|
|
|
|
If the attribute is a multivalued, the resulting list will be set to the item
|
|
as is, unless you use -again- add=True, in which case the list of
|
|
already-existing values (if any) will be extended with the new one.pgq
|
|
|
|
Adaptor Pipelines
|
|
=================
|
|
|
|
.. class:: AdaptorPipe(adaptors=None)
|
|
|
|
An instance of this class represents an adaptor pipeline to be set for
|
|
adapting a certain item's attribute. It provides some useful methods for
|
|
adding/removing adaptors, and takes care of executing them properly.
|
|
Usually this class is not used directly, since the items already provide
|
|
ways to manage adaptors without having to handle AdaptorPipes.
|
|
|
|
:param adaptors: A list of callables to be added as adaptors at
|
|
instancing time.
|
|
|
|
Methods:
|
|
|
|
.. method:: add_adaptor(adaptor, position=None)
|
|
|
|
This method is used for adding adaptors to the pipeline given
|
|
a certain position.
|
|
|
|
:param adaptor: Any callable that works as an adaptor
|
|
:param position: An integer meaning the position in which the adaptor
|
|
will be inserted. If it's None the adaptor will be appended at
|
|
the end of the pipeline.
|
|
|
|
Usage
|
|
=====
|
|
|
|
As it was previously said, in order to use adaptor pipelines you must inherit
|
|
your items from the :class:`RobustScrapedItem` class. If you don't know
|
|
anything about these items, read the :ref:`topics-items` reference first.
|
|
|
|
Once you've created your own item class (inherited from
|
|
:class:`RobustScrapedItem`) with the attributes you're going to use, you have
|
|
to add adaptor pipelines to each attribute you'd like to adapt data for. For
|
|
doing so, RobustScrapedItems provide some useful methods like ``set_adaptors``,
|
|
``set_attrib_adaptors``, and more (which are also described in its reference)
|
|
so that you don't need to work with :class:`AdaptorPipe` objects directly.
|
|
|
|
Adaptors
|
|
--------
|
|
|
|
Let's now talk a bit about adaptors (singularly), what are them, and how
|
|
should they be implemented?
|
|
|
|
Adaptors are basically, any callable that receives
|
|
a value, modifies it, and returns a new value (or more) so that the next
|
|
adaptor goes on with another adapting task (or not). This is done this way to
|
|
make the process of modifying information very customizable, and also to make
|
|
adaptors reusable, since they are intended to be small functions designed for
|
|
simple purposes that can be applied in many different cases. For example, you
|
|
could make an adaptor for removing any <b> tags in a text, like this::
|
|
|
|
>>> B_TAG_RE = re.compile(r'</?b\s*>')
|
|
>>> def remove_b_tags(text):
|
|
>>> return B_TAG_RE.sub('', text)
|
|
|
|
Then you could easily add this adaptor to a certain attribute's pipeline like
|
|
this::
|
|
|
|
>>> item = MyItem()
|
|
>>> item.add_adaptor('text', remove_b_tags)
|
|
>>> item.attribute('text', u'<b>some random text in bold</b> and some random text in normal font')
|
|
>>> item.text
|
|
u'some random text in bold and some random text in normal font'
|
|
|
|
As you can see, this would make any value that you set to the item through the
|
|
``attribute`` method first pass through the ``remove_b_tags`` adaptor, which
|
|
would also replace any matching tag with an empty string.
|
|
|
|
----
|
|
|
|
But anyway, let's now think of a bit more complicated (and useless) example:
|
|
let's say you want to scrape a text, split it into single letters, strip the
|
|
vowels, turn the rest to capital letters, and join them again. In this case,
|
|
we could use three simple adaptors to process our data, plus a customized
|
|
:class:`RobustScrapedItem` for joining single text attributes; let's see an
|
|
example::
|
|
|
|
>>> # First of all, we define the item class we're going to use
|
|
>>> from string import ascii_letters
|
|
>>> from scrapy.contrib.item import RobustScrapedItem
|
|
>>> class MyItem(RobustScrapedItem):
|
|
>>> ATTRIBUTES = {
|
|
>>> 'text': basestring,
|
|
>>> }
|
|
|
|
>>> def _add_single_attributes(self, attrname, attrtype, attributes):
|
|
>>> return ''.join(attributes)
|
|
|
|
>>> # Now we'll write the needed adaptors
|
|
>>> def to_letters(text):
|
|
>>> return tuple(letter for letter in text)
|
|
|
|
>>> def is_vowel(letter):
|
|
>>> if letter in ascii_letters and letter.lower() not in ('a', 'e', 'i', 'o', 'u'):
|
|
>>> return letter
|
|
|
|
>>> def to_upper(letter):
|
|
>>> return letter.upper()
|
|
|
|
>>> # Finally, we'll join all the pieces and see how it works
|
|
>>> item = MyItem()
|
|
>>> item.set_attrib_adaptors('text', [
|
|
>>> to_letters,
|
|
>>> is_vowel,
|
|
>>> to_upper,
|
|
>>> ])
|
|
|
|
Let's now try with an example text to see what happens::
|
|
|
|
>>> item.attribute('text', 'pi', 'wind', add=True)
|
|
>>> item.text
|
|
'PWND'
|
|
|
|
More complex adaptors
|
|
---------------------
|
|
|
|
Now, after using adaptors a bit, you may find yourself in situations where you need
|
|
to use adaptors that receive other parameters from the ``attribute`` method
|
|
apart from the value to adapt.
|
|
|
|
For example, imagine you have an adaptor that removes certain characters from strings
|
|
you provide. Would you make an adaptor for each combination of characters you'd like
|
|
to strip? Of course not!
|
|
|
|
The way to handle this cases, is to make an adaptor that apart from receiving a value,
|
|
as any other adaptor, receives a parameter called ``adaptor_args``.
|
|
It's important that the parameter is called this way, since Scrapy finds out whether
|
|
an adaptor is able to receive extra parameters or not by making instrospection
|
|
and looking for a parameter called this way in the adaptor's parameters list.
|
|
|
|
The information this parameter will receive won't be anything else but the same dictionary
|
|
of keyword arguments that you pass to the ``attribute`` method when calling it.
|
|
|
|
But let's get back to the characters example, how would we implement this?
|
|
Quite simmilar to any other adaptor, let's see::
|
|
|
|
def strip_chars(value, adaptor_args):
|
|
chars = adaptor_args.get('strip_chars', [])
|
|
for char in chars:
|
|
value = value.replace(char, '')
|
|
return value
|
|
|
|
Then, after creating an item and adding the adaptor to one of its pipelines, we could do::
|
|
|
|
>>> item.attribute('text', 'Hi, my name is John', strip_chars=['a', 'i', 'm'])
|
|
>>> item.text
|
|
'H, y ne s John'
|
|
|
|
Debugging
|
|
=========
|
|
|
|
While you're coding spiders and adaptors, you usually need to know exactly what
|
|
does Scrapy do under the hood with the values you provide. There's a setting
|
|
called :setting:``ADAPTORS_DEBUG`` for this purpose that makes Scrapy print
|
|
debugging messages each time an adaptors pipeline is run, specifying which
|
|
attribute is being adapted data for, the input/output values of each adaptor in
|
|
the pipeline, and the input/output of ``_add_single_attributes`` (in some
|
|
cases).
|
|
|
|
You can enable this setting as any other, either by adding it to your settings
|
|
file, or by enabling the environment variable ``SCRAPY_ADAPTORS_DEBUG``.
|