diff --git a/scrapy/trunk/docs/topics/adaptors.rst b/scrapy/trunk/docs/topics/adaptors.rst index 108956135..8781f95f6 100644 --- a/scrapy/trunk/docs/topics/adaptors.rst +++ b/scrapy/trunk/docs/topics/adaptors.rst @@ -12,73 +12,93 @@ Adaptors Quick overview ============== -Scrapy's adaptors are a nice feature attached to RobustScrapedItems that allow you -to easily modify (adapt to your needs) any kind of information you want to put in your items at assignation time. +Scrapy's adaptors are a nice feature attached to :class:`RobustScrapedItem` +that allow you to easily modify (adapt to your needs) any kind of information +you want to put in your items at assignation time. -The following diagram shows the data flow from the moment you call the `attribute` method until the attribute is -actually set. +The following diagram shows the data flow from the moment you call the +``attribute`` method until the attribute is actually set. .. image:: _images/adaptors_diagram.png -As you can see, adaptor pipelines are executed in tree form; which means that, for each of the values you pass to -the `attribute` method, the first adaptor will be applied. Then, for each of the resulting values of the first adaptor, -the second adaptor will be called, and so on. -This process will end up with a list of adapted values, which may contain zero, one, or many values. +As you can see, adaptor pipelines are executed in tree form; which means that, +for each of the values you pass to the ``attribute`` method, the first adaptor +will be applied. Then, for each of the resulting values of the first adaptor, +the second adaptor will be called, and so on. This process will end up with a +list of adapted values, which may contain zero, one, or many values. 
-In case the attribute is a single-valued (this is defined in the item's ATTRIBUTES dictionary), the first element of this -list will be set, unless you call the `attribute` method with the add parameter as True, in which case the item's method -`_add_single_attributes` will be called with the attribute's name, type, and the list of attributes to join as parameters. -By default, this method raises NotImplementedError, so you should override it in your items in order to join any kind of objects. +In case the attribute is a single-valued (this is defined in the item's +``ATTRIBUTES`` dictionary), the first element of this list will be set, unless +you call the ``attribute`` method with the add parameter as True, in which case +the item's method ``_add_single_attributes`` will be called with the +attribute's name, type, and the list of attributes to join as parameters. By +default, this method raises NotImplementedError, so you should override it in +your items in order to join any kind of objects. -If the attribute is a multivalued, the resulting list will be set to the item as is, unless you use -again- add=True, -in which case the list of already-existing values (if any) will be extended with the new one. +If the attribute is a multivalued, the resulting list will be set to the item +as is, unless you use -again- add=True, in which case the list of +already-existing values (if any) will be extended with the new one. Adaptor Pipelines ================= .. class:: AdaptorPipe(adaptors=None) - An instance of this class represents an adaptor pipeline to be set for adapting a certain - item's attribute. - It provides some useful methods for adding/removing adaptors, and takes care of executing them properly. - Usually this class is not used directly, since the items already provide ways to handle adaptors without - having to manage AdaptorPipes. + An instance of this class represents an adaptor pipeline to be set for + adapting a certain item's attribute. 
It provides some useful methods for + adding/removing adaptors, and takes care of executing them properly. + Usually this class is not used directly, since the items already provide + ways to manage adaptors without having to handle AdaptorPipes. - :param adaptors: A list of callables to be added as adaptors at instancing time. + :param adaptors: A list of callables to be added as adaptors at + instancing time. Methods: .. method:: add_adaptor(adaptor, position=None) - This method is used for adding adaptors to the pipeline given a certain position. + This method is used for adding adaptors to the pipeline given + a certain position. :param adaptor: Any callable that works as an adaptor - :param position: An integer meaning the position in which the adaptor will be inserted. If it's None - the adaptor will be appended at the end of the pipeline. + :param position: An integer meaning the position in which the adaptor + will be inserted. If it's None the adaptor will be appended at + the end of the pipeline. Usage ===== -As it was previously said, in order to use adaptor pipelines you must inherit your items from the RobustScrapedItem class. -If you don't know anything about these items, read the :ref:`topics-items` reference first. +As it was previously said, in order to use adaptor pipelines you must inherit +your items from the :class:`RobustScrapedItem` class. If you don't know +anything about these items, read the :ref:`topics-items` reference first. -Once you've created your own item class (inherited from RobustScrapedItem) with the attributes you're going to use, -you have to add adaptor pipelines to each attribute you'd like to adapt data for. -For doing so, RobustScrapedItems provide some useful methods (`set_adaptors`, `set_attrib_adaptors`, and more), which are -also described in its reference. 
+Once you've created your own item class (inherited from +:class:`RobustScrapedItem`) with the attributes you're going to use, you have +to add adaptor pipelines to each attribute you'd like to adapt data for. For +doing so, RobustScrapedItems provide some useful methods like ``set_adaptors``, +``set_attrib_adaptors``, and more (which are also described in its reference) +so that you don't need to work with :class:`AdaptorPipe` objects directly. -But let's now talk a bit about adaptors (singularly), what are them, and how should they be implemented? -Adaptors are basically, any callable that receives a value, modifies it, and returns a new value (or more) so that the next -adaptor goes on with another adapting task (or not). -This is done this way to make the process of modifying information very customizable, and also to make adaptors reusable, -since they are intended to be small functions designed for simple purposes that can be applied in many different cases. -For example, you could make an adaptor for removing any tags in a text, like this:: +Adaptors +-------- + +Let's now talk a bit about adaptors (singularly), what are them, and how +should they be implemented? + +Adaptors are basically, any callable that receives +a value, modifies it, and returns a new value (or more) so that the next +adaptor goes on with another adapting task (or not). This is done this way to +make the process of modifying information very customizable, and also to make +adaptors reusable, since they are intended to be small functions designed for +simple purposes that can be applied in many different cases. 
For example, you +could make an adaptor for removing any tags in a text, like this:: >>> B_TAG_RE = re.compile(r'') >>> def remove_b_tags(text): >>> return B_TAG_RE.sub('', text) -Then you could easily add this adaptor to a certain attribute's pipeline like this:: +Then you could easily add this adaptor to a certain attribute's pipeline like +this:: >>> item = MyItem() >>> item.add_adaptor('text', remove_b_tags) @@ -86,15 +106,18 @@ Then you could easily add this adaptor to a certain attribute's pipeline like th >>> item.text u'some random text in bold and some random text in normal font' -As you can see, this would make any value that you set to the item through the `attribute` method first pass through the -`remove_b_tags` adaptor, which would also replace any matching tag with an empty string. +As you can see, this would make any value that you set to the item through the +``attribute`` method first pass through the ``remove_b_tags`` adaptor, which +would also replace any matching tag with an empty string. ---- -But anyway, let's now think of a bit more complicated (and useless) example: let's say you want to scrape a text, split it into single -letters, strip the vowels, turn the rest to capital letters, and join them again. -In this case, we could use three simple adaptors to process our data, plus a customized RobustScrapedItem for joining single -text attributes; let's see an example:: +But anyway, let's now think of a bit more complicated (and useless) example: +let's say you want to scrape a text, split it into single letters, strip the +vowels, turn the rest to capital letters, and join them again. 
In this case, +we could use three simple adaptors to process our data, plus a customized +:class:`RobustScrapedItem` for joining single text attributes; let's see an +example:: >>> # First of all, we define the item class we're going to use >>> from string import ascii_letters @@ -132,15 +155,51 @@ Let's now try with an example text to see what happens:: >>> item.text 'PWND' +More complex adaptors +--------------------- + +Now, after using adaptors a bit, you may find yourself in situations where you need +to use adaptors that receive other parameters from the ``attribute`` method +apart from the value to adapt. + +For example, imagine you have an adaptor that removes certain characters from strings +you provide. Would you make an adaptor for each combination of characters you'd like +to strip? Of course not! + +The way to handle these cases is to make an adaptor that, apart from receiving a value +like any other adaptor, receives a parameter called ``adaptor_args``. +It's important that the parameter has exactly this name, since Scrapy finds out whether +an adaptor is able to receive extra parameters or not through introspection, +by looking for a parameter with this name in the adaptor's parameter list. + +This parameter receives the same dictionary +of keyword arguments that you pass to the ``attribute`` method when calling it. + +But let's get back to the characters example: how would we implement this? 
+Quite similar to any other adaptor, let's see:: + + def strip_chars(value, adaptor_args): + chars = adaptor_args.get('strip_chars', []) + for char in chars: + value = value.replace(char, '') + return value + +Then, after creating an item and adding the adaptor to one of its pipelines, we could do:: + + >>> item.attribute('text', 'Hi, my name is John', strip_chars=['a', 'i', 'm']) + >>> item.text + 'H, y ne s John' + Debugging ========= -While you're coding spiders and adaptors, you usually need to know exactly what does Scrapy -do under the hood with the values you provide. -There's a setting called :setting:`ADAPTORS_DEBUG` for this purpose that makes Scrapy print -debugging messages each time an adaptors pipeline is run, specifying which attribute is being -adapted data for, the input/output values of each adaptor in the pipeline, and the input/output -of `_add_single_attributes` (in some cases). +While you're coding spiders and adaptors, you usually need to know exactly what +Scrapy does under the hood with the values you provide. There's a setting +called :setting:`ADAPTORS_DEBUG` for this purpose that makes Scrapy print +debugging messages each time an adaptor pipeline is run, specifying which +attribute is being adapted, the input/output values of each adaptor in +the pipeline, and the input/output of ``_add_single_attributes`` (in some +cases). -You can enable this setting as any other, either by adding it to your settings file, or by enabling -the environment variable `SCRAPY_ADAPTORS_DEBUG`. +You can enable this setting like any other, either by adding it to your settings +file, or by enabling the environment variable ``SCRAPY_ADAPTORS_DEBUG``. 
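Review note on the adaptors.rst changes above: the tree-form execution and the ``adaptor_args`` lookup that the new text describes can be sketched in plain Python. This is a minimal sketch under stated assumptions, not Scrapy's implementation: ``run_pipeline`` and ``accepts_adaptor_args`` are hypothetical helpers, the regular expression used by ``remove_b_tags`` is an assumed pattern, and treating a list or tuple as "many values" and ``None`` as "no value" is a choice made only for illustration.

```python
import inspect
import re


def accepts_adaptor_args(adaptor):
    """Hypothetical helper: does the callable declare an 'adaptor_args' parameter?"""
    return "adaptor_args" in inspect.signature(adaptor).parameters


def run_pipeline(adaptors, values, **adaptor_args):
    """Hypothetical helper: run each adaptor on every value of the previous stage."""
    for adaptor in adaptors:
        next_values = []
        for value in values:
            if accepts_adaptor_args(adaptor):
                result = adaptor(value, adaptor_args)
            else:
                result = adaptor(value)
            # Assumption: an adaptor may yield zero, one, or many values.
            if result is None:
                continue
            if isinstance(result, (list, tuple)):
                next_values.extend(result)
            else:
                next_values.append(result)
        values = next_values
    return values


# Assumed pattern for matching opening and closing <b> tags.
B_TAG_RE = re.compile(r"</?b>")


def remove_b_tags(text):
    return B_TAG_RE.sub("", text)


def strip_chars(value, adaptor_args):
    for char in adaptor_args.get("strip_chars", []):
        value = value.replace(char, "")
    return value


result = run_pipeline(
    [remove_b_tags, strip_chars],
    ["<b>Hi</b>, my name is John"],
    strip_chars=["a", "i", "m"],
)
print(result)  # ['H, y ne s John']
```

Only ``strip_chars`` declares ``adaptor_args``, so it is the only adaptor that receives the keyword arguments; ``remove_b_tags`` is called with the value alone.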
diff --git a/scrapy/trunk/docs/topics/items.rst b/scrapy/trunk/docs/topics/items.rst index 0cbf1b53b..cec853904 100644 --- a/scrapy/trunk/docs/topics/items.rst +++ b/scrapy/trunk/docs/topics/items.rst @@ -7,8 +7,9 @@ Items Quick overview ============== -| In Scrapy, items are the placeholder to use for the scraped data. - They are represented by a :class:`ScrapedItem` object, or any descendant class instance, and store the information in class attributes. +In Scrapy, items are the placeholder to use for the scraped data. They are +represented by a :class:`ScrapedItem` object, or any descendant class instance, +and store the information in class attributes. ScrapedItems ============ @@ -23,11 +24,13 @@ Methods .. method:: ScrapedItem.__init__(data=None) - :param data: A dictionary containing attributes and values to be set after instancing the item. + :param data: A dictionary containing attributes and values to be set + after instancing the item. - Instanciates a ``ScrapedItem`` object and sets an attribute and its value for each key in the given ``data`` - dict (if any). - These items are the most basic items available, and the common interface from which any items should inherit. + Instanciates a ``ScrapedItem`` object and sets an attribute and its value + for each key in the given ``data`` dict (if any). These items are the most + basic items available, and the common interface from which any items should + inherit. Examples -------- @@ -56,62 +59,88 @@ RobustScrapedItems .. class:: RobustScrapedItem - RobustScrapedItems are more complex items (compared to ScrapedItems) and have a few more features available, which + RobustScrapedItems are more complex items (compared to + :class:`ScrapedItem`) and have a few more features available, which include: - * Attributes dictionary: items that inherit from RobustScrapedItem are defined with a dictionary of attributes in the class. 
- This allows the item to have much more logic at the moment of handling and setting attributes. The next features are - built on top of this one. + * Attributes dictionary: items that inherit from RobustScrapedItem are + defined with a dictionary of attributes in the class. This allows the + item to have more logic at the moment of handling and setting attributes + than the :class:`ScrapedItem`. - * Adaptors: maybe the most important of the features that these items provide. The adaptors are a system designed for - filtering/modifying data before setting it to the item, that makes cleansing tasks *a lot* easier. + * Adaptors: perhaps the most important of the features these items provide. + The adaptors are a system designed for filtering/modifying data before + setting it to the item, that makes cleansing tasks a lot easier. - * Type checking: RobustScrapedItems come with a built-in type checking which assures you that no data of the wrong type will - get into the items without raising a warning. + * Type checking: RobustScrapedItems come with a built-in type checking + which assures you that no data of the wrong type will get into the items + without raising a warning. - * Versioning: These items also provide versioning by making a unique hash for each item based on its attributes values. + * Versioning: These items also provide versioning by making a unique hash + for each item based on its attributes values. - * ItemDeltas: You can subtract two RobustScrapedItems, which allows you to know the difference between a pair of items. - This difference is represented by a RobustItemDelta object. + * ItemDeltas: You can subtract two RobustScrapedItems, which allows you to + know the difference between a pair of items. This difference is + represented by a RobustItemDelta object. + +Attributes +---------- + +.. 
attribute:: RobustScrapedItem.ATTRIBUTES + + This attribute **must** be specified when writing your items, and it's a + dictionary in which the keys are the names of the attributes your item will + have, and their values are the type of those attributes. For multivalued + attributes, you should write the type of the values inside a list, e.g.: + ``'numbers': [int]`` Methods ------- .. method:: RobustScrapedItem.__init__(data=None, adaptor_args=None) - :param data: Idem as in ScrapedItems - :param adaptor_args: A dictionary of the like "attribute -> list of adaptors" for defining adaptors automatically after - instancing the item. + :param data: Same as for ScrapedItems + :param adaptor_args: A dictionary of the kind + ``'attribute': [list_of_adaptors]`` for defining adaptors automatically + after instancing the item. Constructor of RobustScrapedItem objects. -.. method:: RobustScrapedItem.attribute(self, attrname, value, override=False, add=False, ***kwargs) +.. method:: RobustScrapedItem.attribute(attrname, value, override=False, add=False, **kwargs) - Sets the item's ``attrname`` attribute with the given ``value`` filtering it through the given attribute's adaptor - pipeline (if any). + Sets the item's ``attrname`` attribute with the given ``value`` filtering + it through the given attribute's adaptor pipeline (if any). - :param attrname: a string containing the name of the attribute you want to set. + :param attrname: a string containing the name of the attribute you want + to set. - :param value: the value you want to assign, which will be adapted by the corresponding adaptors for the given attribute (if any). + :param value: the value you want to assign, which will be adapted by + the corresponding adaptors for the given attribute (if any). - :param override: if True, makes this method avoid checking if there was a previous value and sets ``value`` no matter what. 
+ :param override: if True, makes this method avoid checking if there + was a previous value and sets ``value`` no matter what. - :param add: if True, tries to concatenate the given ``value`` with the one already set in the item. - For multivalued attributes, this will extend the list of already-set values, with the new ones. - For single valued attributes, the method _add_single_attributes (which is explained below) will be called. + :param add: if True, tries to concatenate the given ``value`` with the one + already set in the item. For multivalued attributes, this will extend + the list of already-set values, with the new ones. + For single valued attributes, the method _add_single_attributes (which + is explained below) will be called. - :param kwargs: any extra parameters will be passed in a dictionary to any adaptor that receives a parameter called 'adaptor_args'. + :param kwargs: any extra parameters will be passed in a dictionary to any + adaptor that receives a parameter called ``adaptor_args``. Check the :ref:`topics-adaptors` topic for more information. -.. method:: RobustScrapedItem.set_adaptors(self, adaptors_dict) +.. method:: RobustScrapedItem.set_adaptors(adaptors_dict) - Receives a dict containing a list of adaptors for each desired attribute (key) and sets each of them as their adaptor pipeline. + Receives a dict containing a list of adaptors for each desired attribute + (key) and sets each of them as their adaptor pipeline. -.. method:: RobustScrapedItem.set_attrib_adaptors(self, attrib, pipe) +.. method:: RobustScrapedItem.set_attrib_adaptors(attrib, pipe) - Sets the provided iterable (``pipe``) as the adaptor pipeline for the given attribute (``attrib``) + Sets the provided iterable (``pipe``) as the adaptor pipeline for the + given attribute (``attrib``) -.. method:: RobustScrapedItem.add_adaptor(self, attrib, adaptor, position=None) +.. 
method:: RobustScrapedItem.add_adaptor(attrib, adaptor, position=None) Adds an adaptor to an already existing (or not) pipeline. @@ -120,7 +149,22 @@ Methods :param adaptor: a callable to be added to the pipeline. :param position: an integer representing the place where to add the adaptor. - If it's `None`, the adaptor will be appended at the end of the pipeline. + If it's ``None``, the adaptor will be appended at the end of the pipeline. + +.. method:: RobustScrapedItem._add_single_attributes(attrname, attrtype, attributes) + + This method is called whenever the values of a single-valued attribute have to be + joined before being stored in an item. That is, + every time you have multiple results at the end of your adaptors pipeline, + and you called the ``attribute`` method with the parameter ``add=True``. + + This method is intended to be overridden by you, since by default it + raises NotImplementedError. + + :param attrname: the name of the attribute you're setting + :param attrtype: the type of the attribute you're setting + :param attributes: the list of resulting values after the adaptors pipeline + (the one you have to join somehow) Examples -------- @@ -136,6 +180,9 @@ Creating a pretty basic item with a few attributes:: 'colours': [basestring], } +Setting some adaptors:: + + .. note:: More RobustScrapedItem examples are about to come. In the meantime, check the :ref:`topics-adaptors` topic to see a few of them.
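Review note on the items.rst changes above: the relationship between ``ATTRIBUTES``, the ``add`` parameter, and ``_add_single_attributes`` may be easier to follow with a small model. The sketch below is a hypothetical stand-in written for illustration only; ``SketchItem`` and ``TextItem`` are invented names, this is not the real :class:`RobustScrapedItem`, and it skips adaptors, type checking, and the ``override`` parameter. It assumes the values passed in are the already-adapted list.

```python
class SketchItem:
    """Hypothetical stand-in that models only the documented add/join rules."""

    # Attribute name -> type; a one-element list marks a multivalued attribute.
    ATTRIBUTES = {}

    def attribute(self, attrname, values, add=False):
        attrtype = self.ATTRIBUTES[attrname]
        if isinstance(attrtype, list):
            # Multivalued: set the list as is, or extend the existing one.
            current = getattr(self, attrname, []) if add else []
            setattr(self, attrname, list(current) + list(values))
        elif add:
            # Single-valued with add=True: delegate the join to the item.
            joined = self._add_single_attributes(attrname, attrtype, values)
            setattr(self, attrname, joined)
        elif values:
            # Single-valued: keep only the first adapted value.
            setattr(self, attrname, values[0])

    def _add_single_attributes(self, attrname, attrtype, attributes):
        raise NotImplementedError("override this method to join values")


class TextItem(SketchItem):
    ATTRIBUTES = {"text": str, "numbers": [int]}

    def _add_single_attributes(self, attrname, attrtype, attributes):
        # Assumption: string values are joined by simple concatenation.
        if attrtype is str:
            return "".join(attributes)
        return super()._add_single_attributes(attrname, attrtype, attributes)


item = TextItem()
item.attribute("text", ["P", "W", "N", "D"], add=True)
item.attribute("numbers", [1, 2])
item.attribute("numbers", [3], add=True)
print(item.text, item.numbers)  # PWND [1, 2, 3]
```

Keeping the join in an overridable method, as the documentation describes, lets each item class decide what "adding" means for its own attribute types.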