=======  =============================
SEP      7
Title    ItemLoader processors library
Author   Ismael Carnales
Created  2009-08-10
Status   Draft
=======  =============================

======================================
SEP-007: ItemLoader processors library
======================================

This SEP proposes a library of ``ItemLoader`` processors to ship with Scrapy.

date.py
=======

``to_date``
-----------

Converts a date string to a YYYY-MM-DD string suitable for ``DateField``.

**Decision**: Obsolete. ``DateField`` doesn't exist anymore.

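For illustration only, a minimal Python 3 sketch of such a date conversion
could look like the following. The ``input_format`` parameter and its default
are assumptions, since this SEP does not say which input formats are accepted.

::

    from datetime import datetime


    def to_date(value, input_format="%d/%m/%Y"):
        """Return ``value`` reformatted as a YYYY-MM-DD string.

        ``input_format`` is a hypothetical parameter: it tells ``strptime``
        how to parse the incoming date string.
        """
        parsed = datetime.strptime(value.strip(), input_format)
        return parsed.strftime("%Y-%m-%d")


    print(to_date("10/08/2009"))  # 2009-08-10
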
extraction.py
=============

``extract``
-----------

This adaptor tries to extract data from the given locations. Any
``XPathSelector`` among them is extracted, and any other data is added
as-is to the result.

**Decision**: Obsolete. Functionality included in ``XpathLoader``.

``ExtractImageLinks``
---------------------

This adaptor may receive either XPathSelectors pointing to the desired
locations for finding image URLs, or just a list of XPath expressions (which
will be turned into selectors anyway).

**Decision**: XXX

markup.py
=========

``remove_tags``
---------------

Factory that returns an adaptor for removing each tag in the ``tags`` parameter
found in the given value. If no ``tags`` are specified, all tags are
removed.

**Decision**: XXX

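A rough Python 3 sketch of a ``remove_tags`` factory, assuming a simple
regex-based approach (this SEP does not prescribe an implementation), might
be the following. Only the tags themselves are stripped; the text between them
is kept.

::

    import re


    def remove_tags(tags=()):
        """Return an adaptor that strips the given tags (all tags if empty)."""
        if tags:
            names = "|".join(re.escape(tag) for tag in tags)
            # \b keeps "b" from also matching longer names such as "br".
            pattern = re.compile(r"</?(?:%s)\b[^>]*>" % names, re.IGNORECASE)
        else:
            # Assumption: anything shaped like <name ...> or </name> is a tag.
            pattern = re.compile(r"</?[a-zA-Z][^>]*>")

        def adaptor(value):
            return pattern.sub("", value)

        return adaptor


    html = "<p>Hello <b>world</b></p>"
    print(remove_tags()(html))        # Hello world
    print(remove_tags(("b",))(html))  # <p>Hello world</p>
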
``remove_root``
---------------

This adaptor removes the root tag of the given string/unicode, if it's found.

**Decision**: XXX

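As a sketch only, root removal could be approximated in Python 3 with a
regular expression, assuming the root is a single outermost ``<tag>...</tag>``
pair. A regex cannot tell a real root from two sibling elements with the same
tag name, so a real implementation would likely need a parser.

::

    import re

    # Assumption: the root is an outermost <tag ...> ... </tag> pair.
    _ROOT_RE = re.compile(r"^\s*<(\w+)[^>]*>(.*)</\1>\s*$", re.DOTALL)


    def remove_root(value):
        """Return the content inside the root tag, or ``value`` unchanged."""
        match = _ROOT_RE.match(value)
        return match.group(2) if match else value


    print(remove_root("<div><p>text</p></div>"))  # <p>text</p>
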
``replace_escape``
------------------

Factory that returns an adaptor for removing/replacing each escape character in
the ``wich_ones`` parameter found in the given value.

**Decision**: XXX

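A possible shape for this factory in Python 3 is sketched below. The default
character set and the ``replace_by`` parameter are assumptions made for the
example; only ``wich_ones`` is named in this SEP.

::

    def replace_escape(wich_ones=("\n", "\t", "\r"), replace_by=""):
        """Return an adaptor replacing each character in ``wich_ones``.

        ``replace_by`` is a hypothetical parameter; the empty default simply
        removes the characters.
        """

        def adaptor(value):
            for char in wich_ones:
                value = value.replace(char, replace_by)
            return value

        return adaptor


    print(replace_escape()("a\n\tb\r"))                      # ab
    print(replace_escape(("\n",), replace_by=" ")("a\nb"))  # a b
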
``unquote``
-----------

This factory returns an adaptor that receives a string or unicode, removes all
CDATA sections and entities (except entities inside CDATA sections and those
listed in the ``keep`` parameter), and returns a new string or unicode.

**Decision**: XXX

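One way to read that description is sketched below in Python 3. It assumes
that "removing" an entity means converting it to its character, and that
"removing" a CDATA section means dropping the wrapper while keeping its content
untouched. Both readings are assumptions, not something this SEP states.

::

    import html
    import re

    _CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
    _ENTITY_RE = re.compile(r"&#?\w+;")


    def unquote(keep=()):
        """Return an adaptor that unwraps CDATA and converts entities."""

        def convert_entity(match):
            entity = match.group(0)
            name = entity[1:-1]  # strip the leading "&" and trailing ";"
            return entity if name in keep else html.unescape(entity)

        def adaptor(value):
            # With one capturing group, re.split alternates plain text
            # (even indexes) and CDATA contents (odd indexes).
            parts = _CDATA_RE.split(value)
            return "".join(
                part if index % 2 else _ENTITY_RE.sub(convert_entity, part)
                for index, part in enumerate(parts)
            )

        return adaptor


    print(unquote()("a &lt; b <![CDATA[&amp; raw]]> &amp;"))
    # a < b &amp; raw &
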
misc.py
=======

``to_unicode``
--------------

Receives a string and converts it to unicode using the given encoding (utf-8
if none is specified), returning a new unicode object. E.g.:

::

    >>> to_unicode('it costs 20\xe2\x82\xac, or 30\xc2\xa3')
    [u'it costs 20\u20ac, or 30\xa3']

**Decision**: XXX

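In Python 3 terms, where the byte string/unicode split maps to
``bytes``/``str``, a minimal sketch of this conversion could be the following.
Returning a plain string and passing through values that are already text are
assumptions made for this example.

::

    def to_unicode(value, encoding="utf-8"):
        """Decode ``value`` to text using ``encoding`` (utf-8 by default)."""
        if isinstance(value, str):
            # Assumption: text that is already decoded is returned unchanged.
            return value
        return value.decode(encoding)


    text = to_unicode(b"it costs 20\xe2\x82\xac, or 30\xc2\xa3")
    print(text == "it costs 20\u20ac, or 30\xa3")  # True
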
``clean_spaces``
----------------

Collapses runs of multiple spaces into a single space in the given string.
E.g.:

::

    >>> clean_spaces(u'Hello   sir')
    u'Hello sir'

**Decision**: XXX

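A short Python 3 sketch of this behaviour, assuming only literal space
characters count (tabs and newlines are left alone), might be:

::

    import re

    # Assumption: only runs of the plain space character are collapsed.
    _MULTISPACE_RE = re.compile(r" {2,}")


    def clean_spaces(value):
        """Collapse each run of two or more spaces into one space."""
        return _MULTISPACE_RE.sub(" ", value)


    print(clean_spaces("Hello   sir"))  # Hello sir
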
``drop_empty``
--------------

Removes any element that evaluates to false from the provided iterable. E.g.:

::

    >>> drop_empty([0, 'this', None, 'is', False, 'an example'])
    ['this', 'is', 'an example']

**Decision**: Obsolete. Functionality included in reducers.

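The example output drops ``0`` and ``False`` as well as ``None``, which
suggests a plain truthiness filter. Under that assumption, a Python 3 sketch
is a one-line comprehension:

::

    def drop_empty(iterable):
        """Return a list of the elements of ``iterable`` that are truthy."""
        return [item for item in iterable if item]


    print(drop_empty([0, "this", None, "is", False, "an example"]))
    # ['this', 'is', 'an example']
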
``delist``
----------

This factory returns an adaptor that joins an iterable with the specified
delimiter.

**Decision**: Obsolete. Functionality included in reducers.

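As an illustration, such a factory could be sketched in Python 3 as follows.
The empty default delimiter is an assumption, and the iterable is expected to
contain strings.

::

    def delist(delimiter=""):
        """Return an adaptor that joins an iterable of strings."""

        def adaptor(iterable):
            return delimiter.join(iterable)

        return adaptor


    print(delist(", ")(["a", "b", "c"]))  # a, b, c
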
``Regex``
---------

This adaptor receives either a list of strings or an ``XPathSelector`` and
returns a new list with the matches of the given regular expression in those
strings. The regular expression is passed as a keyword argument and is
mandatory for this adaptor.

**Decision**: XXX

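A minimal Python 3 sketch of the list-of-strings case could look like this.
The callable-class shape and the ``regex`` keyword name are assumptions, and
selector input is left out because it depends on the selector API.

::

    import re


    class Regex:
        """Adaptor returning all regex matches found in a list of strings."""

        def __init__(self, *, regex):
            # Keyword-only and required, mirroring the mandatory keyword
            # argument described in the text (the name is hypothetical).
            self.regex = re.compile(regex)

        def __call__(self, values):
            matches = []
            for value in values:
                matches.extend(self.regex.findall(value))
            return matches


    find_numbers = Regex(regex=r"\d+")
    print(find_numbers(["a1b22", "c", "333"]))  # ['1', '22', '333']
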