mirror of
https://github.com/scrapy/scrapy.git
synced 2025-02-24 14:04:01 +00:00
108 lines
3.0 KiB
Plaintext
108 lines
3.0 KiB
Plaintext
= SEP-007: !ItemLoader processors library =
|
|
|
|
[[PageOutline(2-5, Contents)]]
|
|
|
|
||'''SEP'''||7||
|
|
||'''Title'''||!ItemLoader processors library||
|
|
||'''Author'''||Ismael Carnales||
|
|
||'''Created'''||2009-08-10||
|
|
||'''Status'''||Draft||
|
|
|
|
== Introduction ==
|
|
|
|
This SEP proposes a library of !ItemLoader processor to ship with Scrapy.
|
|
|
|
== date.py ==
|
|
|
|
=== `to_date` ===
|
|
|
|
Converts a date string to a YYYY-MM-DD one suitable for !DateField
|
|
|
|
'''Decision''': Obsolete. !DateField doesn't exists anymore.
|
|
|
|
== extraction.py ==
|
|
|
|
=== `extract` ===
|
|
|
|
This adaptor tries to extract data from the given locations. Any XPathSelector in it will be extracted, and any other data will be added as-is to the result.
|
|
|
|
'''Decision''': Obsolete. Functionality included in !XpathLoader.
|
|
|
|
=== `ExtractImageLinks` ===
|
|
|
|
This adaptor may receive either XPathSelectors pointing to the desired locations for finding image urls, or just a list of XPath expressions (which will be turned into selectors anyway).
|
|
|
|
'''Decision''': XXX
|
|
|
|
== markup.py ==
|
|
|
|
=== `remove_tags` ===
|
|
|
|
Factory that returns an adaptor for removing each tag in the `tags` parameter found in the given value. If no `tags` are specified, all of them are removed.
|
|
|
|
'''Decision''': XXX
|
|
|
|
=== `remove_root` ===
|
|
|
|
This adaptor removes the root tag of the given string/unicode, if it's found.
|
|
|
|
'''Decision''': XXX
|
|
|
|
=== `replace_escape` ===
|
|
|
|
Factory that returns an adaptor for removing/replacing each escape character in the `wich_ones` parameter found in the given value.
|
|
|
|
'''Decision''': XXX
|
|
|
|
=== `unquote` ===
|
|
|
|
This factory returns an adaptor that receives a string or unicode, removes all of the CDATAs and entities (except the ones in CDATAs, and the ones you specify in the `keep` parameter) and then, returns a new string or unicode.
|
|
|
|
'''Decision''': XXX
|
|
|
|
== misc.py ==
|
|
|
|
=== `to_unicode` ===
|
|
|
|
Receives a string and converts it to unicode using the given encoding (if specified, else utf-8 is used) and returns a new unicode object. E.g:
|
|
|
|
{{{
|
|
>> to_unicode('it costs 20\xe2\x82\xac, or 30\xc2\xa3')
|
|
[u'it costs 20\u20ac, or 30\xa3']
|
|
}}}
|
|
|
|
'''Decision''': XXX
|
|
|
|
=== `clean_spaces` ===
|
|
|
|
Converts multispaces into single spaces for the given string. E.g:
|
|
|
|
{{{
|
|
>> clean_spaces(u'Hello sir')
|
|
u'Hello sir'
|
|
}}}
|
|
|
|
'''Decision''': XXX
|
|
|
|
=== `drop_empty` ===
|
|
|
|
Removes any index that evaluates to None from the provided iterable. E.g:
|
|
|
|
{{{
|
|
>> drop_empty([0, 'this', None, 'is', False, 'an example'])
|
|
['this', 'is', 'an example']
|
|
}}}
|
|
|
|
'''Decision''': Obsolete. Functionality included in reducers.
|
|
|
|
=== `delist` ===
|
|
|
|
This factory returns and adaptor that joins an iterable with the specified delimiter.
|
|
|
|
'''Decision''': Obsolete. Functionality included in reducers.
|
|
|
|
=== `Regex` ===
|
|
|
|
This adaptor must receive either a list of strings or an XPathSelector and return a new list with the matches of the given strings with the given regular expression (which is passed by a keyword argument, and is mandatory for this adaptor).
|
|
|
|
'''Decision''': XXX |