
added some info on the item pipeline to the tutorial

Ismael Carnales 2009-01-29 17:57:21 +00:00
parent 9634af9aa7
commit e6375c6a5b


@@ -342,5 +342,31 @@ Now doing a crawl on the dmoz.org domain yields ScrapedItems::
[dmoz/dmoz.org] DEBUG: Scraped ScrapedItem({'title': [u'XML Processing with Python'], 'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'], 'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n']}) in <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>

Item Pipelines
==============

After an item has been scraped by a spider, it is sent to the Item
Pipeline, which lets us hook in our own components to perform actions on
the scraped Items. The most common of these actions are:

* Clean the HTML in the Items' attributes
* Validate the Items
* Store the Items

We can write our own item pipeline component by creating a simple Python
class that must define the following method:

.. method:: process_item(domain, item)

   ``domain`` is a string with the domain of the spider that scraped the
   item.

   ``item`` is the :class:`scrapy.item.ScrapedItem` instance that was
   scraped.

This method is called for every item pipeline component and must either
return a :class:`ScrapedItem` (or any subclass) object on success or raise
a :exc:`DropItem` exception (e.g. when an item fails a validation test).
Dropped items are no longer processed by further pipeline components.
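
For example, a minimal pipeline component that validates items might look
like the following sketch. The class name, the ``DropItem`` import path and
the attribute-style access to the item's ``title`` are assumptions made for
illustration; check your Scrapy version for the exact details::

    from scrapy.core.exceptions import DropItem  # assumed import path

    class ValidationPipeline(object):
        """Hypothetical component that drops items without a title."""

        def process_item(self, domain, item):
            # Attribute-style access is an assumption; adapt it to how
            # your items actually expose their fields.
            if getattr(item, 'title', None):
                # Returning the item passes it on to the next component.
                return item
            # Raising DropItem stops further pipeline processing of this item.
            raise DropItem("Missing title in item scraped from %s" % domain)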

You must then list the pipeline components that you want enabled in the
ITEM_PIPELINES setting of your project's settings file.
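
As a sketch, assuming the component above lives in a hypothetical
``myproject/pipelines.py`` module and that ITEM_PIPELINES takes a list of
dotted class paths (as it does in Scrapy releases of this era), the
settings file could contain::

    # settings.py -- the module and class names are illustrative
    ITEM_PIPELINES = [
        'myproject.pipelines.ValidationPipeline',
    ]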