.. _topics-item-pipeline:

=============
Item Pipeline
=============

After an item has been scraped by a spider, it is sent to the Item Pipeline,
which processes it through several components that are executed sequentially.

Each item pipeline component (sometimes referred to as just an "Item Pipeline")
is a Python class that implements a simple method. It receives an Item and
performs an action on it, also deciding whether the Item should continue
through the pipeline or be dropped and no longer processed.

Typical uses of item pipelines are:

* cleansing HTML data
* validating scraped data (checking that the items contain certain fields)
* checking for duplicates (and dropping them)
* storing the scraped item in a database


Writing your own item pipeline
==============================

Writing your own item pipeline is easy. Each item pipeline component is a
single Python class that must implement the following method:

.. method:: process_item(item, spider)

   This method is called for every item pipeline component and must either return
   a :class:`~scrapy.item.Item` (or any descendant class) object or raise a
   :exc:`~scrapy.exceptions.DropItem` exception. Dropped items are no longer
   processed by further pipeline components.

   :param item: the item scraped
   :type item: :class:`~scrapy.item.Item` object

   :param spider: the spider which scraped the item
   :type spider: :class:`~scrapy.spider.Spider` object
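
As a rough illustration of this contract, the sketch below feeds items through
two components by hand. The ``DropItem`` class here is only a stand-in for
:exc:`~scrapy.exceptions.DropItem` so that the snippet runs without Scrapy, and
``StripName``, ``RequireName`` and ``run_components`` are hypothetical names
invented for the example; this is not how Scrapy itself drives the pipeline.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs alone."""


class StripName(object):
    def process_item(self, item, spider):
        item["name"] = item["name"].strip()
        return item  # returning the item passes it to the next component


class RequireName(object):
    def process_item(self, item, spider):
        if not item["name"]:
            raise DropItem("Missing name in %s" % item)
        return item


def run_components(item, components, spider=None):
    """Hypothetical driver: feed one item through each component in turn."""
    for component in components:
        try:
            item = component.process_item(item, spider)
        except DropItem:
            return None  # dropped: later components never see the item
    return item


components = [StripName(), RequireName()]
kept = run_components({"name": "  widget "}, components)
dropped = run_components({"name": "   "}, components)
```

Here ``kept`` is the cleaned item, while ``dropped`` is ``None`` because the
second component raised the exception.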

Additionally, pipeline components may implement the following methods:

.. method:: open_spider(spider)

   This method is called when the spider is opened.

   :param spider: the spider which was opened
   :type spider: :class:`~scrapy.spider.Spider` object

.. method:: close_spider(spider)

   This method is called when the spider is closed.

   :param spider: the spider which was closed
   :type spider: :class:`~scrapy.spider.Spider` object
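
A minimal sketch of how these hooks might fit together with ``process_item``
is shown below. ``ItemCounterPipeline`` and ``FakeSpider`` are hypothetical
names used only for this example, and the calls at the bottom merely imitate,
by hand, the order in which the methods are described above.

```python
class ItemCounterPipeline(object):
    """Hypothetical pipeline that counts the items seen during one run."""

    def open_spider(self, spider):
        # Called once when the spider is opened: set up per-run state.
        self.count = 0

    def process_item(self, item, spider):
        # Called for every item: update the state and pass the item on.
        self.count += 1
        return item

    def close_spider(self, spider):
        # Called once when the spider is closed: summarize the run.
        self.summary = "%s scraped %d items" % (spider.name, self.count)


class FakeSpider(object):
    """Stand-in for a real spider; only the ``name`` attribute is used."""
    name = "example"


# Imitate the call order by hand (normally the framework makes these calls).
pipeline = ItemCounterPipeline()
spider = FakeSpider()
pipeline.open_spider(spider)
results = [pipeline.process_item({"id": n}, spider) for n in range(3)]
pipeline.close_spider(spider)
```

Keeping per-run state in ``open_spider`` rather than ``__init__`` is one
possible design choice; it makes the setup and teardown points explicit.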


Item pipeline example
=====================

Price validation and dropping items with no prices
--------------------------------------------------

Let's take a look at the following hypothetical pipeline that adjusts the ``price``
attribute for those items that do not include VAT (``price_excludes_vat``
attribute), and drops those items which don't contain a price::

    from scrapy.exceptions import DropItem

    class PricePipeline(object):

        vat_factor = 1.15

        def process_item(self, item, spider):
            # .get() avoids a KeyError when the field was never populated
            if item.get('price'):
                if item.get('price_excludes_vat'):
                    item['price'] = item['price'] * self.vat_factor
                return item
            else:
                raise DropItem("Missing price in %s" % item)


Write items to a JSON file
--------------------------

The following pipeline stores all scraped items (from all spiders) in a
single ``items.jl`` file, containing one item per line serialized in JSON
format::

   import json

   class JsonWriterPipeline(object):

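       def __init__(self):
           # text mode: json.dumps() returns a str, not bytes
           self.file = open('items.jl', 'w')

       def process_item(self, item, spider):
           line = json.dumps(dict(item)) + "\n"
           self.file.write(line)
           return item

       def close_spider(self, spider):
           # flush and release the file when the spider finishes
           self.file.close()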

.. note:: The purpose of JsonWriterPipeline is just to introduce how to write
   item pipelines. If you really want to store all scraped items into a JSON
   file you should use the :ref:`Feed exports <topics-feed-exports>`.

Duplicates filter
-----------------

A filter that looks for duplicate items, and drops those items that were
already processed. Let's say that our items have a unique id, but our spider
returns multiple items with the same id::


    from scrapy.exceptions import DropItem

    class DuplicatesPipeline(object):

        def __init__(self):
            self.ids_seen = set()

        def process_item(self, item, spider):
            if item['id'] in self.ids_seen:
                raise DropItem("Duplicate item found: %s" % item)
            else:
                self.ids_seen.add(item['id'])
                return item


Activating an Item Pipeline component
=====================================

To activate an Item Pipeline component you must add its class to the
:setting:`ITEM_PIPELINES` setting, as in the following example::

   ITEM_PIPELINES = {
       'myproject.pipeline.PricePipeline': 300,
       'myproject.pipeline.JsonWriterPipeline': 800,
   }

The integer values you assign to classes in this setting determine the
order in which they run: items go through pipelines from the lowest order
number to the highest. It's customary to define these numbers in the 0-1000
range.
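
To make the effect of the order numbers concrete, the sketch below sorts such
a mapping by value. It only illustrates the low-to-high rule and is not
Scrapy's own loading code; the ``DuplicatesPipeline`` entry and its value of
500 are assumptions added for the example.

```python
# The same kind of mapping as in the setting; the third entry and its
# order number are hypothetical additions to make the ordering visible.
ITEM_PIPELINES = {
    'myproject.pipeline.JsonWriterPipeline': 800,
    'myproject.pipeline.PricePipeline': 300,
    'myproject.pipeline.DuplicatesPipeline': 500,
}

# Sort the class paths by their order number, lowest first.
run_order = [path for path, order in
             sorted(ITEM_PIPELINES.items(), key=lambda pair: pair[1])]
```

With these values an item would reach ``PricePipeline`` first, then
``DuplicatesPipeline``, and ``JsonWriterPipeline`` last, so only items that
survive validation and de-duplication are written out.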