=====
Items
=====

The goal of the scraping process is to obtain Items (aka Scraped Items) from
scraped pages.

An Item is a container that holds the data extracted from a scraped page.

Scrapy represents Items using a model with fields, much like you would do in
an ORM.

Let's see an example::

    class NewsItem(Item):
        url = StringField()
        headline = StringField()
        summary = StringField()
        content = StringField()
        published = DateField()

Using this may seem complicated at first, but it gives you a lot of power over
scraped data, such as assigning defaults for fields that are missing from some
pages, performing validation, etc.
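
For instance, a default could make a missing field harmless. This is only a
sketch: the ``default`` and ``required`` arguments below are hypothetical and
not part of the draft API shown above::

    class NewsItem(Item):
        url = StringField(required=True)               # hypothetical argument
        summary = StringField(default='(no summary)')  # hypothetical argument
        published = DateField()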

To use Items you instantiate them and then assign values to their attributes;
the values will be converted to the expected Python types depending on the
field kind::

    ni = NewsItem()
    ni.url = 'http://www.news.com/news/1'
    ni.summary = 'Summary'
    ni.content = 'Content'
    ni.published = '2009-02-28'
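
Assuming ``DateField`` parses date strings like the one above (the draft does
not specify the exact conversion rules), reading the field back would yield a
real date object rather than a string::

    >>> ni.published = '2009-02-28'
    >>> ni.published  # converted by DateField (assumed behaviour)
    datetime.date(2009, 2, 28)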

============
ItemAdaptors
============

As you probably want to scrape the same kind of Items from many sources
(different websites, RSS feeds, etc.), Scrapy implements ItemAdaptors, which
allow you to adapt chunks of HTML or XML (selected using Selectors) to the
expected format of your Item fields.

An ItemAdaptor acts as a wrapper around an Item: you define an ItemAdaptor
class, set the Item class to wrap, and assign a set of functions (adaptor
functions) to be called when you assign a value to a field.

Here's an example of an ItemAdaptor for our previously created Item::

    class NewsAdaptor(ItemAdaptor):
        item_class = NewsItem

        url = adaptor(extract, remove_tags(), unquote(), strip)
        headline = adaptor(extract, remove_tags(), unquote(), strip)
        summary = adaptor(extract, remove_tags(), unquote(), strip)
        content = adaptor(extract, remove_tags(), unquote(), strip)

How do we use it? Let's see it in action in a Spider::

    def parse_newspage(self, response):
        xhs = HtmlXPathSelector(response)
        i = NewsAdaptor()

        i.url = response.url
        i.headline = xhs.x('//h1[@class="headline"]')
        i.summary = xhs.x('//div[@class="summary"]')
        i.content = xhs.x('//div[@id="body"]')
        # we intentionally leave published out of the example; see below
        # for site-specific adaptors
        return [i]

What happens underneath?

When we assign a value to an ItemAdaptor field, the value passes through the
chain of functions defined in its class. In this case, the value gets
extracted (note that we assign the value obtained from the Selector directly),
then tags are removed, then the result is unquoted, stripped, and finally
assigned to the Item field.
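
Conceptually, the assignment behaves roughly like the following sketch (the
``_item`` attribute is a hypothetical name for the wrapped Item instance)::

    # roughly what "i.headline = value" does behind the scenes:
    value = extract(value)        # pull text out of the selector
    value = remove_tags()(value)  # remove_tags() builds the actual function
    value = unquote()(value)
    value = strip(value)
    i._item.headline = value      # hypothetical internal Item assignment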

This final assignment is done on an internal instance of the Item inside the
ItemAdaptor; that's why we can return an ItemAdaptor instead of an Item and
Scrapy will know how to extract the Item from it.

An Item can have as many ItemAdaptors as you want; how many you need
generally depends on how many sources and formats you are scraping from.

ItemAdaptor inheritance
=======================

As we said before, you generally want an ItemAdaptor for each different source
of data, and maybe some for specific sites; inheritance makes this really
easy. Let's see an example of adapting HTML and XML::

    class NewsAdaptor(ItemAdaptor):
        item_class = NewsItem


    class HtmlNewsAdaptor(NewsAdaptor):
        url = adaptor(extract, remove_tags(), unquote(), strip)
        headline = adaptor(extract, remove_tags(), unquote(), strip)
        summary = adaptor(extract, remove_tags(), unquote(), strip)
        content = adaptor(extract, remove_tags(), unquote(), strip)
        published = adaptor(extract, remove_tags(), unquote(), strip)


    class XmlNewsAdaptor(HtmlNewsAdaptor):
        url = adaptor(extract, remove_root, strip)
        headline = adaptor(extract, remove_root, strip)
        summary = adaptor(extract, remove_root, strip)
        content = adaptor(extract, remove_root, strip)
        published = adaptor(extract, remove_root, strip)
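
A spider would then simply pick the adaptor that matches its source. A sketch
(the callback names and XPath expressions are invented, and
``XmlXPathSelector`` is assumed as the XML counterpart of
``HtmlXPathSelector``)::

    def parse_html_news(self, response):
        i = HtmlNewsAdaptor()
        i.url = response.url
        i.headline = HtmlXPathSelector(response).x('//h1')
        return [i]

    def parse_xml_news(self, response):
        i = XmlNewsAdaptor()
        i.url = response.url
        i.headline = XmlXPathSelector(response).x('//item/title')
        return [i]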

As publication dates on news sites generally use different formats, we may
want one adaptor for a specific site, or for a group of sites that share the
same date format; this is easily accomplished::

    class SpecificSiteNewsAdaptor(HtmlNewsAdaptor):
        published = adaptor(HtmlNewsAdaptor.published, to_date('%d.%m.%Y'))

This adds the adaptor function ``to_date`` to the end of our chain of
adaptors.
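
In other words, the effective chain for ``published`` is now equivalent to
writing (a sketch)::

    published = adaptor(extract, remove_tags(), unquote(), strip,
                        to_date('%d.%m.%Y'))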

Note that ``SpecificSiteNewsAdaptor`` will inherit the field adaptations for
all other fields from ``HtmlNewsAdaptor``.