1
0
mirror of https://github.com/scrapy/scrapy.git synced 2025-02-24 03:23:57 +00:00
scrapy/docs/experimental/newitems.rst

254 lines
7.0 KiB
ReStructuredText

.. _topics-newitems:
=================
Items (version 2)
=================
.. module:: scrapy.newitem
:synopsis: Item and Field classes
The main goal in scraping is to extract structured data from unstructured
sources, typically, web pages. Scrapy provides the :class:`Item` class for this
purpose.
:class:`Item` objects are simple containers used to collect the scraped data.
They provide a `dictionary-like`_ API with a convenient syntax for declaring
their available fields.
.. _dictionary-like: http://docs.python.org/library/stdtypes.html#dict
.. _topics-newitems-declaring:
Declaring Items
===============
Items are declared using a simple class definition syntax and :class:`Field`
objects. Here is an example::
from scrapy.newitem import Item, Field
class Product(Item):
name = Field()
price = Field()
stock = Field(default=0)
last_updated = Field()
.. note:: Those familiar with `Django`_ will notice that Scrapy Items are
declared similar to `Django Models`_, except that Scrapy Items are much
simpler as there is no concept of different field types.
.. _Django: http://www.djangoproject.com/
.. _Django Models: http://docs.djangoproject.com/en/dev/topics/db/models/
.. _topics-newitems-fields:
Item Fields
===========
:class:`Field` objects are used to specify metadata for each field. For
example, the default value for the ``stock`` field illustrated in the example
above.
You can specify any kind of metadata for each field. There is no restriction on
the values accepted by :class:`Field` objects. For this same
reason, there isn't a reference list of all available metadata keys. Each key
defined in :class:`Field` objects could be used by a different components, and
only those components know about it. You can also define and use any other
:class:`Field` key in your project too, for your own needs. The main goal of
:class:`Field` objects is to provide a way to define all field metadata in one
place. Typically, those components whose behaviour depends on each field, use
certain field keys to configure that behaviour. You must refer to their
documentation to see which metadata keys are used by each component.
It's important to note that the :class:`Field` objects used to declare the item
do not stay assigned as class attributes. Instead, they can be accesed through
the :attr:`Item.fields` attribute.
And that's all you need to know about declaring items.
Working with Items
==================
Here are some examples of common tasks performed with items, using the
``Product`` item :ref:`declared above <topics-newitems-declaring>`. You will
notice the API is very similar to the `dict API`_.
Creating items
--------------
::
>>> product = Product(name='Desktop PC', price=1000)
>>> print product
Product(name='Desktop PC', price=1000)
Getting field values
--------------------
::
>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC
>>> product['price']
1000
>>> product['stock'] # getting field with default value
0
>>> product['last_updated'] # getting field with no default value
Traceback (most recent call last):
...
KeyError: 'last_updated'
>>> product.get('last_updated', 'not set')
not set
>>> product['lala'] # getting unknown field
Traceback (most recent call last):
...
KeyError: 'lala'
>>> product.get('lala', 'unknown field')
'unknown field'
>>> 'name' in product # is name field populated?
True
>>> 'last_updated' in product # is last_updated populated?
False
>>> 'last_updated' in product.fields # is last_updated a declared field?
True
>>> 'lala' in product.fields # is lala a declared field?
False
Setting field values
--------------------
::
>>> product['last_updated'] = 'today'
>>> product['last_updated']
today
>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Accesing all populated values
-----------------------------
To access all populated values just use the typical `dict API`_::
>>> product.keys()
['price', 'name']
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]
Other common tasks
------------------
Copying items::
>>> product2 = Product(product)
>>> print product2
Product(name='Desktop PC', price=1000)
Creating dicts from items::
>>> dict(product) # create a dict from all populated values
{'price': 1000, 'name': 'Desktop PC'}
Creating items from dicts::
>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')
>>> Product({'name': 'Laptop PC', 'lala': 1500}) # warning: unknown field in dict
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Default values
==============
The only field metadata key supported by Items themselves is ``default``, which
specifies the default value to return when trying to access a field which
wasn't populated before.
So, for the ``Product`` item declared above::
>>> product = Product()
>>> product['stock'] # field with default value
0
>>> product['name'] # field with no default value
Traceback (most recent call last):
...
KeyError: 'name'
>>> product.get('name') is None
True
Extending Items
===============
You can extend Items (to add more fields or to change some metadata for some
fields) by declaring a subclass of your original Item.
For example::
class DiscountedProduct(Product):
discount_percent = Field(default=0)
discount_expiration_date = Field()
You can also extend field metadata by using the previous field metadata and
appending more values, or changing existing values, like this::
class SpecificProduct(Product):
name = Field(Product.fields['name'], default='product')
That adds (or replaces) the ``default`` metadata key for the ``name`` field,
keeping all the previously existing metadata values.
Item objects
============
.. class:: Item([arg])
Return a new Item optionally initialized from the given argument.
Items replicate the standard `dict API`_, including its constructor. The
only additional attribute provided by Items is:
.. attribute:: fields
A dictionary containing *all declared fields* for this Item, not only
those populated. The keys are the field names and the values are the
:class:`Field` objects used in the :ref:`Item declaration
<topics-newitems-declaring>`.
.. _dict API: http://docs.python.org/library/stdtypes.html#dict
Field objects
=============
.. class:: Field([arg])
The :class:`Field` class is just an alias to the built-in `dict`_ class and
doesn't provide any extra functionality or attributes. In other words,
:class:`Field` objects are plain-old Python dicts. A separate class is used
to support the :ref:`item declaration syntax <topics-newitems-declaring>`
based on class attributes.
.. _dict: http://docs.python.org/library/stdtypes.html#dict