mirror of https://github.com/scrapy/scrapy.git synced 2025-02-26 19:03:53 +00:00

reduced introduction text in proposed doc

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40836
This commit is contained in:
Ismael Carnales 2009-02-09 15:06:22 +00:00
parent d98f35af94
commit 844b59b5b3


@ -12,75 +12,38 @@ Overview
:height: 468
:alt: Scrapy architecture
.. _request-response:
Requests and Responses
----------------------
Scrapy uses *Requests* and *Responses* for crawling web sites.
.. _overview-spiders:
Generally, *Requests* are generated in the Spiders and pass across the system
until they reach the *Downloader*, which executes the *Request* and returns a
*Response* which goes back to the Spider that generated the *Request*.
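This Request → Downloader → Response cycle can be sketched in plain Python. The sketch below only illustrates the flow; the class and function names are hypothetical stand-ins, not Scrapy's actual API:

```python
# Illustrative sketch of the Request/Response cycle -- not Scrapy's real API.
class Request:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback  # the function the Response is handed back to

class Response:
    def __init__(self, url, body):
        self.url = url
        self.body = body

def downloader(request):
    # A real downloader would fetch the URL over HTTP; here we fake a body.
    return Response(request.url, body="<html>page at %s</html>" % request.url)

def parse(response):
    # Spider callback: receives the Response generated for its Request.
    return "parsed " + response.url

request = Request("http://example.com", callback=parse)
response = downloader(request)       # the Downloader executes the Request
result = request.callback(response)  # the Response goes back to the Spider
```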
Spiders
-------
Spiders are user-written classes to scrape information from a domain (or group
of domains).

They define an initial set of URLs (or Requests) to download, how to crawl the
domain, and how to scrape *Items* from their pages.
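As a rough illustration, a spider of this shape could look like the following plain-Python sketch. The names here are hypothetical (a real Scrapy spider subclasses the framework's base spider class and receives real Response objects):

```python
# Hypothetical sketch of the Spider concept -- not Scrapy's actual API.
class Spider:
    start_urls = []  # initial set of URLs to download

    def parse(self, response):
        # Override to decide how to crawl further and what data to scrape.
        raise NotImplementedError

class ExampleSpider(Spider):
    domain_name = "example.com"
    start_urls = ["http://example.com/index.html"]

    def parse(self, response):
        # Return scraped data; a real spider would also emit new Requests.
        return [{"url": response["url"], "title": response["title"]}]

spider = ExampleSpider()
# A plain dict stands in for a downloaded Response in this sketch.
items = spider.parse({"url": "http://example.com/index.html",
                      "title": "Example"})
```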
Items
-----

Items are the placeholder to use for the scraped data. They are represented by a
simple Python class.
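A minimal sketch of such a placeholder class, using nothing beyond plain Python (illustrative only; Scrapy's real item classes offer more than this):

```python
# A simple placeholder class for scraped data, in the spirit of the text.
class Item:
    def __init__(self, **fields):
        # Store each scraped field as an instance attribute.
        for name, value in fields.items():
            setattr(self, name, value)

    def __repr__(self):
        return "Item(%r)" % self.__dict__

book = Item(title="Some Book", price="19.90")
```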
After an Item has been scraped by a Spider, it is sent to the Item Pipeline for
further processing.

.. _item-pipeline:

Item Pipeline
-------------
The Item Pipeline is a list of user-written Python classes that implement a
specific method, which is called sequentially for every element of the Pipeline.
Each element receives the scraped Item, performs an action upon it (like
validating, checking for duplicates, or storing the item), and then decides
whether the Item continues through the Pipeline or is dropped.
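This sequential-call-with-drop behaviour can be sketched in plain Python. The class and method names below (including `DropItem` and the `process_item` signature) are illustrative assumptions, not necessarily Scrapy's exact API:

```python
# Plain-Python sketch of the Item Pipeline concept (hypothetical names).
class DropItem(Exception):
    """Raised by a pipeline element to stop an item from continuing."""

class ValidatePrice:
    def process_item(self, item):
        if not item.get("price"):
            raise DropItem("missing price")
        return item

class NormalizeTitle:
    def process_item(self, item):
        item["title"] = item["title"].strip().title()
        return item

def run_pipeline(item, elements):
    # Each element's method is called sequentially; DropItem ends the chain.
    try:
        for element in elements:
            item = element.process_item(item)
        return item
    except DropItem:
        return None

elements = [ValidatePrice(), NormalizeTitle()]
good = run_pipeline({"title": "  some book ", "price": "19.90"}, elements)
bad = run_pipeline({"title": "No Price", "price": ""}, elements)
```

Here `good` comes out normalized, while `bad` is dropped by the validation element and never reaches the rest of the Pipeline.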