mirror of
https://github.com/scrapy/scrapy.git
synced 2025-02-26 15:04:37 +00:00
reduced introduction text in proposed doc
--HG-- extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40836
This commit is contained in: parent d98f35af94, commit 844b59b5b3
@@ -12,75 +12,38 @@ Overview
    :height: 468
    :alt: Scrapy architecture
 
-.. _items:
-
-Items
------
-
-In Scrapy, Items are the placeholder to use for the scraped data. They are
-represented by a :class:`~scrapy.item.ScrapedItem` object, or any subclass
-instance, and store the information in instance attributes.
-
-.. _request-response:
-
 Requests and Responses
 ----------------------
 
-Scrapy uses :class:`~scrapy.http.Request` and :class:`~scrapy.http.Response`
-objects for crawling web sites.
+Scrapy uses *Requests* and *Responses* for crawling web sites.
 
-Generally, :class:`~scrapy.http.Request` objects are generated in the
-:ref:`Spiders <spiders>` (although they can be generated in any component of
-the framework), then they pass across the system until they reach the
-Downloader, which actually executes the request and returns a
-:class:`~scrapy.http.Response` object to the :class:`Request's callback
-function <scrapy.http.Request>`.
-
-.. _overview-spiders:
+Generally, *Requests* are generated in the Spiders and pass across the system
+until they reach the *Downloader*, which executes the *Request* and returns a
+*Response* which goes back to the Spider that generated the *Request*.
 
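The Request/Response round-trip described above can be modelled in a few lines of plain Python. This is an editor's illustration, not Scrapy's actual API: the `Request`, `Response`, and `download` names here are hypothetical stand-ins for `scrapy.http.Request`, `scrapy.http.Response`, and the Downloader component, so the flow runs without Scrapy or network access.

```python
# Hypothetical stand-ins modelling the flow: Spider makes a Request,
# the Downloader executes it, the Response goes back to the callback.

class Request:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback  # invoked with the resulting Response

class Response:
    def __init__(self, url, body):
        self.url = url
        self.body = body

def download(request):
    # A real Downloader would fetch request.url over HTTP; this stub
    # fabricates a body so the example is self-contained.
    return Response(request.url, body="<html>example</html>")

def parse(response):
    # The callback a Spider would supply; here it just measures the body.
    return len(response.body)

request = Request("http://example.com", callback=parse)
response = download(request)         # Downloader executes the Request
result = request.callback(response)  # Response goes back to the callback
```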
 Spiders
 -------
 
-Spiders are user written classes which define how a certain site (or domain)
-will be scraped; including how to crawl the site and how to scrape :ref:`Items
-<items>` from their pages.
+Spiders are user written classes to scrape information from a domain (or group
+of domains).
 
-All Spiders must be descendant of :class:`~scrapy.spider.BaseSpider` or any
-subclass of it, in :ref:`ref-spiders` you can see a list of available Spiders
-in Scrapy.
-
-.. _selectors:
+They define an initial set of URLs (or Requests) to download, how to crawl the
+domain and how to scrape *Items* from their pages.
 
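The Spider contract this section describes — an initial set of URLs plus logic to scrape items from pages — can be sketched as a plain class. The `start_urls` and `parse` names follow Scrapy's conventions, but this is an illustrative plain-Python class, not a runnable Scrapy spider of this era.

```python
# A sketch of a Spider: an initial set of URLs to download, plus a
# parse method that extracts items from a page body.

class ExampleSpider:
    domain_name = "example.com"  # the domain (or group) this spider covers
    start_urls = [
        "http://example.com/page1",
        "http://example.com/page2",
    ]

    def parse(self, body):
        # Stand-in for real scraping: collect every value after 'title:'.
        return [line.split("title:")[1].strip()
                for line in body.splitlines() if "title:" in line]

spider = ExampleSpider()
items = spider.parse("title: First\nnoise\ntitle: Second")
```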
-Selectors
----------
+Items
+-----
 
-Selectors are the recommended tool to extract information from documents. They
-retrieve information from the :ref:`Response <request-response>` body using
-`XPath <http://www.w3.org/TR/xpath>`_, a language for finding information in a
-XML document navigating trough its elements and attributes.
+Items are the placeholder to use for the scraped data. They are represented by a
+simple Python class.
 
-Scrapy defines a class :class:`~scrapy.xpath.XPathSelector`, that comes in two
-flavours, :class:`~scrapy.xpath.HtmlXPatSelector` (for HTML) and
-:class:`~scrapy.xpath.XmlXPathSelector` (for XML). In order to use them you
-must instantiate the desired class with a :ref:`Response <request-response>`
-object.
-
-You can see selectors as objects that represents nodes in the document
-structure. So, the first instantiated selectors are associated to the root
-node, or the entire document.
 
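The removed Selectors paragraphs describe extracting data from a response body with XPath. Scrapy's `XPathSelector` classes are not needed to illustrate the idea: the standard library's `xml.etree.ElementTree` supports a limited XPath subset, which is enough for a self-contained sketch (the document and query here are invented for illustration).

```python
# XPath-style extraction from a document body, using only the stdlib.
import xml.etree.ElementTree as ET

body = """
<html><body>
  <h1>Products</h1>
  <ul>
    <li class="item">Apple</li>
    <li class="item">Pear</li>
  </ul>
</body></html>
"""

root = ET.fromstring(body)
# './/li[@class="item"]' navigates the element tree and selects every
# <li> carrying class="item", much as the text describes.
names = [li.text for li in root.findall('.//li[@class="item"]')]
```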
-.. _item-pipeline:
+After an Item has been scraped by a Spider, it is sent to the Item Pipeline for further proccesing.
 
 Item Pipeline
 -------------
 
-After an :ref:`Item <items>` has been scraped by a :ref:`Spider <spiders>`, it
-is sent to the Item Pipeline which allows us to perform some actions over the
-:ref:`scrapped Items <items>`.
-
 The Item Pipeline is a list of user written Python classes that implement a
-specific method , which is called sequentially for every element of the
-Pipeline.
+specific method, which is called sequentially for every element of the Pipeline.
 
 Each element receives the Scraped Item, do an action upon it (like validating,
-checking for duplicates, store the item), and then decide if the Item
-continues trough the Pipeline or the item is dropped.
+checking for duplicates, store the item), and then decide if the Item continues
+trough the Pipeline or the item is dropped.