mirror of https://github.com/scrapy/scrapy.git

first commit of doc apps

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40425

This commit is contained in:
parent 4241171947
commit 89bd6ede43
BIN  sites/scrapy.org/docs/pickle/.doctrees/basics.doctree  Normal file
Binary file not shown.
BIN  sites/scrapy.org/docs/pickle/.doctrees/environment.pickle  Normal file
Binary file not shown.
BIN  sites/scrapy.org/docs/pickle/.doctrees/index.doctree  Normal file
Binary file not shown.
BIN  sites/scrapy.org/docs/pickle/.doctrees/tutorial/index.doctree  Normal file
Binary file not shown.
BIN  sites/scrapy.org/docs/pickle/basics.fpickle  Normal file
Binary file not shown.
BIN  sites/scrapy.org/docs/pickle/environment.pickle  Normal file
Binary file not shown.
1  sites/scrapy.org/docs/pickle/genindex.fpickle  Normal file
@@ -0,0 +1 @@
€}q(Ugenindexcountsq]qUsplit_indexq‰Ugenindexentriesq]qUcurrent_page_nameqUgenindexqu.
BIN  sites/scrapy.org/docs/pickle/globalcontext.pickle  Normal file
Binary file not shown.
BIN  sites/scrapy.org/docs/pickle/index.fpickle  Normal file
Binary file not shown.
0  sites/scrapy.org/docs/pickle/last_build  Normal file
3  sites/scrapy.org/docs/pickle/objects.inv  Normal file
@@ -0,0 +1,3 @@
# Sphinx inventory version 1
# Project: Scrapy
# Version: 1.0
1  sites/scrapy.org/docs/pickle/search.fpickle  Normal file
@@ -0,0 +1 @@
€}qUcurrent_page_nameqUsearchqs.
2636  sites/scrapy.org/docs/pickle/searchindex.pickle  Normal file
File diff suppressed because it is too large
BIN  sites/scrapy.org/docs/pickle/tutorial/index.fpickle  Normal file
Binary file not shown.
BIN  sites/scrapy.org/docs/pickle/tutorial/tutorial1.fpickle  Normal file
Binary file not shown.
BIN  sites/scrapy.org/docs/pickle/tutorial/tutorial2.fpickle  Normal file
Binary file not shown.
BIN  sites/scrapy.org/docs/pickle/tutorial/tutorial3.fpickle  Normal file
Binary file not shown.
47  sites/scrapy.org/docs/sources/basics.rst  Normal file
@@ -0,0 +1,47 @@
======
Basics
======

Scrapy is built on `Twisted <http://twistedmatrix.com>`_, a Python platform for developing network applications in an asynchronous (non-blocking) way. This means that, while running in a single thread, a Scrapy application does not block while waiting for data to arrive from the network. Instead, it keeps processing whatever task requires CPU attention. When data arrives, a callback function is called with that data as a parameter. This is why this kind of code is also called event-driven or callback-based. For more information on this subject see `Asynchronous Programming with Twisted <http://twistedmatrix.com/projects/core/documentation/howto/async.html>`_.
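
To get a feel for the callback style, here is a minimal sketch using plain Twisted (no Scrapy involved): it downloads a single page and prints its size from a callback::

    from twisted.web.client import getPage
    from twisted.internet import reactor

    def print_size(body):
        # called only once the whole response body has arrived
        print "received %d bytes" % len(body)
        reactor.stop()

    getPage("http://scrapy.org/").addCallback(print_size)
    reactor.run()
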
When coding with Scrapy, the basic working loop is:

1. Feed the application with an initial set of URLs.
2. Create a [source:scrapy/trunk/scrapy/http/request.py Request] object for each given URL.
3. Attach callback functions to each Request, so you define what to do with the data once it arrives.
4. Feed the [source:scrapy/trunk/scrapy/core/engine.py Execution Engine] with a list of Requests and data.

Eventually, the callbacks can return data or more Request objects to the Execution Engine, so the crawling process continues until no requests are left.

At this point, it may seem that Scrapy can't do anything that couldn't be done easily with Twisted alone. But the value of Scrapy lies in the simple and straightforward model it gives the developer for implementing this basic scheme and, most importantly, in the fully integrated helpers it places at the developer's disposal to perform intensive network crawling and quickly deploy a fully functional application.

These helpers include:

* Parsers/interpreters for the most common languages found on the web (XML/HTML/CSV/JavaScript).
* Pluggable middleware for pre- and post-processing network requests and responses, covering specific network tasks such as HTTP authentication, redirection, compression, network caching, network debug logging, download retrying, cookie management, etc.
* Pluggable middleware for post-processing scraping results.
* Pluggable extensions such as memory debugging, memory usage control and profiling, CPU profiling, web/webservice and telnet control consoles, distributed crawling clusters, scraping statistics and a lot more.
* Built-in event signaling.
* Spider quality testing tools.
* Lots of useful utilities for many kinds of data processing.

In order to provide all these facilities, Scrapy relies heavily on the concept of middleware. Middleware is plugin code, organized in a pipeline model, that sits in the execution path between the spiders, the engine and the Web.

So, if you don't want to start a serious production crawling/scraping project from scratch, Scrapy is for you.

Getting Scrapy and starting a new project
=========================================

Scrapy is under heavy development, so you will want to download it from its Subversion repository::

    svn co http://svn.scrapy.org/scrapy

This command will create a folder named *scrapy*. The **scrapy** module lives under *scrapy/trunk*, so in order to import it, make sure its absolute path is in your PYTHONPATH environment variable.
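
For example, a quick sanity check from a Python console (assuming a hypothetical checkout location of */home/user/scrapy*)::

    import sys
    sys.path.insert(0, "/home/user/scrapy/trunk")  # hypothetical absolute path to the checkout

    import scrapy
    print scrapy.__file__  # should point inside scrapy/trunk/scrapy/
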
In order to start a new project, Scrapy provides a script that generates a basic folder structure for the new project, an initial settings file, and a control script.

Do::

    $ scrapy/trunk/scrapy/bin/scrapy-admin.py startproject myproject

You will see that a new folder, *myproject*, has been created in your current path. Let's continue with the Scrapy tutorial.

187  sites/scrapy.org/docs/sources/conf.py  Normal file
@@ -0,0 +1,187 @@
# -*- coding: utf-8 -*-
#
# Scrapy documentation build configuration file, created by
# sphinx-quickstart on Mon Nov 24 12:02:52 2008.
#
# This file is execfile()d with the current directory set to its containing dir.
#
# The contents of this file are pickled, so don't put values in the namespace
# that aren't pickleable (module imports are okay, they're removed automatically).
#
# All configuration values have a default; values that are commented out
# serve to show the default.

import sys, os

# If your extensions are in another directory, add it here. If the directory
# is relative to the documentation root, use os.path.abspath to make it
# absolute, like shown here.
#sys.path.append(os.path.abspath('.'))

# General configuration
# ---------------------

# Add any Sphinx extension module names here, as strings. They can be extensions
# coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = []

# Add any paths that contain templates here, relative to this directory.
templates_path = ['../templates/docs']

# The suffix of source filenames.
source_suffix = '.rst'

# The encoding of source files.
#source_encoding = 'utf-8'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = u'Scrapy'
copyright = u'2008, Insophia'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '1.0'
# The full version, including alpha/beta/rc tags.
release = '1.0'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
language = 'en'

# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'

# List of documents that shouldn't be included in the build.
#unused_docs = []

# List of directories, relative to source directory, that shouldn't be searched
# for source files.
exclude_trees = ['.build']

# The reST default role (used for this markup: `text`) to use for all documents.
#default_role = None

# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True

# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True

# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'


# Options for HTML output
# -----------------------

# The style sheet to use for HTML and HTML Help pages. A file of that name
# must exist either in Sphinx' static/ path, or in one of the custom paths
# given in html_static_path.
html_style = 'default.css'

# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None

# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None

# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
#html_logo = None

# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['.static']

# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'

# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True

# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}

# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}

# If false, no module index is generated.
#html_use_modindex = True

# If false, no index is generated.
#html_use_index = True

# If true, the index is split into individual pages for each letter.
#html_split_index = False

# If true, the reST sources are included in the HTML build as _sources/<name>.
html_copy_source = False

# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''

# If nonempty, this is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = ''

# Output file base name for HTML help builder.
htmlhelp_basename = 'Scrapydoc'


# Options for LaTeX output
# ------------------------

# The paper size ('letter' or 'a4').
#latex_paper_size = 'letter'

# The font size ('10pt', '11pt' or '12pt').
#latex_font_size = '10pt'

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title, author, document class [howto/manual]).
latex_documents = [
    ('index', 'Scrapy.tex', ur'Scrapy Documentation',
     ur'Insophia', 'manual'),
]

# The name of an image file (relative to this directory) to place at the top of
# the title page.
#latex_logo = None

# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False

# Additional stuff for the LaTeX preamble.
#latex_preamble = ''

# Documents to append as an appendix to all manuals.
#latex_appendices = []

# If false, no module index is generated.
#latex_use_modindex = True

13  sites/scrapy.org/docs/sources/index.rst  Normal file
@@ -0,0 +1,13 @@
.. Scrapy documentation master file, created by sphinx-quickstart on Mon Nov 24 12:02:52 2008.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Scrapy's documentation
======================

Contents:

.. toctree::

   basics
   tutorial/index

9  sites/scrapy.org/docs/sources/tutorial/index.rst  Normal file
@@ -0,0 +1,9 @@
========
Tutorial
========

.. toctree::

   tutorial1
   tutorial2
   tutorial3

74  sites/scrapy.org/docs/sources/tutorial/tutorial1.rst  Normal file
@@ -0,0 +1,74 @@
================
Our first spider
================

Let's code our first spider. But first, let's check two important associated settings in the *conf/scrapy_settings.py* file::

    SPIDER_MODULES = ['myproject.spiders']
    ENABLED_SPIDERS_FILE = '%s/conf/enabled_spiders.list' % myproject.__path__[0]

The first setting, SPIDER_MODULES, is a list of the modules that contain the spiders. The second one, ENABLED_SPIDERS_FILE, sets the location of a text file that contains the list of enabled spiders. When you created the project, the admin script set both of them for you. With these values, **Scrapy** will search for spiders in the module *myproject.spiders* and will read the enabled spiders list from *<path to myproject>/myproject/conf/enabled_spiders.list*, where *<path to myproject>* is the path where the *myproject* module resides. Of course, you can change these settings to taste; they are just defaults to help you.

Now, finally, the code of our first spider::

    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):

        domain_name = "scrapy.org"

        start_urls = ["http://dev.scrapy.org/wiki/WikiStart", "http://dev.scrapy.org/wiki/Starting"]

        def parse(self, response):
            filename = response.url.split("/")[-1]
            open(filename, "w").write(response.body.to_unicode())

            return []

    CRAWLER = MySpider()

The first line imports the class [source:scrapy/trunk/scrapy/spider/models.py BaseSpider]. To create a working spider, you must subclass BaseSpider and then define its three main, mandatory attributes:

* *domain_name* identifies the spider. It must be unique, that is, you can't set the same domain name for different spiders.
* *start_urls* is a list of URLs where the spider will begin to crawl from. So, the first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.
* *parse* is the callback method of the spider. Each time a page is retrieved, the downloaded data is passed to this method. In this simple example the only action is to save the data, but anything can be done here: parse the data, organize it and store it in a database or in the filesystem, process it, get new URLs to continue the crawling process, etc.

The *parse()* method must always return a list. We will see why later.

In the last line, we instantiate our spider class.

So, save this code in a file named *myfirst.py* inside the *myproject/spiders* folder, and create the file *conf/enabled_spiders.list* with the content::

    scrapy.org

to enable our new spider.

Now, go to the myproject base folder and run::

    ./scrapy-ctl.py crawl

The **crawl** subcommand runs all the enabled spiders. The output of this command will be something like::

    2008/07/27 19:46 -0200 [-] Log opened.
    2008/07/27 19:46 -0200 [scrapy-bot] INFO: Enabled extensions: TelnetConsole, WebConsole
    2008/07/27 19:46 -0200 [scrapy-bot] INFO: Enabled downloader middlewares:
    2008/07/27 19:46 -0200 [scrapy-bot] INFO: Enabled spider middlewares:
    2008/07/27 19:46 -0200 [scrapy-bot] INFO: Enabled item pipelines:
    2008/07/27 19:46 -0200 [scrapy-bot/scrapy.org] INFO: Domain opened
    2008/07/27 19:46 -0200 [scrapy-bot/scrapy.org] DEBUG: Crawled live <http://dev.scrapy.org/wiki/WikiStart> from <None>
    2008/07/27 19:46 -0200 [scrapy-bot/scrapy.org] DEBUG: Crawled live <http://dev.scrapy.org/wiki/Starting> from <None>
    2008/07/27 19:46 -0200 [scrapy-bot/scrapy.org] INFO: Domain closed (finished)
    2008/07/27 19:46 -0200 [-] Main loop terminated.

Pay attention to the lines labeled [scrapy-bot/scrapy.org], which correspond to our spider, identified by the domain "scrapy.org". You can see a log line for each URL defined in *start_urls*. Because these URLs are the starting ones, they have no referrers, which is indicated at the end of the log line, where it says *from <None>*.

But more interesting: as our *parse* method instructs, two files have been created, *WikiStart* and *Starting*, with the content of both URLs.

If you recall the basic loop of Scrapy described before:

1. Feed the application with an initial set of URLs.
2. Create a Request object for each given URL.
3. Attach callback functions to each Request, so you define what to do with the data once it arrives.
4. Feed the Execution Engine with a list of Requests and data.

you will see that the sample spider we made here (and every spider) explicitly performs steps 1 and 3, while steps 2 and 4 are carried out behind the scenes. Most spiders will, at some point, explicitly create Requests of their own, but they never need to feed the engine themselves; in fact, they never do that.
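
For instance, a *parse* method that wants to follow a link could end like this. This is only a sketch: the import path and the exact Request signature are assumptions, so check [source:scrapy/trunk/scrapy/http/request.py Request] for the real interface::

    from scrapy.http import Request

    def parse(self, response):
        # hypothetical: ask the engine to download another page and
        # parse it with a different callback
        return [Request("http://dev.scrapy.org/wiki/SomeOtherPage",  # hypothetical URL
                        callback=self.parse_other)]

    def parse_other(self, response):
        return []
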
173  sites/scrapy.org/docs/sources/tutorial/tutorial2.rst  Normal file
@@ -0,0 +1,173 @@
=========
Selectors
=========

In the last section we learned how to make a spider, that is, the piece of code that actually does something with the data retrieved from the web. But we want more than just saving this data to a file: we want to extract and classify the data that is meaningful to us. So, first we have to *parse* the data contained in the page. We could use, for example, regular expressions, and indeed Scrapy has support for them.

But web pages have a particular language structure, a markup language, that can be accessed by more suitable means. One way, and the one adopted by Scrapy, is XPath. Let's suppose we have the following HTML code::

    <html>
      <head><title>Commodities</title></head>
      <body>
        <table>

          <tr>
            <td>Gold</td>
            <td class=price>939.12US$/oz</td>
          </tr>
          <tr>
            <td>Oil</td>
            <td class=price>123.44US$/bbl</td>
          </tr>

        </table>
      </body>
    </html>

We can access the title content with the XPath expression "/html/head/title", which, according to the XPath specification, evaluates to "<title>Commodities</title>". "/html/head/title/text()" evaluates to "Commodities".

If you want to access all the td elements, the XPath expression is "//td", and the result is multivalued: "<td>Gold</td>", "<td class=price>939.12US$/oz</td>", "<td>Oil</td>", "<td class=price>123.44US$/bbl</td>".

If you want to access the first <td> inside each parent tag (in this case, <tr>), the XPath is "//td[1]", and the result is multivalued: "<td>Gold</td>", "<td>Oil</td>".

Also, you can access tags by their attributes. For example, "//td[@class='price']" will match "<td class="price">939.12US$/oz</td>" and "<td class="price">123.44US$/bbl</td>".

The value of the XPath "//body" will be the entire content of the body tag, and so on. For an XPath tutorial see http://www.w3schools.com/xpath/default.asp. I strongly recommend reading that tutorial before continuing here. Usually we will not be using such simple XPath expressions.

In order to work with XPaths, Scrapy defines a selector class, [source:scrapy/trunk/scrapy/xpath/selector.py XPathSelector]. To access the different elements of an HTML page, you instantiate an XPathSelector with a response object. XPathSelector comes in two flavors: XmlXPathSelector and HtmlXPathSelector. At this point I will introduce a very useful tool, the Scrapy console (*shell*), so I can illustrate the use of selectors. I strongly recommend installing the IPython package to experience the Scrapy shell in the most convenient way. Save the above HTML code in a file, e.g. *sample.html*, and run::

    $ ./scrapy-ctl.py shell file://sample.html
    Scrapy 0.1.0 - Interactive scraping console

    Enabling Scrapy extensions... done
    Downloading URL... done
    ------------------------------------------------------------------------------
    Available local variables:
       xxs: <class 'scrapy.xpath.selector.XmlXPathSelector'>
       url: http://www.bloomberg.com/markets/commodities/cfutures.html
       spider: <class 'scrapy.spider.models.BaseSpider'>
       hxs: <class 'scrapy.xpath.selector.HtmlXPathSelector'>
       item: <class 'scrapy.item.models.ScrapedItem'>
       response: <class 'scrapy.http.response.Response'>
    Available commands:
       get <url>: Fetches an url and updates all variables.
       scrapehelp: Prints this help.
    ------------------------------------------------------------------------------
    Python 2.5.2 (r252:60911, Apr 21 2008, 11:12:42)
    Type "copyright", "credits" or "license" for more information.

    IPython 0.8.1 -- An enhanced Interactive Python.
    ?       -> Introduction to IPython's features.
    %magic  -> Information about IPython's 'magic' % functions.
    help    -> Python's own help system.
    object? -> Details about 'object'. ?object also works, ?? prints more.

    In [1]:

IPython is an extended Python console, and the *shell* command adds some useful local variables and methods, sets the Python path and imports some Scrapy libraries. One of the loaded variables is *response*, which contains all the data associated with the result of the request for the given URL. If you enter *response.body.to_unicode()*, the downloaded data will be printed on screen.

*xxs* and *hxs* are selectors already instantiated with this response as their initialization parameter. *xxs* is a selector with XML 'flavor', and *hxs* is a selector with HTML 'flavor'. In the present case, we will use *hxs*. You can think of selectors as objects that represent nodes in the document structure, so these instantiated selectors are associated with the root node, that is, the entire document.

Selectors have three methods: *x*, *re* and *extract*:

- *x* returns a list of selectors, each of them representing a node matched by the XPath expression given as a parameter.
- *re* returns a list of matches of the regular expression given as a parameter.
- *extract* actually extracts the data contained in the node. It does not receive parameters.

A list of selectors, an XPathSelectorList object, has the same methods, but they are evaluated on each XPathSelector of the list.

Examples::

    In [1]: hxs.x("/html")
    Out[1]: [<HtmlXPathSelector (html) xpath=/html>]

    In [2]: hxs.x("//td[@class='price']")
    Out[2]:
    [<HtmlXPathSelector (td) xpath=//td[@class='price']>,
     <HtmlXPathSelector (td) xpath=//td[@class='price']>]

    In [3]: _.extract()
    Out[3]:
    [u'<td class="price">939.12US$/oz</td>',
     u'<td class="price">123.44US$/bbl</td>']

    In [4]: hxs.re("\d+.\d+US\$")
    Out[4]: [u'939.12US$', u'123.44US$']

    In [5]: hxs.x("//td[2]").re("\d+")
    Out[5]: [u'939', u'12', u'123', u'44']

This is a trivial example, but pages retrieved from the web won't be so simple. Let's suppose we are interested in financial data and want to extract the gold, oil and soy prices from the Bloomberg commodities page, http://www.bloomberg.com/markets/commodities/cfutures.html

Let's try::

    $ ./scrapy-ctl.py shell http://www.bloomberg.com/markets/commodities/cfutures.html

And inside the Scrapy console::

    In [1]: response.body.to_unicode()
    ...

We will get a big chunk of text, and the data we want is not easily targeted there. We need three prices, and we could search for them by trial and error, but that is a tedious job. Fortunately, someone has developed a very practical tool for this: `Firebug <https://addons.mozilla.org/firefox/addon/1843>`_, an add-on for Mozilla Firefox. Install it, then open the Bloomberg URL given above in Firefox. Point the mouse at the Gold row, Price column, right-click and select Inspect Element (this inspect option is added by Firebug). A tab will open with the page source code and, highlighted, the code corresponding to the gold price:

.. image:: firebug.png

(Observe that, as you point the mouse at different elements in the code tab, the matching rendered region in the browser is highlighted as well.)

So, you could find the gold price with the XPath expression *//span[@class='style5']*. If you type *hxs.x("//span[@class='style5']/text()").extract()* in the Scrapy console you will get a large list of nodes, because multiple instances of this XPath pattern were found, so you could select the one corresponding to gold with the expression *hxs.x("//span[@class='style5']/text()").extract()[73]*.

There are always several ways to target the same element. We could find the same gold price, for example, with the expression *hxs.x("//span[contains(text(),'GOLD')]/../following-sibling::td[1]/span/text()").extract()*. And this is an interesting approach, because we could do::

    In [2]: def get_commodity_price(text):
       ....:     return hxs.x("//span[contains(text(),'%s')]/../following-sibling::td[1]/span/text()" % text).extract()[0]
       ....:

and then::

    In [3]: get_commodity_price('GOLD')
    Out[3]: u'916.500'

    In [4]: get_commodity_price('WTI CRUDE')
    Out[4]: u'121.590'

    In [5]: get_commodity_price('SOYBEAN FUTURE')
    Out[5]: u'1390.750'

And so on.

So, we can add the following spider to our project::

    from scrapy.spider import BaseSpider
    from scrapy.xpath import HtmlXPathSelector

    class MySpider(BaseSpider):

        domain_name = "bloomberg.com"

        start_urls = ["http://www.bloomberg.com/markets/commodities/cfutures.html"]

        def parse(self, response):

            hxs = HtmlXPathSelector(response)
            def get_commodity_price(text):
                return hxs.x("//span[contains(text(),'%s')]/../following-sibling::td[1]/span/text()" % text).extract()[0]

            print "Gold Futures NY: %sUS$/oz" % get_commodity_price("GOLD")
            print "Oil WTI Futures: %sUS$/bbl" % get_commodity_price("WTI CRUDE")
            print "Soybean: %sUS$/bu" % get_commodity_price("SOYBEAN FUTURE")

            return []

    CRAWLER = MySpider()

Save it as *spiders/bloomberg.py*, for example, and enable it by adding "bloomberg.com" to *conf/enabled_spiders.list*. Actually, the name given to the module that contains the spider does not matter: Scrapy will find the correct spider by looking at its *domain_name* attribute.

Do a crawl (*scrapy-ctl.py crawl*) and both spiders will be run. As part of the log output we will get something like this::

    2008/07/30 12:40 -0200 [HTTPPageGetter,client] Gold Futures NY: 905.400US$/oz
    2008/07/30 12:40 -0200 [HTTPPageGetter,client] Oil WTI Futures: 121.210US$/bbl
    2008/07/30 12:40 -0200 [HTTPPageGetter,client] Soybean: 1390.750US$/bu

If you want to crawl only one domain among all the available ones, just add the domain name to the command line::

    ./scrapy-ctl.py crawl bloomberg.com

173  sites/scrapy.org/docs/sources/tutorial/tutorial3.rst  Normal file
@@ -0,0 +1,173 @@
==================
Items and Adaptors
==================

At this point we have seen how to scrape data from a web page, but the only thing we did with it was print it on the screen. Frequently we need to organize, process, or store the data we have retrieved. For these jobs, the Scrapy model includes a data encapsulation class: [source:scrapy/trunk/scrapy/item/models.py ScrapedItem].

A [source:scrapy/trunk/scrapy/item/models.py ScrapedItem] contains attributes with given values, just like any Python object, and you use the *attribute* method to assign an attribute with a given name and value to the item. Of course, you could simply do something like::

    index.name = "Gold Future"
    index.value = get_commodity_price("GOLD")

with *index* being an instance of a given object class. But the *attribute* method is intended to do more than just assign. Very often data from the web contains entities or tags, or comes inside a larger piece of text, and it always arrives as strings. You may want to remove entities and tags (or not) depending on the attribute name, extract the significant part from a bigger text (e.g. dimensions or prices inside a sentence), convert data to floats or integers, validate the assigned data, or normalize some of its features. You could do all this inside the spider code, but that way you would repeat a lot of code and even introduce data inconsistencies from spider to spider, simply because you forgot to apply the correct data processing according to the policies you have established (imagine working with lots and lots of different spiders and source webpages).
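
With the *attribute* method, the equivalent assignments look like this (the same calls we will use in the spider below)::

    index.attribute("name", "Gold Future")
    index.attribute("value", get_commodity_price("GOLD"))
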
When you use the *attribute* method, the data passes through an adaptation pipeline before finally being assigned. The purpose of this adaptation pipeline is precisely to apply all the data processing in a user-defined, simple and scalable way, without having to implement it inside the spider code each time you add a new spider.

Let's continue our Bloomberg commodities example. Modify our bloomberg.py file as follows::

    from scrapy.spider import BaseSpider
    from scrapy.xpath import HtmlXPathSelector

    from scrapy.item import ScrapedItem as FinancyIndex

    class BloombergSpider(BaseSpider):

        domain_name = "bloomberg.com"

        start_urls = ["http://www.bloomberg.com/markets/commodities/cfutures.html",
                      "http://www.bloomberg.com/markets/stocks/movers_index_spx.html"]

        def parse(self, response):

            hxs = HtmlXPathSelector(response)

            def get_commodity_price(text):
                value = hxs.x("//span[contains(text(),'%s')]/../following-sibling::td[1]/span/text()" % text).extract()[0]
                unit = hxs.x("//span[contains(text(),'%s')]/following-sibling::span" % text).re("\((.*)\)")[0]
                return value, unit

            def get_index_value():
                return hxs.x("//span[contains(text(),'VALUE')]/../following-sibling::td/text()").extract()[0]

            items = []
            if "cfutures" in response.url:
                for i in [("GOLD", "Gold Futures NY"), ("WTI CRUDE", "Oil WTI Futures"), ("SOYBEAN FUTURE", "Soybean")]:
                    index = FinancyIndex()
                    index.attribute("name", i[1])
                    value, unit = get_commodity_price(i[0])
                    index.attribute("value", value)
                    index.attribute("unit", unit)
                    items.append(index)
            elif "index_spx" in response.url:
                index = FinancyIndex()
                index.attribute("name", "S&P500")
                index.attribute("value", get_index_value())
                index.attribute("unit", "%")
                items.append(index)
            return items

    CRAWLER = BloombergSpider()

Note two important differences from our previous version:

* We have added a second URL to our *start_urls* list. This means our spider will visit two webpages. In order to handle this, we have split our code in two parts, according to the response URL: one that handles the commodities page, and another that handles the Standard & Poor's index page. We can add to the *start_urls* list any amount of URLs we want to visit, but this is an inefficient approach if we want to scrape lots of pages from a site (it is common to visit hundreds or even thousands of them). Later we will learn crawling techniques with Scrapy for generating and following links in a site, beginning from a few starting URLs.
* Instead of printing the scraped data to the console, we assign it to [source:scrapy/trunk/scrapy/item/models.py ScrapedItem] objects. Each item contains three attributes: name, value and unit (we have also added code to scrape the units). Then, the *parse* method returns a list of the scraped items. Remember we said in the previous section that *parse* must always return a list.

If we run the Scrapy shell with our commodities URL as parameter and call *spider.parse(response)*, we will get a list of three items::

    In [1]: spider.parse(response)
    Out[1]:
    [<scrapy.item.models.ScrapedItem object at 0x8ae602c>,
     <scrapy.item.models.ScrapedItem object at 0x8ae616c>,
     <scrapy.item.models.ScrapedItem object at 0x8ae606c>]

So *parse* is doing what we expect: it returns three items, one for each index. If we want to see a more detailed output, *scrapy-ctl* provides another useful tool, the *parse* subcommand. Try::

    $ ./scrapy-ctl.py parse http://www.bloomberg.com/markets/commodities/cfutures.html
    2008/09/02 17:40 -0200 [-] Log opened.
    2008/09/02 17:40 -0200 [scrapy-bot] INFO: Enabled extensions: TelnetConsole, WebConsole
    2008/09/02 17:40 -0200 [scrapy-bot] INFO: Enabled downloader middlewares:
    2008/09/02 17:40 -0200 [scrapy-bot] INFO: Enabled spider middlewares:
    2008/09/02 17:40 -0200 [scrapy-bot] INFO: Enabled item pipelines:
    2008/09/02 17:40 -0200 [scrapy-bot/bloomberg.com] INFO: Domain opened
    2008/09/02 17:40 -0200 [scrapy-bot/bloomberg.com] DEBUG: Crawled live <http://www.bloomberg.com/markets/commodities/cfutures.html> from <None>
    2008/09/02 17:40 -0200 [scrapy-bot/bloomberg.com] INFO: Domain closed (finished)
    2008/09/02 17:40 -0200 [-] Main loop terminated.
    # Scraped Items ------------------------------------------------------------
    ScrapedItem({'name': 'Gold Futures NY', 'unit': u'USD/t oz.', 'value': u'812.700'})
    ScrapedItem({'name': 'Oil WTI Futures', 'unit': u'USD/bbl.', 'value': u'110.430'})
    ScrapedItem({'name': 'Soybean', 'unit': u'USd/bu.', 'value': u'1298.500'})

    # Links --------------------------------------------------------------------

    $

We get a nice printout of our items, displaying all their attributes. Observe that the attribute values are raw data, extracted as is from the page: everything is a string, while we may want to operate with decimal values; the unit expressions are not homogeneous (note the dollar symbols USD and USd); and the item names were not given as unicode strings. So, let's build an adaptor pipeline and create our own item class for this purpose::

    from decimal import Decimal
    import re

    from scrapy.item.models import ScrapedItem
    from scrapy.item.adaptors import AdaptorPipe

    def extract(value):
        if hasattr(value, 'extract'):
            value = value.extract()
        if isinstance(value, list):
            value = value[0]
        return value

    def to_decimal(value):
        return Decimal(value)

    def to_unicode(value):
        return unicode(value)

    _dollars_re = re.compile("[Uu][Ss][Dd]")
    def normalize_units(value):
        return _dollars_re.sub("U$S", value)

    def clean(value):
        return value.strip()

    def clean_number(value):
        if value.find(".") > value.find(","):
            value = value.replace(",", "")
        elif value.find(",") > value.find("."):
            value = value.replace(".", "")
            value = value.replace(",", ".")
        return value

    pipedict = {
        'name': [extract, to_unicode, clean],
        'unit': [extract, to_unicode, clean, normalize_units],
        'value': [extract, to_unicode, clean, clean_number, to_decimal]
    }

    class FinancyIndex(ScrapedItem):

        adaptors_pipe = AdaptorPipe(pipedict)

Save this as *item.py* and, in the Bloomberg spider code, replace the line::

    from scrapy.item import ScrapedItem as FinancyIndex

by::

    from financy.item import FinancyIndex

Also, you can remove the *extract()* and *[0]* calls from our helper functions *get_commodity_price* and *get_index_value*::

    def get_commodity_price(text):
        value = hxs.x("//span[contains(text(),'%s')]/../following-sibling::td[1]/span/text()" % text)
        unit = hxs.x("//span[contains(text(),'%s')]/following-sibling::span" % text).re("\((.*)\)")
        return value, unit

    def get_index_value():
        return hxs.x("//span[contains(text(),'VALUE')]/../following-sibling::td/text()")

The extract adaptor will do that for us.

Run the *parse* command again::

    # Scraped Items ------------------------------------------------------------
    FinancyIndex({'name': u'Gold Futures NY', 'unit': u'U$S/t oz.', 'value': Decimal("809.600")})
    FinancyIndex({'name': u'Oil WTI Futures', 'unit': u'U$S/bbl.', 'value': Decimal("110.240")})
    FinancyIndex({'name': u'Soybean', 'unit': u'U$S/bu.', 'value': Decimal("1298.500")})

Very nice, huh? And this adaptor pipeline will be applied to every spider you add to your project --which is the whole point of the adaptor pipeline--, provided you use the item's *attribute* method to assign the values.

Adaptors run in the specified order, and you must take care that each adaptor receives from the previous one what it expects to receive. In order to enable an adaptors pipeline, you instantiate an [source:scrapy/trunk/scrapy/item/adaptors.py AdaptorPipe] in your item class (in this example, *financy.item.FinancyIndex*) with a dictionary that maps each *attribute name* to a pipeline of adaptation functions, and assign it to the class attribute *adaptors_pipe*.

You can edit the pipeline at any time by accessing the *pipe* attribute of the AdaptorPipe instance. For example, you may want to remove or add an adaptor from spider code for a certain group of attributes. But take into account that, because the pipeline is a class attribute, any change will affect every item of that class.
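
For example, a project that does not want the dollar-sign normalization could drop that adaptor at startup. This is a hypothetical sketch: it assumes *pipe* exposes the same attribute-name-to-function-list mapping that was passed to AdaptorPipe::

    from financy.item import FinancyIndex, normalize_units

    # hypothetical: drop one adaptor from the 'unit' pipeline; since the
    # pipeline is a class attribute, this affects every FinancyIndex from now on
    FinancyIndex.adaptors_pipe.pipe['unit'].remove(normalize_units)
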
0  sites/scrapy.org/scrapyorg/docs/__init__.py  Normal file
3  sites/scrapy.org/scrapyorg/docs/models.py  Normal file
@@ -0,0 +1,3 @@
from django.db import models

# Create your models here.
18  sites/scrapy.org/scrapyorg/docs/urls.py  Normal file
@@ -0,0 +1,18 @@
from django.conf import settings
from django.conf.urls.defaults import *

from scrapyorg.docs.views import index, document


urlpatterns = patterns('',
    (r'^$', index),
    (r'^(?P<url>[\w./-]*)/$', document),
)

if settings.DEBUG: # devel
    urlpatterns += patterns('',
        (r'^%s/(?P<path>.*)$' % settings.MEDIA_URL[1:],
         'django.views.static.serve',
         {'document_root': settings.MEDIA_ROOT}),
    )

28  sites/scrapy.org/scrapyorg/docs/views.py  Normal file
@@ -0,0 +1,28 @@
import cPickle as pickle
import os

from django.conf import settings
from django.http import Http404
from django.shortcuts import render_to_response
from django.template import RequestContext


def index(request):
    return document(request, '')


def document(request, url):
    docroot = settings.DOC_PICKLE_ROOT

    if os.path.exists(os.path.join(docroot, url, 'index.fpickle')):
        docpath = os.path.join(docroot, url, 'index.fpickle')
    elif os.path.exists(os.path.join(docroot, url + '.fpickle')):
        docpath = os.path.join(docroot, url + '.fpickle')
    else:
        raise Http404("'%s' does not exist" % url)

    docfile = open(docpath, 'rb')
    doc = pickle.load(docfile)

    return render_to_response('docs/doc.html', {'doc': doc},
        context_instance=RequestContext(request))

@@ -79,8 +79,11 @@ INSTALLED_APPS = (
     'django.contrib.markup',
     'scrapyorg.article',
     'scrapyorg.download',
+    'scrapyorg.docs',
 )
 
+DOC_PICKLE_ROOT = os.path.join(PROJECT_ROOT, 'docs', 'pickle')
+
 # Override previous settings with values in local_settings.py settings file.
 try:
     from local_settings import *

@@ -10,6 +10,9 @@ urlpatterns = patterns('',
     # admin
     url(r"^admin/download/downloadlink/", include("scrapyorg.download.urls")),
     url(r'^admin/(.*)', admin.site.root),
+
+    # docs
+    url(r"^docs/", include("scrapyorg.docs.urls")),
 )

61  sites/scrapy.org/static/style/pygments.css  Normal file
@@ -0,0 +1,61 @@
.hll { background-color: #ffffcc }
.c { color: #408090; font-style: italic } /* Comment */
.err { border: 1px solid #FF0000 } /* Error */
.k { color: #007020; font-weight: bold } /* Keyword */
.o { color: #666666 } /* Operator */
.cm { color: #408090; font-style: italic } /* Comment.Multiline */
.cp { color: #007020 } /* Comment.Preproc */
.c1 { color: #408090; font-style: italic } /* Comment.Single */
.cs { color: #408090; background-color: #fff0f0 } /* Comment.Special */
.gd { color: #A00000 } /* Generic.Deleted */
.ge { font-style: italic } /* Generic.Emph */
.gr { color: #FF0000 } /* Generic.Error */
.gh { color: #000080; font-weight: bold } /* Generic.Heading */
.gi { color: #00A000 } /* Generic.Inserted */
.go { color: #303030 } /* Generic.Output */
.gp { color: #c65d09; font-weight: bold } /* Generic.Prompt */
.gs { font-weight: bold } /* Generic.Strong */
.gu { color: #800080; font-weight: bold } /* Generic.Subheading */
.gt { color: #0040D0 } /* Generic.Traceback */
.kc { color: #007020; font-weight: bold } /* Keyword.Constant */
.kd { color: #007020; font-weight: bold } /* Keyword.Declaration */
.kn { color: #007020; font-weight: bold } /* Keyword.Namespace */
.kp { color: #007020 } /* Keyword.Pseudo */
.kr { color: #007020; font-weight: bold } /* Keyword.Reserved */
.kt { color: #902000 } /* Keyword.Type */
.m { color: #208050 } /* Literal.Number */
.s { color: #4070a0 } /* Literal.String */
.na { color: #4070a0 } /* Name.Attribute */
.nb { color: #007020 } /* Name.Builtin */
.nc { color: #0e84b5; font-weight: bold } /* Name.Class */
.no { color: #60add5 } /* Name.Constant */
.nd { color: #555555; font-weight: bold } /* Name.Decorator */
.ni { color: #d55537; font-weight: bold } /* Name.Entity */
.ne { color: #007020 } /* Name.Exception */
.nf { color: #06287e } /* Name.Function */
.nl { color: #002070; font-weight: bold } /* Name.Label */
.nn { color: #0e84b5; font-weight: bold } /* Name.Namespace */
.nt { color: #062873; font-weight: bold } /* Name.Tag */
.nv { color: #bb60d5 } /* Name.Variable */
.ow { color: #007020; font-weight: bold } /* Operator.Word */
.w { color: #bbbbbb } /* Text.Whitespace */
.mf { color: #208050 } /* Literal.Number.Float */
.mh { color: #208050 } /* Literal.Number.Hex */
.mi { color: #208050 } /* Literal.Number.Integer */
.mo { color: #208050 } /* Literal.Number.Oct */
.sb { color: #4070a0 } /* Literal.String.Backtick */
.sc { color: #4070a0 } /* Literal.String.Char */
.sd { color: #4070a0; font-style: italic } /* Literal.String.Doc */
.s2 { color: #4070a0 } /* Literal.String.Double */
.se { color: #4070a0; font-weight: bold } /* Literal.String.Escape */
.sh { color: #4070a0 } /* Literal.String.Heredoc */
.si { color: #70a0d0; font-style: italic } /* Literal.String.Interpol */
.sx { color: #c65d09 } /* Literal.String.Other */
.sr { color: #235388 } /* Literal.String.Regex */
.s1 { color: #4070a0 } /* Literal.String.Single */
.ss { color: #517918 } /* Literal.String.Symbol */
.bp { color: #007020 } /* Name.Builtin.Pseudo */
.vc { color: #bb60d5 } /* Name.Variable.Class */
.vg { color: #bb60d5 } /* Name.Variable.Global */
.vi { color: #bb60d5 } /* Name.Variable.Instance */
.il { color: #208050 } /* Literal.Number.Integer.Long */

42  sites/scrapy.org/templates/base_doc.html  Normal file
@@ -0,0 +1,42 @@
{% extends "base.html" %}

{% block extrastyles %}
<style type="text/css">
  a.headerlink {
    color: #c60f0f;
    font-size: 0.8em;
    padding: 0 4px 0 4px;
    text-decoration: none;
    visibility: hidden;
  }

  h1:hover > a.headerlink,
  h2:hover > a.headerlink,
  h3:hover > a.headerlink,
  h4:hover > a.headerlink,
  h5:hover > a.headerlink,
  h6:hover > a.headerlink,
  dt:hover > a.headerlink {
    visibility: visible;
  }

  a.headerlink:hover {
    background-color: #c60f0f;
    color: white;
  }
</style>
<link href="{{ MEDIA_URL }}/style/pygments.css" rel="stylesheet" type="text/css" media="screen" />
{% endblock %}

{% block content %}
{% include "header.html" %}
<div id="content">
  <div id="left-column">
    {% block main-content %}{% endblock %}
  </div>

  <div id="right-column">
    {% block extra-content %}{% endblock %}
  </div>
</div>
{% endblock %}

37  sites/scrapy.org/templates/docs/doc.html  Normal file
@@ -0,0 +1,37 @@
{% extends "base_doc.html" %}

{% block title %}{{ doc.title|safe }}{% endblock %}

{% block main-content %}
{{ doc.body|safe }}
{% endblock %}

{% block extra-content %}

{% if doc.display_toc %}
<h3>Contents:</h3>
{{ doc.toc|safe }}
{% endif %}

<h3>Browse</h3>
<ul>
  {% if doc.prev %}
  <li>Prev: <a href="{{ doc.prev.link }}">{{ doc.prev.title|safe }}</a></li>
  {% endif %}
  {% if doc.next %}
  <li>Next: <a href="{{ doc.next.link }}">{{ doc.next.title|safe }}</a></li>
  {% endif %}
</ul>


{% if doc.parents %}
{% for p in doc.parents %}
<ul><li><a href="{{ p.link }}">{{ p.title|safe }}</a>
{% endfor %}
<ul><li>{{ doc.title|safe }}</li></ul>
{% for p in doc.parents %}</li></ul>{% endfor %}
</li>
</ul>
{% endif %}

{% endblock %}

@@ -4,7 +4,7 @@
 <li><a href="/">Home</a>
 <li><a href="/code/">Code</a>
 <li><a href="/blog/">Weblog</a>
-<li><a href="/doc/">Documentation</a>
+<li><a href="/docs/">Documentation</a>
 <li class="last"><a href="/download/">Download</a>
 </ul>
 </div>