mirror of https://github.com/scrapy/scrapy.git synced 2025-02-26 22:24:24 +00:00

tutorial: lots of line wrapping and changes to double backticks instead of emphasized words

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40844
Daniel Grana 2009-02-12 04:38:13 +00:00
parent 7c056c620e
commit 73d8177ecc
4 changed files with 118 additions and 57 deletions


@ -6,15 +6,19 @@ Creating a new project
.. highlight:: sh

In this tutorial, we'll teach you how to scrape Google's web directory at
http://www.google.com/dirhp.

We'll assume that Scrapy is already installed on your system; if not, see
:ref:`intro-install`.

To start a new project, enter the directory where you'd like your project
to be located, and run::

    $ scrapy-admin.py startproject google

As long as Scrapy is properly installed and your path is set, this will create
a ``google`` directory with the following contents::

    google/
        scrapy-ctl.py
@ -30,12 +34,22 @@ As long as Scrapy is well installed and the path is set, this will create a ``go
These are basically:

* ``scrapy-ctl.py``: the project's control script. It's used for running the
  different tasks (like "genspider", "crawl" and "parse"). We'll talk more
  about this later.
* ``google/``: the project's actual Python module; you'll import your code from
  here.
* ``google/items.py``: where you define the different kinds of items you're
  going to scrape (see the sketch after this list).
* ``google/pipelines.py``: where you define your item pipelines.
* ``google/settings.py``: the project's settings file.
* ``google/spiders/``: directory where you'll later place your spiders.
* ``google/templates/``: directory containing some templates for newly created
  spiders, and where you can put your own.
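For illustration only, an ``items.py`` declaring the link data scraped later in
this tutorial might look roughly like this. It uses the modern ``scrapy.Item``
API as an assumption; the item classes this 2009-era tutorial actually relies
on are different::

    # Hypothetical sketch of google/items.py -- modern scrapy.Item API, not the
    # item/adaptor classes this version of the tutorial uses.
    import scrapy

    class LinkItem(scrapy.Item):
        name = scrapy.Field()         # name of the linked website
        url = scrapy.Field()          # the link itself
        description = scrapy.Field()  # short description shown in the directory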
Now you can continue with the next part of the tutorial: :ref:`intro-tutorial2`.


@ -4,11 +4,13 @@
Our first spider
================

OK, the time to write our first spider has come. Make sure that you're standing
in your project's directory and run::

    ./scrapy-ctl.py genspider google_directory google.com

This should create a file called google_directory.py under the
``google/spiders`` directory, looking like this::

    # -*- coding: utf8 -*-
    import re
@ -36,48 +38,73 @@ This should create a file called google_directory.py under the *google/spiders*
    SPIDER = GoogleDirectorySpider()

Now, let's explain a bit what this is all about.

As you may have noticed, the class that represents the spider is
GoogleDirectorySpider, and it inherits from CrawlSpider. This means that this
spider will crawl over a website given some crawling rules, and parse the
responses you need according to your patterns, which are defined through the
``rules`` class attribute.

This attribute is nothing else but a tuple containing Rule objects. Each Rule
defines a specific behaviour the spider will have while crawling the site.

Rule objects accept the following parameters (all of them except
``link_extractor`` are optional):

* ``link_extractor`` - a LinkExtractor instance, which defines the crawling
  patterns for this Rule.
* ``callback`` - a callback to be called for each link extracted matching the
  previous link extractor.
* ``cb_kwargs`` - a dictionary of keyword arguments to be passed to the
  provided callback.
* ``follow`` - a boolean that determines whether links are extracted from
  responses matching this Rule or not.
* ``process_links`` - an optional callback for parsing the extracted links.

In this case, the Rule would mean something like "search for any URL containing
the string 'Items/', parse it with the ``parse_item`` method, and try to
extract more links from it".

Now, that's just an example, so we must write our own Rule for our own spider.
But before that, we must set our start_urls to our real entry point (which is
not actually Google's homepage).

So we replace that line with::

    start_urls = ['http://www.google.com/dirhp']
Now it's the moment to surf that page, and see how we can extract data from it.

For this task it's almost mandatory that you have the Firefox FireBug
extension, which allows you to browse through HTML markup in an easy and
comfortable way. Otherwise you'd have to search for tags manually through the
body, which can be *very* tedious.

.. image:: scrot1.png

What we see at first sight is that the directory is divided into categories,
which are also divided into subcategories.

However, it seems as if there are more subcategories than the ones shown on
this page, so we'll keep looking...

.. image:: scrot2.png

Right, this looks much more interesting. Not only do subcategories themselves
have more subcategories, but they also have links to websites (which is in fact
the purpose of the directory).

Now, there's basically one thing to take into account here, and it's the fact
that, apparently, category URLs are always of the kind
http://www.google.com/Category/Subcategory/Another_Subcategory (which is not
very distinctive actually, but possible to use).

So, having said that, a possible rule set for the categories could be::
@ -88,8 +115,12 @@ So, having said that, a possible rule set for the categories could be::
        ),
    )

Basically, we told our Rule object to extract links that contain the string
'google.com/' followed by a capital letter and then any letters, underscores or
slashes. Also, we set our callback 'parse_category' for each of those crawled
links, and decided to extract more links from them with follow=True.
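Since this diff only shows the closing lines of that rule set, here is a rough
sketch of the shape it might take. The import paths below are the modern ones
and the exact link-extractor class is an assumption, so treat this as an
illustration of the rule just described rather than the tutorial's original
code::

    # Hypothetical sketch -- modern import paths, not the 2009 tutorial code.
    from scrapy.spiders import Rule
    from scrapy.linkextractors import LinkExtractor

    rules = (
        # extract links containing 'google.com/' plus a capital letter,
        # followed by letters, underscores or slashes; parse each match with
        # parse_category and keep extracting links from the matched pages
        Rule(LinkExtractor(allow=r'google\.com/[A-Z][A-Za-z_/]+'),
             callback='parse_category',
             follow=True),
    )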

Until now, our spider would look something like::
@ -120,6 +151,7 @@ You can try crawling with this little code, by running::
    ./scrapy-ctl.py crawl google.com
and it will actually work, although it won't do any parsing yet, since
parse_category is not defined; that's exactly what we're going to do in the
next part of the tutorial: :ref:`intro-tutorial3`.


@ -34,27 +34,32 @@ name attributes, or anything that identifies the links uniquely), so the
ranking bars could be a nice reference at the moment of selecting the desired
area with an XPath expression.

After using FireBug, we can see that each link is inside a ``td`` tag, which is
itself inside a ``tr`` tag that also contains the link's ranking bar (in
another ``td``).

So we could find the ranking bar; then from it, find its parent (the ``tr``),
and finally the link's ``td`` (which contains the data we want to scrape).

We loaded the page in the Scrapy shell (very useful for doing this), and tried
an XPath expression in order to find the links, which actually worked.

Basically, that expression looks first for the ranking bar's ``td`` tag ("find
any ``td`` tag that has a descendant ``a`` tag whose ``href`` attribute
contains the string ``#pagerank``"), and then for the link's ``td`` tag
("return the ``font`` tag of each following ``td`` sibling it has").

Of course, this may not be the only way to get there (usually there are several
expressions that get you to the same place), but it's quite good for this case.
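The expression itself isn't visible in this hunk; a plausible reconstruction
matching that description (an assumption, not necessarily the exact expression
the tutorial used), tried in a Scrapy shell session where ``response`` is
already defined, could be::

    # Assumed reconstruction of the XPath described above; response.xpath is
    # the modern selector API, which may differ from the 2009 shell session.
    links = response.xpath(
        '//td[descendant::a[contains(@href, "#pagerank")]]'
        '/following-sibling::td//font'
    )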
Another approach could be, for example, to find any ``font`` tags that have the
grey colour of the links, but I prefer the first one, since it wouldn't be so
strange if other tags shared the same colour (which would break that approach).

Anyway, having said that, a possible ``parse_category`` could be::

    def parse_category(self, response):
        # The selector we're going to use in order to extract data from the page
@ -82,37 +87,46 @@ Anyway, having said that, a possible *parse_category* could be::
Okay, more new stuff here :) This time, items!

Items
^^^^^

Items are the objects we use to represent what you scrape (in this case,
links). Basically, there are two important things about items: attributes and
adaptors.

Attributes
""""""""""

Attributes are nothing else but the places where you store the data you are
extracting, which in this case are the name of the linked website, its URL,
and a description. Now, in most cases, you'll have to make certain
modifications to this data in order to store it (or do whatever you want to do
with it), and this is done through the adaptors.

Adaptors
""""""""

Adaptors are basically a list of functions that receive a value, modify it (or
not), and then return it (a plain-Python sketch of this idea follows the list
below). In this case we used the following adaptors:

* ``extract``, which, as you may imagine, extracts data from the XPath nodes
  you provide, and returns it as a list.
* ``Delist``, which joins the list that the previous adaptor returned into a
  string. This adaptor is itself a class, because you must specify which
  delimiter will join the list; that's why we put an instance of this adaptor
  in the list.
* ``strip``, which (as you may imagine) does the same as the Python string
  ``strip`` method: it cleans up extra spaces before and after the provided
  string.
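As promised above, here is a plain-Python illustration of the adaptor idea (a
list of callables applied in order to a raw extracted value); it is only a
sketch, not Scrapy's actual adaptor machinery::

    def delist(delimiter):
        """Build an adaptor that joins a list of strings with a delimiter."""
        def join(values):
            return delimiter.join(values)
        return join

    def apply_adaptors(value, adaptors):
        # run the value through each adaptor, in order
        for adaptor in adaptors:
            value = adaptor(value)
        return value

    raw = ['  Python  ']                   # e.g. what ``extract`` might return
    adaptors = [delist(' '), str.strip]    # join the list, then strip spaces
    print(apply_adaptors(raw, adaptors))   # prints: Python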
In this case, we used the same adaptors for every attribute, because we're
practically doing nothing to the data, just extracting it. But there might be
situations where certain attributes are handled differently than others (in
fact, it will happen once you scrape more complicated sites with more
complicated data).

The rest of the code is quite self-explanatory. The ``attribute`` method sets
the item's attributes, and the items themselves are put into a list that we'll
return to Scrapy's engine. One simple (although important) thing to remember
here is that you must always return a list that contains either items,


@ -12,7 +12,7 @@ case, we'll imagine that we want to save this data for storing it in a db
later, or just to keep it there.

To make it simple, we'll export the scraped items to a CSV file by making use
of a handy function that Scrapy provides: ``items_to_csv``. This simple
function takes a file descriptor or filename and a list of items, and writes
their attributes to that file in CSV format.
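Since ``items_to_csv`` itself isn't shown in this diff, here is a plain-Python
sketch of the behaviour just described; it is an illustration only, not
Scrapy's implementation, and it assumes the ``name``, ``url`` and
``description`` attributes scraped earlier in the tutorial::

    import csv

    def items_to_csv_sketch(filename_or_file, items):
        """Write each item's attributes to a file in CSV format (illustration
        of what the text says items_to_csv does, not Scrapy's own code)."""
        opened_here = isinstance(filename_or_file, str)
        csv_file = open(filename_or_file, 'w') if opened_here else filename_or_file
        try:
            writer = csv.writer(csv_file)
            for item in items:
                writer.writerow([item.name, item.url, item.description])
        finally:
            if opened_here:
                csv_file.close()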
@ -73,4 +73,5 @@ link's name, description, and url to a file called 'scraped_items.csv'::
    ./scrapy-ctl.py crawl google.com
This is the end of the tutorial. If you'd like to know more about Scrapy and
its use, please read the rest of the documentation.