From 73d8177ecc80d7db8967a103e1ab16826c2ab3b8 Mon Sep 17 00:00:00 2001
From: Daniel Grana
Date: Thu, 12 Feb 2009 04:38:13 +0000
Subject: [PATCH] tutorial: lot of line wrapping and changes to double backticks instead of emphatized words

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40844
---
 .../trunk/docs/intro/tutorial/tutorial1.rst | 26 +++--
 .../trunk/docs/intro/tutorial/tutorial2.rst | 98 ++++++++++++-------
 .../trunk/docs/intro/tutorial/tutorial3.rst | 46 ++++++---
 .../trunk/docs/intro/tutorial/tutorial4.rst |  5 +-
 4 files changed, 118 insertions(+), 57 deletions(-)

diff --git a/scrapy/trunk/docs/intro/tutorial/tutorial1.rst b/scrapy/trunk/docs/intro/tutorial/tutorial1.rst
index 312a432be..cbf3b929a 100644
--- a/scrapy/trunk/docs/intro/tutorial/tutorial1.rst
+++ b/scrapy/trunk/docs/intro/tutorial/tutorial1.rst
@@ -6,15 +6,19 @@ Creating a new project
 
 .. highlight:: sh
 
-In this tutorial, we'll teach you how to scrape http://www.google.com/dirhp Google's web directory.
+In this tutorial, we'll teach you how to scrape http://www.google.com/dirhp,
+Google's web directory.
 
-We'll assume that Scrapy is already installed in your system, if not see :ref:`intro-install`.
+We'll assume that Scrapy is already installed on your system; if not, see
+:ref:`intro-install`.
 
-For starting a new project, enter the directory where you'd like your project to be located, and run::
+To start a new project, enter the directory where you'd like your project
+to be located, and run::
 
   $ scrapy-admin.py startproject google
 
-As long as Scrapy is well installed and the path is set, this will create a ``google`` directory with the following contents::
+As long as Scrapy is properly installed and the path is set, this will create a
+``google`` directory with the following contents::
 
   google/
       scrapy-ctl.py
@@ -30,12 +34,22 @@ As long as Scrapy is well installed and the path is set, this will create a ``go
 
 These are basically:
 
-* ``scrapy-ctl.py``: the project's control script. It's used for running the different tasks (like "genspider", "crawl" and "parse"). We'll talk more about this later.
+* ``scrapy-ctl.py``: the project's control script. It's used for running the
+  different tasks (like "genspider", "crawl" and "parse"). We'll talk more
+  about this later.
+
 * ``google/``: the project's actual python module, you'll import your code from here.
+
 * ``google/items.py``: were you define the different kinds of items you're going to scrape.
+
 * ``google/pipelines.py``: were you define your item pipelines.
+
 * ``google/settings.py``: the project's settings file.
+
 * ``google/spiders/``: directory where you'll later place your spiders.
-* ``google/templates/``: directory containing some templates for newly created spiders, and where you can put your own.
+
+* ``google/templates/``: directory containing some templates for newly created
+  spiders, and where you can put your own.
+
 
 Now you can continue with the next part of the tutorial: :ref:`intro-tutorial2`.
 
diff --git a/scrapy/trunk/docs/intro/tutorial/tutorial2.rst b/scrapy/trunk/docs/intro/tutorial/tutorial2.rst
index b9482c762..30218b4e0 100644
--- a/scrapy/trunk/docs/intro/tutorial/tutorial2.rst
+++ b/scrapy/trunk/docs/intro/tutorial/tutorial2.rst
@@ -4,11 +4,13 @@ Our first spider
 ================
 
-Ok, the time to write our first spider has come. Make sure that you're standing on your project's directory and run::
+Ok, the time to write our first spider has come. Make sure that you're standing
+in your project's directory and run::
 
   ./scrapy-ctl.py genspider google_directory google.com
 
-This should create a file called google_directory.py under the *google/spiders* directory looking like this::
+This should create a file called google_directory.py under the
+``google/spiders`` directory, looking like this::
 
     # -*- coding: utf8 -*-
    import re
 
@@ -36,48 +38,73 @@ This should create a file called google_directory.py under the *google/spiders*
 
     SPIDER = GoogleDirectorySpider()
 
-| Now, let's explain a bit what this is all about.
-| As you may have noticed, the class that represents the spider is GoogleDirectorySpider, and it inherits from CrawlSpider.
-| This means that this spider will crawl over a website given some crawling rules, and parse the response you need according to your patterns, which are defined through the "rules" class attribute.
-| This attribute is nothing else but a tuple containing Rule objects. Each Rule defines a specific behaviour the spider will have while crawling the site.
-| Rule objects accept the following parameters (the ones between [ ] are optional):
+Now, let's explain a bit what this is all about.
 
-* *link_extractor* - A LinkExtractor instance, which defines the crawling patterns for this Rule.
-* *[callback]* - A callback to be called for each link extracted matching the previous link extractor.
-* *[cb_kwargs]* - A dictionary of keyword arguments to be passed to the provided callback.
-* *[follow]* - A boolean that determines if links are going to be extracted from responses matching this Rule or not.
-* *[process_links]* - An optional callback for parsing the extracted links.
+As you may have noticed, the class that represents the spider is
+GoogleDirectorySpider, and it inherits from CrawlSpider.
 
-| In this case, the Rule would mean something like "search for any url containing the string 'Items/', parse it with the 'parse_item' method, and try to extract more links from it".
-| Now, that's an example, so we must make our own Rule for our own spider.
-| But before that, we must set our start_urls to our real entry point (which is not actually Google's homepage).
+This means that this spider will crawl over a website given some crawling
+rules, and parse the response you need according to your patterns, which are
+defined through the "rules" class attribute.
+
+This attribute is nothing else but a tuple containing Rule objects. Each Rule
+defines a specific behaviour the spider will have while crawling the site.
+
+Rule objects accept these parameters (all optional except ``link_extractor``):
+
+* ``link_extractor`` - A LinkExtractor instance, which defines the crawling
+  patterns for this Rule.
+
+* ``callback`` - A callback to be called for each link extracted matching the
+  previous link extractor.
+
+* ``cb_kwargs`` - A dictionary of keyword arguments to be passed to the
+  provided callback.
+
+* ``follow`` - A boolean that determines if links are going to be extracted
+  from responses matching this Rule or not.
+
+* ``process_links`` - An optional callback for parsing the extracted links.
+
+In this case, the Rule would mean something like "search for any url containing
+the string 'Items/', parse it with the 'parse_item' method, and try to extract
+more links from it".
+
+Now, that's an example, so we must make our own Rule for our own spider.
+
+But before that, we must set our start_urls to our real entry point (which is
+not actually Google's homepage). So we replace that line with::
 
     start_urls = ['http://www.google.com/dirhp']
 
-Now it's the moment to surf that page, and see how can we do to extract data from it.
-For this task is almost mandatory that you have Firefox FireBug extension, which allows you to browse through HTML markup in an easy and comfortable way. Otherwise you'd have
-to search for tags manually through the body, which can be *very* tedious.
+Now it's the moment to surf that page, and see how we can extract data
+from it.
 
-|
-|
-|
+For this task, it's almost mandatory to have the Firefox FireBug extension,
+which allows you to browse through HTML markup in an easy and comfortable way.
+Otherwise you'd have to search for tags manually through the body, which can be
+*very* tedious.
 
 .. image:: scrot1.png
 
-What we see at first sight, is that the directory is divided in categories, which are also divided in subcategories.
-However, it seems as if there are more subcategories than the ones being shown in this page, so we'll keep looking...
+What we see at first sight is that the directory is divided in categories,
+which are also divided in subcategories.
 
-|
-|
-|
+However, it seems as if there are more subcategories than the ones being shown
+in this page, so we'll keep looking...
 
 .. image:: scrot2.png
 
-| Right, this looks much more interesting. Not only subcategories themselves have more subcategories, but they have links to websites (which is in fact the purpose of the directory).
-| Now, there's basically one thing to take into account about the previous, and it's the fact that apparently, categories urls are always of the
-  kind *http://www.google.com/Category/Subcategory/Another_Subcategory* (which is not very distinctive actually, but possible to use).
+Right, this looks much more interesting. Not only subcategories themselves have
+more subcategories, but they have links to websites (which is in fact the
+purpose of the directory).
+
+Now, there's basically one thing to take into account about the previous, and
+it's the fact that apparently, category urls are always of the kind
+http://www.google.com/Category/Subcategory/Another_Subcategory (which is not
+very distinctive actually, but possible to use).
 
 So, having said that, a possible rule set for the categories could be::
 
@@ -88,8 +115,12 @@ So, having said that, a possible rule set for the categories could be::
     ),
 )
 
-| Basically, we told our Rule object to extract links that contain the string 'google.com/' plus any capital letter, plus any letter, the '_' character or the '/'.
-| Also, we set our callback 'parse_category' for each of those crawled links, and decided to extract more links from them with follow=True.
+Basically, we told our Rule object to extract links that contain the string
+'google.com/' plus any capital letter, plus any letter, the '_' character or
+the '/'.
+
+Also, we set our callback 'parse_category' for each of those crawled links, and
+decided to extract more links from them with follow=True.
 
 Until now, our spider would look something like::
 
@@ -120,6 +151,7 @@ You can try crawling with this little code, by running::
 
   ./scrapy-ctl.py crawl google.com
 
-and it will actually work, altough it won't do any parsing, since parse_category is not defined, and that's exactly what we're going to do in the next part of
-the tutorial: :ref:`intro-tutorial3`.
+and it will actually work, although it won't do any parsing, since
+parse_category is not defined, and that's exactly what we're going to do in the
+next part of the tutorial: :ref:`intro-tutorial3`.
diff --git a/scrapy/trunk/docs/intro/tutorial/tutorial3.rst b/scrapy/trunk/docs/intro/tutorial/tutorial3.rst
index dba6192fb..a387be646 100644
--- a/scrapy/trunk/docs/intro/tutorial/tutorial3.rst
+++ b/scrapy/trunk/docs/intro/tutorial/tutorial3.rst
@@ -34,27 +34,32 @@ name attributes, or anything that identifies the links uniquely), so the
 ranking bars could be a nice reference at the moment of selecting the desired
 area with an XPath expression.
 
-After using FireBug, we can see that each link is inside a *td* tag, which is
-itself inside a *tr* tag that also contains the link's ranking bar (in another
-*td*). So we could find the ranking bar; then from it, find its parent (the
-*tr*), and then finally, the link's *td* (which contains the data we want to
+After using FireBug, we can see that each link is inside a ``td`` tag, which is
+itself inside a ``tr`` tag that also contains the link's ranking bar (in another
+``td``).
+
+So we could find the ranking bar; then from it, find its parent (the ``tr``),
+and then finally, the link's ``td`` (which contains the data we want to
 scrape).
 
 We loaded the page in the Scrapy shell (very useful for doing this), and tried
 an XPath expression in order to find the links, which actually worked.
-Basically, what that expression would mean is, "find any *td* tag who has a
-descendant tag *a* whose *href* attribute contains the string *#pagerank*" (the
-ranking bar's *td* tag), and then "return the *font* tag of each following *td*
-sibling that it has" (the link's *td* tag).
+
+Basically, that expression would look for the ranking bar's ``td`` tag:
+    "find any ``td`` tag that has a descendant tag ``a`` whose ``href``
+    attribute contains the string ``#pagerank``"
+
+and then, the link's ``td`` tag:
+    "return the ``font`` tag of each following ``td`` sibling that it has"
 
 Of course, this may not be the only way to get there (usually there are several
 expressions that get you to the same place), but it's quite good for this case.
-Another approach could be, for example, to find any *font* tags that have that
+Another approach could be, for example, to find any ``font`` tags that have that
 grey colour of the links, but I prefer to use the first one because it wouldn't
 be so strange if there were other tags with the same colour.
 
-Anyway, having said that, a possible *parse_category* could be::
+Anyway, having said that, a possible ``parse_category`` could be::
 
     def parse_category(self, response):
        # The selector we're going to use in order to extract data from the page
@@ -82,37 +87,46 @@ Anyway, having said that, a possible ``parse_category`` could be::
 
 Okay, more new stuff here :) This time, items!
 
+Items
+^^^^^
+
 Items are the objects we use to represent what you scrape (in this case,
 links). Basically, there are two important things about items: attributes, and
 adaptors.
 
+Attributes
+""""""""""
+
 Attributes are nothing else but the places where you store the data you are
 extracting, which in this case are, the name of the linked website, its url,
 and a description. Now, in most cases, you'll have to do certain modifications
 to this data in order to store it (or do whatever you want to do with it), and
 this is done through the adaptors.
 
+Adaptors
+""""""""
+
 Adaptors are basically a list of functions that receive a value, modify it (or
 not), and then return it. In this case we used only two adaptors:
 
-* An extractor (*extract*), which, as you may imagine, extracts the data from
-  the XPath nodes you provide, and returns it in a list.
+* ``extract``, which, as you may imagine, extracts data from the XPath nodes
+  you provide, and returns it as a list.
 
-* *Delist*, which joins the list that the previous adaptor returned into a
+* ``Delist``, which joins the list that the previous adaptor returned into a
   string. This adaptor itself is a class, and this is due to the fact that you
   must specify which delimiter will join the list. That's why we put an
   instance to this adaptor in the list.
 
-* *strip*, which (as you may imagine), does the same as the python strings
+* ``strip``, which (as you may imagine), does the same as the python strings
   strip method. Cleans up extra spaces before and after the provided string.
 
 In this case, we used the same adaptors for every attribute, because we're
 practically doing nothing to the data, just extracting it. But there might be
 situations were certain attributes are handled different than others (in fact,
-it *will* happen once you scrape more complicated sites with more complicated
+it will happen once you scrape more complicated sites with more complicated
 data).
 
-The rest of the code is quite self-explanatory. The *attribute* method sets the
+The rest of the code is quite self-explanatory. The ``attribute`` method sets the
 item's attributes, and the items themselves are put into a list that we'll
 return to Scrapy's engine. One simple (although important) thing to remember
 here is that you must always return a list that contains either items,
diff --git a/scrapy/trunk/docs/intro/tutorial/tutorial4.rst b/scrapy/trunk/docs/intro/tutorial/tutorial4.rst
index a0e835858..d8fac6dd1 100644
--- a/scrapy/trunk/docs/intro/tutorial/tutorial4.rst
+++ b/scrapy/trunk/docs/intro/tutorial/tutorial4.rst
@@ -12,7 +12,7 @@ case, we'll imagine that we want to save this data for storing it in a db
 later, or just to keep it there.
 
 To make it simple, we'll export the scraped items to a CSV file by making use
-of a useful function that Scrapy brings: *items_to_csv*. This simple function
+of a useful function that Scrapy brings: ``items_to_csv``. This simple function
 takes a file descriptor/filename, and a list of items, and writes their
 attributes to that file, in CSV format.
 
@@ -73,4 +73,5 @@ link's name, description, and url to a file called 'scraped_items.csv'::
 
   ./scrapy-ctl.py crawl google.com
 
-This is the end of the tutorial. If you'd like to know more about Scrapy and its use, please read the rest of the documentation.
+This is the end of the tutorial. If you'd like to know more about Scrapy and
+its use, please read the rest of the documentation.
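
A note on the ``items_to_csv`` helper referenced in the tutorial4 hunk above: the
patch only describes its behaviour (it takes a file descriptor or filename plus a
list of items and writes their attributes to that file in CSV format). The
following is only a rough stand-in sketch of that idea using the standard
library, not Scrapy's actual implementation; the field names ``name``, ``url``
and ``description`` are simply the attributes the tutorial's link items use, and
the function name and the assumption that each item exposes those attributes by
name are hypothetical::

    import csv

    def items_to_csv_sketch(path, items, fields=("name", "url", "description")):
        # Write one CSV row per scraped item, one column per attribute,
        # mirroring what the tutorial says items_to_csv does.
        with open(path, "w", newline="") as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow(fields)  # header row
            for item in items:
                # Assumes each item exposes its scraped attributes by name.
                writer.writerow([getattr(item, field, "") for field in fields])

In the tutorial itself you would keep using the real ``items_to_csv`` described
in tutorial4; the sketch is only meant to make the expected CSV layout concrete.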