
tutorial: lots of line wrapping and changes to double backticks instead of emphasized words

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40844
Daniel Grana 2009-02-12 04:38:13 +00:00
parent 7c056c620e
commit 73d8177ecc
4 changed files with 118 additions and 57 deletions


@ -6,15 +6,19 @@ Creating a new project
.. highlight:: sh
In this tutorial, we'll teach you how to scrape http://www.google.com/dirhp,
Google's web directory.
We'll assume that Scrapy is already installed on your system; if not, see
:ref:`intro-install`.
For starting a new project, enter the directory where you'd like your project
to be located, and run::
$ scrapy-admin.py startproject google
As long as Scrapy is properly installed and on your path, this will create a
``google`` directory with the following contents::
google/
scrapy-ctl.py
@ -30,12 +34,22 @@ As long as Scrapy is well installed and the path is set, this will create a ``go
These are basically:
* ``scrapy-ctl.py``: the project's control script. It's used for running the
different tasks (like "genspider", "crawl" and "parse"). We'll talk more
about this later.
* ``google/``: the project's actual python module, you'll import your code from here.
* ``google/items.py``: where you define the different kinds of items you're going to scrape.
* ``google/pipelines.py``: where you define your item pipelines.
* ``google/settings.py``: the project's settings file.
* ``google/spiders/``: directory where you'll later place your spiders.
* ``google/templates/``: directory containing some templates for newly created
spiders, and where you can put your own.
Now you can continue with the next part of the tutorial: :ref:`intro-tutorial2`.


@ -4,11 +4,13 @@
Our first spider
================
Ok, the time to write our first spider has come. Make sure that you're
standing in your project's directory and run::
./scrapy-ctl.py genspider google_directory google.com
This should create a file called ``google_directory.py`` under the
``google/spiders`` directory, looking like this::
# -*- coding: utf8 -*-
import re
@ -36,48 +38,73 @@ This should create a file called google_directory.py under the *google/spiders*
SPIDER = GoogleDirectorySpider()
Now, let's explain a bit what this is all about.
As you may have noticed, the class that represents the spider is
``GoogleDirectorySpider``, and it inherits from ``CrawlSpider``.
This means that this spider will crawl a website following some crawling
rules, and parse the responses you need according to your patterns, which are
defined through the ``rules`` class attribute.
This attribute is nothing else but a tuple of Rule objects. Each Rule defines
a specific behaviour the spider will have while crawling the site.
Rule objects accept the following parameters (all except ``link_extractor``
are optional):
* ``link_extractor`` - A LinkExtractor instance, which defines the crawling
  patterns for this Rule.
* ``callback`` - A callback to be called for each extracted link matching the
  previous link extractor.
* ``cb_kwargs`` - A dictionary of keyword arguments to be passed to the
  provided callback.
* ``follow`` - A boolean that determines whether links will be extracted from
  responses matching this Rule.
* ``process_links`` - An optional callback for parsing the extracted links.
In this case, the Rule would mean something like "search for any url containing
the string 'Items/', parse it with the 'parse_item' method, and try to extract
more links from it".
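Put together, a rule along those lines would look roughly like this (just a
sketch: the exact link extractor class and its import path depend on the
Scrapy version you're running, so treat ``RegexLinkExtractor`` as an assumed
name)::

rules = (
    # RegexLinkExtractor is an assumed class name; adjust it to whatever
    # link extractor your Scrapy version provides.
    # Follow urls containing 'Items/' and parse each matching response
    # with the parse_item callback.
    Rule(RegexLinkExtractor(allow=(r'Items/', )),
         callback='parse_item',
         follow=True),
)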
Now, that's an example, so we must make our own Rule for our own spider.
But before that, we must set our ``start_urls`` to our real entry point (which
is not actually Google's homepage).
So we replace that line with::
start_urls = ['http://www.google.com/dirhp']
Now it's the moment to surf that page, and see how we can extract data from
it.
For this task it's almost mandatory that you have the Firefox FireBug
extension, which allows you to browse through HTML markup in an easy and
comfortable way. Otherwise you'd have to search for tags manually through the
body, which can be *very* tedious.
.. image:: scrot1.png
What we see at first sight is that the directory is divided into categories,
which are in turn divided into subcategories.
However, it seems as if there are more subcategories than the ones shown on
this page, so we'll keep looking...
.. image:: scrot2.png
Right, this looks much more interesting. Not only subcategories themselves have
more subcategories, but they have links to websites (which is in fact the
purpose of the directory).
Now, there's basically one thing to take into account about the previous
screenshot: apparently, category urls are always of the form
http://www.google.com/Category/Subcategory/Another_Subcategory (which is not
very distinctive actually, but usable).
So, having said that, a possible rule set for the categories could be::
@ -88,8 +115,12 @@ So, having said that, a possible rule set for the categories could be::
),
)
Basically, we told our Rule object to extract links that contain the string
'google.com/' followed by any capital letter, and then any mix of letters, the
'_' character or '/'.
Also, we set our callback ``parse_category`` for each of those crawled links,
and decided to extract more links from them with ``follow=True``.
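If you want to convince yourself that a pattern like that matches category
urls, you can try a regex along these lines on its own (an illustrative guess
based on the description above, not necessarily the exact pattern used in the
rule set)::

import re

# 'google.com/' followed by a capital letter and then any mix of letters,
# underscores or slashes: the category url shape described above.
category_url = re.compile(r'google\.com/[A-Z][a-zA-Z_/]*')

print(bool(category_url.search('http://www.google.com/Top/Arts/Music/')))  # True
print(bool(category_url.search('http://www.google.com/dirhp')))            # False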
Until now, our spider would look something like::
@ -120,6 +151,7 @@ You can try crawling with this little code, by running::
./scrapy-ctl.py crawl google.com
and it will actually work, although it won't do any parsing, since
``parse_category`` is not defined, and that's exactly what we're going to do in
the next part of the tutorial: :ref:`intro-tutorial3`.


@ -34,27 +34,32 @@ name attributes, or anything that identifies the links uniquely), so the
ranking bars could be a nice reference at the moment of selecting the desired
area with an XPath expression.
After using FireBug, we can see that each link is inside a ``td`` tag, which is
itself inside a ``tr`` tag that also contains the link's ranking bar (in another
``td``).
So we could find the ranking bar; then from it, find its parent (the ``tr``),
and then finally, the link's ``td`` (which contains the data we want to
scrape).
We loaded the page in the Scrapy shell (very useful for doing this), and tried
an XPath expression in order to find the links, which actually worked.
Basically, that expression first looks for the ranking bar's ``td`` tag:
"find any ``td`` tag that has a descendant ``a`` tag whose ``href``
attribute contains the string ``#pagerank``"
and then for the link's ``td`` tag:
"return the ``font`` tag of each following ``td`` sibling that it has"
Of course, this may not be the only way to get there (usually there are several
expressions that get you to the same place), but it's quite good for this case.
Another approach could be, for example, to find any ``font`` tags that have that
grey colour of the links, but I prefer to use the first one because it wouldn't
be so strange if there were other tags with the same colour.
Anyway, having said that, a possible ``parse_category`` could be::
def parse_category(self, response):
# The selector we're going to use in order to extract data from the page
@ -82,37 +87,46 @@ Anyway, having said that, a possible *parse_category* could be::
Okay, more new stuff here :) This time, items!
Items
^^^^^
Items are the objects we use to represent what you scrape (in this case,
links). Basically, there are two important things about items: attributes, and
adaptors.
Attributes
""""""""""
Attributes are nothing else but the places where you store the data you are
extracting, which in this case are the name of the linked website, its url,
and a description. Now, in most cases, you'll have to do certain modifications
to this data in order to store it (or do whatever you want to do with it), and
this is done through the adaptors.
Adaptors
""""""""
Adaptors are basically a list of functions that receive a value, modify it (or
not), and then return it; there's a small conceptual sketch right after this
list. In this case we used only three adaptors:
* ``extract``, which, as you may imagine, extracts data from the XPath nodes
you provide, and returns it as a list.
* ``Delist``, which joins the list that the previous adaptor returned into a
  string. This adaptor is itself a class, because you must specify which
  delimiter will join the list. That's why we put an instance of this adaptor
  in the list.
* ``strip``, which (as you may imagine) does the same as the Python string
  ``strip`` method: it cleans up extra spaces before and after the provided
  string.
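To make the idea concrete, here's the conceptual sketch mentioned above:
applying a chain of adaptors is just running a value through each function in
turn (an illustration of the concept only, not Scrapy's actual adaptor
machinery)::

def apply_adaptors(adaptors, value):
    # Feed the value through each adaptor, passing each result on to
    # the next function in the chain.
    for adaptor in adaptors:
        value = adaptor(value)
    return value

# Roughly what Delist (joining with a delimiter) followed by strip
# (cleaning up whitespace) would do to an extracted list of strings:
adaptors = [' '.join, lambda s: s.strip()]
print(apply_adaptors(adaptors, ['  Python  ', 'directory link ']))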
In this case, we used the same adaptors for every attribute, because we're
practically doing nothing to the data, just extracting it. But there might be
situations where certain attributes are handled differently than others (in
fact, it will happen once you scrape more complicated sites with more
complicated data).
The rest of the code is quite self-explanatory. The ``attribute`` method sets the
item's attributes, and the items themselves are put into a list that we'll
return to Scrapy's engine. One simple (although important) thing to remember
here is that you must always return a list that contains either items,


@ -12,7 +12,7 @@ case, we'll imagine that we want to save this data for storing it in a db
later, or just to keep it there.
To make it simple, we'll export the scraped items to a CSV file by making use
of a useful function that Scrapy brings: ``items_to_csv``. This simple function
takes a file descriptor/filename, and a list of items, and writes their
attributes to that file, in CSV format.
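For example, a call to it could look something like this (the import path is
an assumption; check where ``items_to_csv`` actually lives in your Scrapy
version)::

# Assumed import location for items_to_csv.
from scrapy.utils.misc import items_to_csv

items = []  # the list of scraped items built in parse_category
items_to_csv('scraped_items.csv', items)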
@ -73,4 +73,5 @@ link's name, description, and url to a file called 'scraped_items.csv'::
./scrapy-ctl.py crawl google.com
This is the end of the tutorial. If you'd like to know more about Scrapy and
its use, please read the rest of the documentation.