mirror of https://github.com/scrapy/scrapy.git synced 2025-02-26 22:24:24 +00:00

tutorial: lots of line wrapping and changes to double backticks instead of emphasized words

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40844
Daniel Grana 2009-02-12 04:38:13 +00:00
parent 7c056c620e
commit 73d8177ecc
4 changed files with 118 additions and 57 deletions


@ -6,15 +6,19 @@ Creating a new project
.. highlight:: sh

In this tutorial, we'll teach you how to scrape Google's web directory at
http://www.google.com/dirhp.

We'll assume that Scrapy is already installed on your system; if not, see
:ref:`intro-install`.

To start a new project, enter the directory where you'd like your project
to be located, and run::

    $ scrapy-admin.py startproject google

As long as Scrapy is properly installed and your path is set, this will create
a ``google`` directory with the following contents::

    google/
        scrapy-ctl.py
@ -30,12 +34,22 @@ As long as Scrapy is well installed and the path is set, this will create a ``go
These are basically:

* ``scrapy-ctl.py``: the project's control script. It's used for running the
  different tasks (like "genspider", "crawl" and "parse"). We'll talk more
  about this later.
* ``google/``: the project's actual Python module; you'll import your code from
  here.
* ``google/items.py``: where you define the different kinds of items you're
  going to scrape (see the sketch after this list).
* ``google/pipelines.py``: where you define your item pipelines.
* ``google/settings.py``: the project's settings file.
* ``google/spiders/``: directory where you'll later place your spiders.
* ``google/templates/``: directory containing some templates for newly created
  spiders, and where you can put your own.
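For illustration only, an ``items.py`` declaring the link data scraped later in
this tutorial might look roughly like this. It uses the modern ``scrapy.Item``
API as an assumption; the item classes this 2009-era tutorial actually relies
on are different::

    # Hypothetical sketch of google/items.py -- modern scrapy.Item API, not the
    # item/adaptor classes this version of the tutorial uses.
    import scrapy

    class LinkItem(scrapy.Item):
        name = scrapy.Field()         # name of the linked website
        url = scrapy.Field()          # the link itself
        description = scrapy.Field()  # short description shown in the directory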
Now you can continue with the next part of the tutorial: :ref:`intro-tutorial2`.


@ -4,11 +4,13 @@
Our first spider
================

OK, the time to write our first spider has come. Make sure that you're standing
in your project's directory and run::

    ./scrapy-ctl.py genspider google_directory google.com

This should create a file called google_directory.py under the
``google/spiders`` directory, looking like this::

    # -*- coding: utf8 -*-
    import re
@ -36,48 +38,73 @@ This should create a file called google_directory.py under the *google/spiders*
    SPIDER = GoogleDirectorySpider()

Now, let's explain a bit what this is all about.

As you may have noticed, the class that represents the spider is
GoogleDirectorySpider, and it inherits from CrawlSpider. This means that this
spider will crawl over a website given some crawling rules, and parse the
responses you need according to your patterns, which are defined through the
``rules`` class attribute.

This attribute is nothing else but a tuple containing Rule objects. Each Rule
defines a specific behaviour the spider will have while crawling the site.

Rule objects accept the following parameters (all of them except
``link_extractor`` are optional):

* ``link_extractor`` - a LinkExtractor instance, which defines the crawling
  patterns for this Rule.
* ``callback`` - a callback to be called for each link extracted matching the
  previous link extractor.
* ``cb_kwargs`` - a dictionary of keyword arguments to be passed to the
  provided callback.
* ``follow`` - a boolean that determines whether links are extracted from
  responses matching this Rule or not.
* ``process_links`` - an optional callback for parsing the extracted links.

In this case, the Rule would mean something like "search for any URL containing
the string 'Items/', parse it with the ``parse_item`` method, and try to
extract more links from it".

Now, that's just an example, so we must write our own Rule for our own spider.
But before that, we must set our start_urls to our real entry point (which is
not actually Google's homepage).

So we replace that line with::

    start_urls = ['http://www.google.com/dirhp']
Now it's the moment to surf that page, and see how we can extract data from it.

For this task it's almost mandatory that you have the Firefox FireBug
extension, which allows you to browse through HTML markup in an easy and
comfortable way. Otherwise you'd have to search for tags manually through the
body, which can be *very* tedious.

.. image:: scrot1.png

What we see at first sight is that the directory is divided into categories,
which are also divided into subcategories.

However, it seems as if there are more subcategories than the ones shown on
this page, so we'll keep looking...

.. image:: scrot2.png

Right, this looks much more interesting. Not only do subcategories themselves
have more subcategories, but they also have links to websites (which is in fact
the purpose of the directory).

Now, there's basically one thing to take into account here, and it's the fact
that, apparently, category URLs are always of the kind
http://www.google.com/Category/Subcategory/Another_Subcategory (which is not
very distinctive actually, but possible to use).

So, having said that, a possible rule set for the categories could be::
@ -88,8 +115,12 @@ So, having said that, a possible rule set for the categories could be::
        ),
    )

Basically, we told our Rule object to extract links that contain the string
'google.com/' followed by a capital letter and then any letters, underscores or
slashes. Also, we set our callback 'parse_category' for each of those crawled
links, and decided to extract more links from them with follow=True.
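Since this diff only shows the closing lines of that rule set, here is a rough
sketch of the shape it might take. The import paths below are the modern ones
and the exact link-extractor class is an assumption, so treat this as an
illustration of the rule just described rather than the tutorial's original
code::

    # Hypothetical sketch -- modern import paths, not the 2009 tutorial code.
    from scrapy.spiders import Rule
    from scrapy.linkextractors import LinkExtractor

    rules = (
        # extract links containing 'google.com/' plus a capital letter,
        # followed by letters, underscores or slashes; parse each match with
        # parse_category and keep extracting links from the matched pages
        Rule(LinkExtractor(allow=r'google\.com/[A-Z][A-Za-z_/]+'),
             callback='parse_category',
             follow=True),
    )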

Until now, our spider would look something like::
@ -120,6 +151,7 @@ You can try crawling with this little code, by running::
    ./scrapy-ctl.py crawl google.com
and it will actually work, although it won't do any parsing yet, since
parse_category is not defined; that's exactly what we're going to do in the
next part of the tutorial: :ref:`intro-tutorial3`.


@ -34,27 +34,32 @@ name attributes, or anything that identifies the links uniquely), so the
ranking bars could be a nice reference at the moment of selecting the desired
area with an XPath expression.

After using FireBug, we can see that each link is inside a ``td`` tag, which is
itself inside a ``tr`` tag that also contains the link's ranking bar (in
another ``td``).

So we could find the ranking bar; then from it, find its parent (the ``tr``),
and finally the link's ``td`` (which contains the data we want to scrape).

We loaded the page in the Scrapy shell (very useful for doing this), and tried
an XPath expression in order to find the links, which actually worked.

Basically, that expression looks first for the ranking bar's ``td`` tag ("find
any ``td`` tag that has a descendant ``a`` tag whose ``href`` attribute
contains the string ``#pagerank``"), and then for the link's ``td`` tag
("return the ``font`` tag of each following ``td`` sibling it has").

Of course, this may not be the only way to get there (usually there are several
expressions that get you to the same place), but it's quite good for this case.
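The expression itself isn't visible in this hunk; a plausible reconstruction
matching that description (an assumption, not necessarily the exact expression
the tutorial used), tried in a Scrapy shell session where ``response`` is
already defined, could be::

    # Assumed reconstruction of the XPath described above; response.xpath is
    # the modern selector API, which may differ from the 2009 shell session.
    links = response.xpath(
        '//td[descendant::a[contains(@href, "#pagerank")]]'
        '/following-sibling::td//font'
    )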
Another approach could be, for example, to find any ``font`` tags that have the
grey colour of the links, but I prefer the first one, since it wouldn't be so
strange if other tags shared the same colour (which would break that approach).

Anyway, having said that, a possible ``parse_category`` could be::

    def parse_category(self, response):
        # The selector we're going to use in order to extract data from the page
@ -82,37 +87,46 @@ Anyway, having said that, a possible *parse_category* could be::
Okay, more new stuff here :) This time, items!

Items
^^^^^

Items are the objects we use to represent what you scrape (in this case,
links). Basically, there are two important things about items: attributes and
adaptors.

Attributes
""""""""""

Attributes are nothing else but the places where you store the data you are
extracting, which in this case are the name of the linked website, its URL,
and a description. Now, in most cases, you'll have to make certain
modifications to this data in order to store it (or do whatever you want to do
with it), and this is done through the adaptors.

Adaptors
""""""""

Adaptors are basically a list of functions that receive a value, modify it (or
not), and then return it (a plain-Python sketch of this idea follows the list
below). In this case we used the following adaptors:

* ``extract``, which, as you may imagine, extracts data from the XPath nodes
  you provide, and returns it as a list.
* ``Delist``, which joins the list that the previous adaptor returned into a
  string. This adaptor is itself a class, because you must specify which
  delimiter will join the list; that's why we put an instance of this adaptor
  in the list.
* ``strip``, which (as you may imagine) does the same as the Python string
  ``strip`` method: it cleans up extra spaces before and after the provided
  string.
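As promised above, here is a plain-Python illustration of the adaptor idea (a
list of callables applied in order to a raw extracted value); it is only a
sketch, not Scrapy's actual adaptor machinery::

    def delist(delimiter):
        """Build an adaptor that joins a list of strings with a delimiter."""
        def join(values):
            return delimiter.join(values)
        return join

    def apply_adaptors(value, adaptors):
        # run the value through each adaptor, in order
        for adaptor in adaptors:
            value = adaptor(value)
        return value

    raw = ['  Python  ']                   # e.g. what ``extract`` might return
    adaptors = [delist(' '), str.strip]    # join the list, then strip spaces
    print(apply_adaptors(raw, adaptors))   # prints: Python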
In this case, we used the same adaptors for every attribute, because we're
practically doing nothing to the data, just extracting it. But there might be
situations where certain attributes are handled differently than others (in
fact, it will happen once you scrape more complicated sites with more
complicated data).

The rest of the code is quite self-explanatory. The ``attribute`` method sets
the item's attributes, and the items themselves are put into a list that we'll
return to Scrapy's engine. One simple (although important) thing to remember
here is that you must always return a list that contains either items,


@ -12,7 +12,7 @@ case, we'll imagine that we want to save this data for storing it in a db
later, or just to keep it there.

To make it simple, we'll export the scraped items to a CSV file by making use
of a handy function that Scrapy provides: ``items_to_csv``. This simple
function takes a file descriptor or filename and a list of items, and writes
their attributes to that file in CSV format.
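Since ``items_to_csv`` itself isn't shown in this diff, here is a plain-Python
sketch of the behaviour just described; it is an illustration only, not
Scrapy's implementation, and it assumes the ``name``, ``url`` and
``description`` attributes scraped earlier in the tutorial::

    import csv

    def items_to_csv_sketch(filename_or_file, items):
        """Write each item's attributes to a file in CSV format (illustration
        of what the text says items_to_csv does, not Scrapy's own code)."""
        opened_here = isinstance(filename_or_file, str)
        csv_file = open(filename_or_file, 'w') if opened_here else filename_or_file
        try:
            writer = csv.writer(csv_file)
            for item in items:
                writer.writerow([item.name, item.url, item.description])
        finally:
            if opened_here:
                csv_file.close()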
@ -73,4 +73,5 @@ link's name, description, and url to a file called 'scraped_items.csv'::
    ./scrapy-ctl.py crawl google.com
This is the end of the tutorial. If you'd like to know more about Scrapy and
its use, please read the rest of the documentation.