Mirror of https://github.com/scrapy/scrapy.git (synced 2025-02-26 14:44:08 +00:00)
tutorial: lot of line wrapping and changes to double backticks instead of emphatized words
--HG-- extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40844
This commit is contained in:
parent 7c056c620e
commit 73d8177ecc
@@ -6,15 +6,19 @@ Creating a new project

 .. highlight:: sh

-In this tutorial, we'll teach you how to scrape http://www.google.com/dirhp Google's web directory.
+In this tutorial, we'll teach you how to scrape http://www.google.com/dirhp
+Google's web directory.

-We'll assume that Scrapy is already installed in your system, if not see :ref:`intro-install`.
+We'll assume that Scrapy is already installed in your system, if not see
+:ref:`intro-install`.

-For starting a new project, enter the directory where you'd like your project to be located, and run::
+For starting a new project, enter the directory where you'd like your project
+to be located, and run::

    $ scrapy-admin.py startproject google

-As long as Scrapy is well installed and the path is set, this will create a ``google`` directory with the following contents::
+As long as Scrapy is well installed and the path is set, this will create a
+``google`` directory with the following contents::

    google/
        scrapy-ctl.py
@@ -30,12 +34,22 @@ As long as Scrapy is well installed and the path is set, this will create a ``go

 These are basically:

-* ``scrapy-ctl.py``: the project's control script. It's used for running the different tasks (like "genspider", "crawl" and "parse"). We'll talk more about this later.
+* ``scrapy-ctl.py``: the project's control script. It's used for running the
+  different tasks (like "genspider", "crawl" and "parse"). We'll talk more
+  about this later.

 * ``google/``: the project's actual python module, you'll import your code from here.

 * ``google/items.py``: where you define the different kinds of items you're going to scrape.

 * ``google/pipelines.py``: where you define your item pipelines.

 * ``google/settings.py``: the project's settings file.

 * ``google/spiders/``: directory where you'll later place your spiders.

-* ``google/templates/``: directory containing some templates for newly created spiders, and where you can put your own.
+* ``google/templates/``: directory containing some templates for newly created
+  spiders, and where you can put your own.

 Now you can continue with the next part of the tutorial: :ref:`intro-tutorial2`.
@@ -4,11 +4,13 @@
 Our first spider
 ================

-Ok, the time to write our first spider has come. Make sure that you're standing on your project's directory and run::
+Ok, the time to write our first spider has come. Make sure that you're standing
+in your project's directory and run::

    ./scrapy-ctl.py genspider google_directory google.com

-This should create a file called google_directory.py under the *google/spiders* directory looking like this::
+This should create a file called google_directory.py under the ``google/spiders``
+directory looking like this::

    # -*- coding: utf8 -*-
    import re
@@ -36,48 +38,73 @@ This should create a file called google_directory.py under the *google/spiders*

 SPIDER = GoogleDirectorySpider()

-| Now, let's explain a bit what this is all about.
-| As you may have noticed, the class that represents the spider is GoogleDirectorySpider, and it inherits from CrawlSpider.
-| This means that this spider will crawl over a website given some crawling rules, and parse the response you need according to your patterns, which are defined through the "rules" class attribute.
-| This attribute is nothing else but a tuple containing Rule objects. Each Rule defines a specific behaviour the spider will have while crawling the site.
-| Rule objects accept the following parameters (the ones between [ ] are optional):
+Now, let's explain a bit what this is all about.

-* *link_extractor* - A LinkExtractor instance, which defines the crawling patterns for this Rule.
-* *[callback]* - A callback to be called for each link extracted matching the previous link extractor.
-* *[cb_kwargs]* - A dictionary of keyword arguments to be passed to the provided callback.
-* *[follow]* - A boolean that determines if links are going to be extracted from responses matching this Rule or not.
-* *[process_links]* - An optional callback for parsing the extracted links.
+As you may have noticed, the class that represents the spider is
+GoogleDirectorySpider, and it inherits from CrawlSpider.

-| In this case, the Rule would mean something like "search for any url containing the string 'Items/', parse it with the 'parse_item' method, and try to extract more links from it".
-| Now, that's an example, so we must make our own Rule for our own spider.
-| But before that, we must set our start_urls to our real entry point (which is not actually Google's homepage).
+This means that this spider will crawl over a website given some crawling
+rules, and parse the response you need according to your patterns, which are
+defined through the "rules" class attribute.
+
+This attribute is nothing else but a tuple containing Rule objects. Each Rule
+defines a specific behaviour the spider will have while crawling the site.
+
+Rule objects accept the following parameters (the ones between [ ] are optional):
+
+* ``link_extractor`` - A LinkExtractor instance, which defines the crawling
+  patterns for this Rule.
+
+* ``callback`` - A callback to be called for each link extracted matching the
+  previous link extractor.
+
+* ``cb_kwargs`` - A dictionary of keyword arguments to be passed to the
+  provided callback.
+
+* ``follow`` - A boolean that determines if links are going to be extracted
+  from responses matching this Rule or not.
+
+* ``process_links`` - An optional callback for parsing the extracted links.
+
+In this case, the Rule would mean something like "search for any url containing
+the string 'Items/', parse it with the 'parse_item' method, and try to extract
+more links from it".
+
+Now, that's an example, so we must make our own Rule for our own spider.
+
+But before that, we must set our start_urls to our real entry point (which is
+not actually Google's homepage).

 So we replace that line with::

    start_urls = ['http://www.google.com/dirhp']

-Now it's the moment to surf that page, and see how can we do to extract data from it.
-For this task is almost mandatory that you have Firefox FireBug extension, which allows you to browse through HTML markup in an easy and comfortable way. Otherwise you'd have
-to search for tags manually through the body, which can be *very* tedious.
+Now it's the moment to surf that page, and see how we can extract data
+from it.
+
+For this task it's almost mandatory that you have the Firefox FireBug extension,
+which allows you to browse through HTML markup in an easy and comfortable way.
+Otherwise you'd have to search for tags manually through the body, which can be
+*very* tedious.

 .. image:: scrot1.png

-What we see at first sight, is that the directory is divided in categories, which are also divided in subcategories.
-However, it seems as if there are more subcategories than the ones being shown in this page, so we'll keep looking...
+What we see at first sight is that the directory is divided in categories,
+which are also divided in subcategories.
+
+However, it seems as if there are more subcategories than the ones being shown
+in this page, so we'll keep looking...

 .. image:: scrot2.png

-| Right, this looks much more interesting. Not only subcategories themselves have more subcategories, but they have links to websites (which is in fact the purpose of the directory).
-| Now, there's basically one thing to take into account about the previous, and it's the fact that apparently, categories urls are always of the
-kind *http://www.google.com/Category/Subcategory/Another_Subcategory* (which is not very distinctive actually, but possible to use).
+Right, this looks much more interesting. Not only subcategories themselves have
+more subcategories, but they have links to websites (which is in fact the
+purpose of the directory).
+
+Now, there's basically one thing to take into account about the previous, and
+it's the fact that apparently, category urls are always of the kind
+http://www.google.com/Category/Subcategory/Another_Subcategory (which is not
+very distinctive actually, but possible to use).

 So, having said that, a possible rule set for the categories could be::
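The spider code itself is truncated by this diff, and the ``domain_name``/``SPIDER``
style it uses is long gone. For readers following along with a current Scrapy
release, a minimal sketch of the same idea against today's API
(``scrapy.spiders.CrawlSpider``, ``Rule`` and ``LinkExtractor``) could look like
this — the ``allow`` pattern is only an assumed approximation of the rule
discussed in the next hunk::

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class GoogleDirectorySpider(CrawlSpider):
        # Modern Scrapy identifies spiders by "name" rather than the old
        # "domain_name" attribute used in this tutorial.
        name = 'google_directory'
        allowed_domains = ['google.com']
        start_urls = ['http://www.google.com/dirhp']

        rules = (
            # link_extractor: which links to follow; callback: what to do with
            # each matching response; follow: keep extracting links from them.
            Rule(
                LinkExtractor(allow=r'google\.com/[A-Z][A-Za-z_/]+'),
                callback='parse_category',
                follow=True,
            ),
        )

        def parse_category(self, response):
            # Placeholder; the actual parsing is covered later in the tutorial.
            pass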
@@ -88,8 +115,12 @@ So, having said that, a possible rule set for the categories could be::
        ),
    )

-| Basically, we told our Rule object to extract links that contain the string 'google.com/' plus any capital letter, plus any letter, the '_' character or the '/'.
-| Also, we set our callback 'parse_category' for each of those crawled links, and decided to extract more links from them with follow=True.
+Basically, we told our Rule object to extract links that contain the string
+'google.com/' plus any capital letter, plus any letter, the '_' character or
+the '/'.
+
+Also, we set our callback 'parse_category' for each of those crawled links, and
+decided to extract more links from them with follow=True.

 Until now, our spider would look something like::
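The exact regular expression is not visible in this hunk, but a pattern matching
that description ('google.com/' followed by a capital letter, then letters, '_'
or '/') can be tried out on its own with Python's ``re`` module — treat the
pattern below as an assumption, not the one from the actual rule set::

    import re

    # Assumed approximation of the pattern described above.
    category_url = re.compile(r'google\.com/[A-Z][A-Za-z_/]*')

    print(bool(category_url.search('http://www.google.com/Top/Arts/Music')))  # True
    print(bool(category_url.search('http://www.google.com/dirhp')))           # False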
@@ -120,6 +151,7 @@ You can try crawling with this little code, by running::

    ./scrapy-ctl.py crawl google.com

-and it will actually work, altough it won't do any parsing, since parse_category is not defined, and that's exactly what we're going to do in the next part of
-the tutorial: :ref:`intro-tutorial3`.
+and it will actually work, although it won't do any parsing, since
+parse_category is not defined, and that's exactly what we're going to do in the
+next part of the tutorial: :ref:`intro-tutorial3`.
@@ -34,27 +34,32 @@ name attributes, or anything that identifies the links uniquely), so the
 ranking bars could be a nice reference at the moment of selecting the desired
 area with an XPath expression.

-After using FireBug, we can see that each link is inside a *td* tag, which is
-itself inside a *tr* tag that also contains the link's ranking bar (in another
-*td*). So we could find the ranking bar; then from it, find its parent (the
-*tr*), and then finally, the link's *td* (which contains the data we want to
+After using FireBug, we can see that each link is inside a ``td`` tag, which is
+itself inside a ``tr`` tag that also contains the link's ranking bar (in another
+``td``).
+
+So we could find the ranking bar; then from it, find its parent (the ``tr``),
+and then finally, the link's ``td`` (which contains the data we want to
 scrape).

 We loaded the page in the Scrapy shell (very useful for doing this), and tried
 an XPath expression in order to find the links, which actually worked.
-Basically, what that expression would mean is, "find any *td* tag who has a
-descendant tag *a* whose *href* attribute contains the string *#pagerank*" (the
-ranking bar's *td* tag), and then "return the *font* tag of each following *td*
-sibling that it has" (the link's *td* tag).
+
+Basically, that expression would look for the ranking bar's ``td`` tag:
+"find any ``td`` tag that has a descendant tag ``a`` whose ``href``
+attribute contains the string ``#pagerank``"
+
+and then, the link's ``td`` tag:
+"return the ``font`` tag of each following ``td`` sibling that it has"

 Of course, this may not be the only way to get there (usually there are several
 expressions that get you to the same place), but it's quite good for this case.

-Another approach could be, for example, to find any *font* tags that have that
+Another approach could be, for example, to find any ``font`` tags that have that
 grey colour of the links, but I prefer to use the first one because it wouldn't
 be so strange if there were other tags with the same colour.

-Anyway, having said that, a possible *parse_category* could be::
+Anyway, having said that, a possible ``parse_category`` could be::

    def parse_category(self, response):
        # The selector we're going to use in order to extract data from the page
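The expression itself is not quoted in this hunk, but one XPath that matches the
description above is ``//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font``.
It can also be tried outside Scrapy, for example with ``lxml`` against a
stripped-down imitation of the directory's markup (the HTML below is made up
purely for illustration)::

    from lxml import html

    # A tiny stand-in for the directory page's markup, just to exercise the
    # expression described above; the real page was of course much larger.
    page = html.fromstring("""
    <table>
      <tr>
        <td><a href="#pagerank"><img alt="rank"/></a></td>
        <td><font><a href="http://example.com/">Example site</a> - A description.</font></td>
      </tr>
    </table>
    """)

    # "find any td that has a descendant a whose href contains '#pagerank',
    # then return the font tag of each following td sibling"
    links = page.xpath(
        '//td[descendant::a[contains(@href, "#pagerank")]]'
        '/following-sibling::td/font'
    )
    for font in links:
        print(font.text_content().strip())  # Example site - A description.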
@@ -82,37 +87,46 @@ Anyway, having said that, a possible *parse_category* could be::

 Okay, more new stuff here :) This time, items!

 Items
 ^^^^^

 Items are the objects we use to represent what you scrape (in this case,
 links). Basically, there are two important things about items: attributes, and
 adaptors.

 Attributes
 """"""""""

 Attributes are nothing else but the places where you store the data you are
 extracting, which in this case are, the name of the linked website, its url,
 and a description. Now, in most cases, you'll have to do certain modifications
 to this data in order to store it (or do whatever you want to do with it), and
 this is done through the adaptors.

 Adaptors
 """"""""

 Adaptors are basically a list of functions that receive a value, modify it (or
 not), and then return it. In this case we used only two adaptors:

-* An extractor (*extract*), which, as you may imagine, extracts the data from
-  the XPath nodes you provide, and returns it in a list.
+* ``extract``, which, as you may imagine, extracts data from the XPath nodes
+  you provide, and returns it as a list.

-* *Delist*, which joins the list that the previous adaptor returned into a
+* ``Delist``, which joins the list that the previous adaptor returned into a
  string. This adaptor itself is a class, and this is due to the fact that you
  must specify which delimiter will join the list. That's why we put an
  instance of this adaptor in the list.

-* *strip*, which (as you may imagine), does the same as the python strings
+* ``strip``, which (as you may imagine), does the same as the python strings
  strip method. Cleans up extra spaces before and after the provided string.

 In this case, we used the same adaptors for every attribute, because we're
 practically doing nothing to the data, just extracting it. But there might be
 situations where certain attributes are handled differently than others (in fact,
-it *will* happen once you scrape more complicated sites with more complicated
+it will happen once you scrape more complicated sites with more complicated
 data).

-The rest of the code is quite self-explanatory. The *attribute* method sets the
+The rest of the code is quite self-explanatory. The ``attribute`` method sets the
 item's attributes, and the items themselves are put into a list that we'll
 return to Scrapy's engine. One simple (although important) thing to remember
 here is that you must always return a list that contains either items,
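That adaptor API is specific to this early version of Scrapy (it was later
replaced by item loaders and processors), but the underlying idea — a value
flowing through a short chain of small callables — is plain Python and can be
sketched without Scrapy at all; the names below are made up for illustration::

    class Delist:
        """Joins a list of strings using the delimiter given at construction."""
        def __init__(self, delimiter=''):
            self.delimiter = delimiter

        def __call__(self, value):
            return self.delimiter.join(value)


    def apply_adaptors(adaptors, value):
        # Each adaptor receives the previous adaptor's output.
        for adaptor in adaptors:
            value = adaptor(value)
        return value


    adaptors = [Delist(' '), str.strip]
    print(apply_adaptors(adaptors, ['  A scraped', 'description  ']))
    # -> "A scraped description"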
@@ -12,7 +12,7 @@ case, we'll imagine that we want to save this data for storing it in a db
 later, or just to keep it there.

 To make it simple, we'll export the scraped items to a CSV file by making use
-of a useful function that Scrapy brings: *items_to_csv*. This simple function
+of a useful function that Scrapy brings: ``items_to_csv``. This simple function
 takes a file descriptor/filename, and a list of items, and writes their
 attributes to that file, in CSV format.
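For a sense of what that amounts to, here is a rough stand-in built on the
standard ``csv`` module — the field names and dict-shaped items are assumptions
made for the sake of the example, not Scrapy's actual implementation::

    import csv

    def items_to_csv_sketch(filename, items, fields=('name', 'url', 'description')):
        # Writes one CSV row per item, one column per attribute.
        with open(filename, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=list(fields))
            writer.writeheader()
            for item in items:
                writer.writerow({field: item.get(field, '') for field in fields})

    items_to_csv_sketch('scraped_items.csv', [
        {'name': 'Example site',
         'url': 'http://example.com/',
         'description': 'A description.'},
    ])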
@@ -73,4 +73,5 @@ link's name, description, and url to a file called 'scraped_items.csv'::

    ./scrapy-ctl.py crawl google.com

-This is the end of the tutorial. If you'd like to know more about Scrapy and its use, please read the rest of the documentation.
+This is the end of the tutorial. If you'd like to know more about Scrapy and
+its use, please read the rest of the documentation.