
formatting changes and references to spiders added in the tutorial

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40809
Ismael Carnales 2009-01-30 19:14:16 +00:00
parent 2526e11be0
commit 7e6c9c2e25


@@ -43,16 +43,16 @@ These are basically:
* ``dmoz/spiders/``: a directory where you'll later put your spiders.
* ``dmoz/templates/``: directory containing the spider's templates.
-The use of this files will be clarified throughout the tutorial, now let's go into
-spiders.
+The use of this files will be clarified throughout the tutorial, now let's go
+into spiders.
Spiders
=======
Spiders are custom modules written by you, the user, to scrape information from
-a certain domain. Their duty is to feed the Scrapy engine with URLs to
-download, and then parse the downloaded contents in the search for data or more
-URLs to follow.
+a certain domain. Their duty is to feed the Scrapy engine with URLs to download,
+and then parse the downloaded contents in the search for data or more URLs to
+follow.
They are the heart of a Scrapy project and where most part of the action takes
place.
@@ -60,7 +60,6 @@ place.
To create our first spider, save this code in a file named ``dmoz_spider.py``
inside ``dmoz/spiders`` folder::
from scrapy.spider import BaseSpider
class OpenDirectorySpider(BaseSpider):
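The hunk above shows only the first lines of that file. For readers following
along, here is a hedged sketch of what a complete ``dmoz_spider.py`` of this
era might look like, built around the ``domain_name`` and ``start_urls``
attributes described below; the start URLs, the ``parse()`` body, and the
module-level ``SPIDER`` instance are illustrative assumptions, not the exact
code of this revision::

    from scrapy.spider import BaseSpider

    class OpenDirectorySpider(BaseSpider):
        # ``domain_name`` identifies the spider; it must be unique per project.
        domain_name = "dmoz.org"

        # The crawl begins from these pages (illustrative URLs, assumed here).
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]

        def parse(self, response):
            # Minimal callback: save each downloaded page to a local file.
            # A real spider would extract data and/or more URLs to follow here.
            filename = response.url.split("/")[-2]
            open(filename, "wb").write(response.body)
            return []

    # Assumption about this revision's loading convention: spiders were
    # exposed as a module-level instance so Scrapy could discover them.
    SPIDER = OpenDirectorySpider()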
@@ -82,15 +81,15 @@ inside ``dmoz/spiders`` folder::
When creating spiders, be sure not to name them equal to the project's name
or you won't be able to import modules from your project in your spider!
-The first line imports the class BaseSpider. For the purpose of creating a
-working spider, you must subclass BaseSpider, and then define the three main,
-mandatory, attributes:
+The first line imports the class :class:`scrapy.spider.BaseSpider`. For the
+purpose of creating a working spider, you must subclass
+:class:`scrapy.spider.BaseSpider`, and then define the three main, mandatory,
+attributes:
* ``domain_name``: identifies the spider. It must be unique, that is, you can't
set the same domain name for different spiders.
-* ``start_urls``: is a list
-of URLs where the spider will begin to crawl from.
+* ``start_urls``: is a list of URLs where the spider will begin to crawl from.
So, the first pages downloaded will be those listed here. The subsequent URLs
will be generated successively from data contained in the start URLs.
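That last point, start URLs seeding the crawl and further URLs coming out of
the downloaded pages, can be pictured with a tiny self-contained loop. This is
not Scrapy code and not how the engine is implemented; it is only a toy sketch
of the fetch, parse and follow cycle described above, written in Python 2 to
match the era of this commit::

    import re
    import urllib2  # Python 2 standard library, matching this commit's era

    def toy_crawl(start_urls, max_pages=5):
        queue = list(start_urls)   # the first pages downloaded are these
        seen = set(queue)
        while queue and max_pages > 0:
            url = queue.pop(0)
            body = urllib2.urlopen(url).read()
            max_pages -= 1
            # "Scrape" something from the page: its <title> text.
            match = re.search(r"<title>(.*?)</title>", body, re.S)
            print url, "->", (match.group(1).strip() if match else "(no title)")
            # Follow further URLs found in the downloaded content.
            for link in re.findall(r'href="(http[^"]+)"', body):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)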