formatting changes and references to spiders added in the tutorial
--HG-- extra : convert_revision : svn:b85faa78-f9eb-468e-a121-7cced6da292c@809
commit 7e6c9c2e25 (parent 2526e11be0)
@@ -43,16 +43,16 @@ These are basically:
 * ``dmoz/spiders/``: a directory where you'll later put your spiders.
 * ``dmoz/templates/``: directory containing the spider's templates.

-The use of these files will be clarified throughout the tutorial; now let's go into
-spiders.
+The use of these files will be clarified throughout the tutorial; now let's go
+into spiders.

 Spiders
 =======

 Spiders are custom modules written by you, the user, to scrape information from
-a certain domain. Their duty is to feed the Scrapy engine with URLs to
-download, and then parse the downloaded contents in search of data or more
-URLs to follow.
+a certain domain. Their duty is to feed the Scrapy engine with URLs to download,
+and then parse the downloaded contents in search of data or more URLs to
+follow.

 They are the heart of a Scrapy project, and where most of the action takes
 place.
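
The feed-and-parse loop described above is easier to see in running code.
The sketch below uses today's Scrapy API (``scrapy.Spider``,
``response.follow`` and CSS selectors all postdate this commit), and the
site and selectors are illustrative assumptions, not part of the original
tutorial::

    import scrapy

    class BooksSpider(scrapy.Spider):
        """Minimal spider: download pages, extract data, follow more URLs."""
        name = 'books'
        start_urls = ['http://books.toscrape.com/']

        def parse(self, response):
            # "parse the downloaded contents in search of data": yield one
            # record per book title found on the page.
            for title in response.css('h3 a::attr(title)').getall():
                yield {'title': title}
            # "or more URLs to follow": hand the next-page link back to the
            # engine as a new request, closing the loop.
            next_page = response.css('li.next a::attr(href)').get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Each yielded request goes back to the engine's download queue; the spider
never fetches pages itself, it only decides what to extract and where to
go next.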
@@ -60,7 +60,6 @@ place.
 To create our first spider, save this code in a file named ``dmoz_spider.py``
 inside the ``dmoz/spiders`` folder::

-
     from scrapy.spider import BaseSpider

     class OpenDirectorySpider(BaseSpider):
@@ -82,15 +81,15 @@ inside the ``dmoz/spiders`` folder::
 When creating spiders, be sure not to give them the same name as the project,
 or you won't be able to import modules from your project in your spider!

-The first line imports the class BaseSpider. To create a working
-spider, you must subclass BaseSpider, and then define the three main,
-mandatory attributes:
+The first line imports the class :class:`scrapy.spider.BaseSpider`. To create
+a working spider, you must subclass :class:`scrapy.spider.BaseSpider` and then
+define the three main, mandatory attributes:

 * ``domain_name``: identifies the spider. It must be unique, that is, you can't
   set the same domain name for different spiders.

-* ``start_urls``: is a list
-  of URLs where the spider will begin to crawl from.
+* ``start_urls``: a list of URLs where the spider will begin to crawl from.
   So, the first pages downloaded will be those listed here. The subsequent URLs
   will be generated successively from data contained in the start URLs.
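
For reference, here is one way the complete ``dmoz_spider.py`` of this era
could have looked. ``domain_name`` and ``start_urls`` are confirmed by the
text above; the specific URL, the ``parse()`` callback as the third
mandatory attribute, and the module-level ``SPIDER`` instance are
assumptions drawn from Scrapy documentation of the same vintage, since the
diff cuts the original file short::

    from scrapy.spider import BaseSpider

    class OpenDirectorySpider(BaseSpider):
        # Unique identifier for the spider; no two spiders may share it.
        domain_name = 'dmoz.org'

        # The crawl starts from these pages; subsequent URLs are generated
        # from data found in the downloaded contents.
        start_urls = [
            'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        ]

        # Callback invoked with each downloaded response (assumed; the hunk
        # above ends before the original method body is shown).
        def parse(self, response):
            # Save the raw page so the crawl's output can be inspected.
            filename = response.url.split('/')[-2]
            open(filename, 'wb').write(response.body)

    # Old-style Scrapy located spiders through a module-level instance.
    SPIDER = OpenDirectorySpider()

With the file in place, a crawl of this vintage would have been launched
with the project's control script (``./scrapy-ctl.py crawl dmoz.org``;
modern Scrapy uses ``scrapy crawl`` instead).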
|