======= ================
SEP     17
Title   Spider Contracts
Author  Insophia Team
Created 2010-06-10
Status  Draft
======= ================

=========================
SEP-017: Spider Contracts
=========================

The motivation for Spider Contracts is to build a lightweight mechanism for
testing your spiders, and be able to run the tests quickly without having to
wait for the whole spider to run. It's partially based on the `Design by
contract <http://en.wikipedia.org/wiki/Design_by_contract>`_ approach (hence
its name), where you define certain conditions that spider callbacks must
meet, and you give example testing pages.

How it works
============

In the docstring of your spider callbacks, you write certain tags that define
the spider contract. For example, the URL of a sample page for that callback,
and what you expect to scrape from it.

Then you can run a command to check that the spider contracts are met.
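
Not part of the proposal itself, but to make the mechanism concrete, here is a
minimal sketch of how a checker could extract the tags from a callback
docstring. The ``extract_tags`` helper and the regular expression are
illustrative assumptions, not a specified API.

::

    #!python
    # Illustrative sketch only: one possible way a contract checker could
    # pull the tags out of a callback docstring. The extract_tags name and
    # the regex are assumptions made for this example.
    import re

    TAG_RE = re.compile(r"^\s*@(?P<name>\w+)\s*(?P<args>.*)$", re.MULTILINE)

    def extract_tags(callback):
        """Return (name, args) pairs for each tag in the callback docstring."""
        docstring = callback.__doc__ or ""
        return [(m.group("name"), m.group("args").strip())
                for m in TAG_RE.finditer(docstring)]

    # extract_tags(ProductSpider.parse_product) would yield:
    # [('url', 'http://www.example.com/store/product.php?id=123'),
    #  ('scrapes', 'name, price, description')]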

Contract examples
=================

Example URL for simple callback
-------------------------------

The ``parse_product`` callback must return items containing the fields given
in ``@scrapes``.

::

    #!python
    class ProductSpider(BaseSpider):

        def parse_product(self, response):
            """
            @url http://www.example.com/store/product.php?id=123
            @scrapes name, price, description
            """
            # ... parse the product page and return the scraped item

Chained callbacks
-----------------

The following spider contains two callbacks: one for logging into a site, and
the other for scraping user profile info.

The contracts assert that the first callback returns a Request and the second
one scrapes the ``user, name, email`` fields.

::

    #!python
    class UserProfileSpider(BaseSpider):

        def parse_login_page(self, response):
            """
            @url http://www.example.com/login.php
            @returns_request
            """
            # returns Request with callback=self.parse_profile_page

        def parse_profile_page(self, response):
            """
            @after parse_login_page
            @scrapes user, name, email
            """
            # ...
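
Spelled out, the checker could run this chain roughly as follows. The
``run_chain`` and ``fetch`` helpers (download a URL and build a response) are
assumptions for illustration, ``check_scrapes`` is the sketch from above, and
callbacks are assumed to return iterables.

::

    #!python
    # Illustrative sketch of checking chained contracts; run_chain and
    # fetch are hypothetical helpers, not part of this proposal.
    from scrapy.http import Request

    def run_chain(spider, fetch):
        # @url: feed parse_login_page the response of its sample page
        login_response = fetch("http://www.example.com/login.php")
        results = list(spider.parse_login_page(login_response))

        # @returns_request: exactly one Request must come back
        assert len(results) == 1 and isinstance(results[0], Request)

        # @after: parse_profile_page gets the response generated by
        # following the Request that parse_login_page returned
        items = list(spider.parse_profile_page(fetch(results[0].url)))
        check_scrapes(items, ["user", "name", "email"])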

Tags reference
==============

Note that tags can also be extended by users, meaning that you can have your
own custom contract tags in your Scrapy project.

==================== ==========================================================
``@url``             url of a sample page parsed by the callback
``@after``           the callback is called with the response generated by the
                     specified callback
``@scrapes``         list of fields that must be present in the item(s) scraped
                     by the callback
``@returns_request`` the callback must return one (and only one) Request
==================== ==========================================================

Some tag constraints:

* a callback cannot contain both ``@url`` and ``@after``
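
Since tags are meant to be user-extensible, here is a minimal sketch of what
registering a custom tag could look like. The ``CONTRACT_TAGS`` registry and
the handler signature are assumptions about a possible extension API; this
SEP does not fix them.

::

    #!python
    # Illustrative sketch of a user-defined contract tag; the registry
    # and handler signature are assumptions, not a specified API.

    CONTRACT_TAGS = {}

    def contract_tag(name):
        """Register a handler function for a custom @name tag."""
        def decorator(func):
            CONTRACT_TAGS[name] = func
            return func
        return decorator

    @contract_tag("returns_items")
    def returns_items(callback_output, args):
        """Custom tag example: the callback must return at least one object."""
        assert list(callback_output), "callback returned nothing"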

Checking spider contracts
=========================

To check the contracts of a single spider:

::

    scrapy-ctl.py check example.com

Or to check all spiders:

::

    scrapy-ctl.py check

No need to wait for the whole spider to run.