= SEP-017: Spider Contracts =

[[PageOutline(2-5,Contents)]]

||'''SEP:'''||17||
||'''Title:'''||Spider Contracts||
||'''Author:'''||Insophia Team||
||'''Created:'''||2010-06-10||
||'''Status:'''||Draft||

== Introduction ==

The motivation for Spider Contracts is to provide a lightweight mechanism for testing your spiders, so that the tests run quickly without having to wait for the whole spider to run. It is partially based on the [http://en.wikipedia.org/wiki/Design_by_contract Design by contract] approach (hence the name), where you define certain conditions that spider callbacks must meet, and you provide sample pages for testing.

== How it works ==

In the docstring of your spider callbacks, you write certain tags that define the spider contract: for example, the URL of a sample page for that callback, and what you expect to scrape from it.

Then you can run a command to check that the spider contracts are met.

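One plausible way for such a command to read the tags is to scan each callback docstring for lines starting with {{{@}}}. The SEP does not specify the parsing rules, so the following is only a sketch:

{{{
#!python
import re

def parse_contract_tags(callback):
    """Return a {tag_name: argument_string} dict from a callback docstring."""
    tags = {}
    for line in (callback.__doc__ or "").splitlines():
        m = re.match(r"\s*@(\w+)\s*(.*)", line)
        if m:
            tags[m.group(1)] = m.group(2).strip()
    return tags
}}}

Applied to the {{{parse_product}}} callback shown below, this would return the {{{url}}} and {{{scrapes}}} tags along with their arguments.
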
== Contract examples ==

=== Example URL for simple callback ===

The {{{parse_product}}} callback must return items containing the fields given in {{{@scrapes}}}.

{{{
#!python
from scrapy.spider import BaseSpider

class ProductSpider(BaseSpider):

    def parse_product(self, response):
        """
        @url http://www.example.com/store/product.php?id=123
        @scrapes name, price, description
        """
        # ... extract and return the product item here
}}}
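
The SEP does not spell out how the checker validates {{{@scrapes}}}. As a rough sketch, assuming the items have already been scraped from the sample page, the field check could look like this (the function name is made up for illustration):

{{{
#!python
def validate_scrapes(items, required_fields):
    """Return (item_index, missing_fields) pairs for items failing @scrapes."""
    failures = []
    for i, item in enumerate(items):
        missing = [f for f in required_fields if f not in item]
        if missing:
            failures.append((i, missing))
    return failures

# e.g. an item missing "description" fails the contract above
print(validate_scrapes([{"name": "Foo", "price": "9.99"}],
                       ["name", "price", "description"]))
}}}
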
=== Chained callbacks ===

The following spider contains two callbacks: one for logging into a site, and the other for scraping user profile info.

The contracts assert that the first callback returns a Request and the second one scrapes the {{{user, name, email}}} fields.

{{{
#!python
from scrapy.spider import BaseSpider

class UserProfileSpider(BaseSpider):

    def parse_login_page(self, response):
        """
        @url http://www.example.com/login.php
        @returns_request
        """
        # returns Request with callback=self.parse_profile_page

    def parse_profile_page(self, response):
        """
        @after parse_login_page
        @scrapes user, name, email
        """
        # ...
}}}
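
The SEP leaves the execution model for {{{@after}}} implicit. One plausible reading, as a sketch only (the {{{fetch}}} helper is hypothetical, standing in for however the checker downloads pages):

{{{
#!python
from scrapy.http import Request

def check_user_profile_contracts(spider, fetch):
    # @url + @returns_request: the login callback must return one
    # (and only one) Request
    response = fetch("http://www.example.com/login.php")
    results = list(spider.parse_login_page(response) or [])
    assert len(results) == 1 and isinstance(results[0], Request)

    # @after parse_login_page: call the profile callback with the
    # response generated by that Request, then apply its @scrapes check
    profile_response = fetch(results[0].url)
    for item in spider.parse_profile_page(profile_response) or []:
        for field in ("user", "name", "email"):
            assert field in item
}}}
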
== Tags reference ==

Note that tags can also be extended by users, meaning that you can have your own custom contract tags in your Scrapy project (see the sketch after the constraints below).

||{{{@url}}} || URL of a sample page parsed by the callback ||
||{{{@after}}} || the callback is called with the response generated by the specified callback ||
||{{{@scrapes}}} || list of fields that must be present in the item(s) scraped by the callback ||
||{{{@returns_request}}} || the callback must return one (and only one) Request ||


Some tag constraints:

* a callback cannot contain both {{{@url}}} and {{{@after}}}
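
The SEP does not define the extension API, so the registry below is purely hypothetical; it only illustrates what declaring a custom tag in your project might look like:

{{{
#!python
CONTRACT_TAGS = {}  # hypothetical registry: tag name -> checker function

def contract_tag(name):
    """Register a checker function for a custom contract tag."""
    def register(func):
        CONTRACT_TAGS[name] = func
        return func
    return register

@contract_tag("returns_items")
def check_returns_items(callback, response, args):
    # custom tag: the callback must return at least args[0] items
    items = list(callback(response) or [])
    assert len(items) >= int(args[0])
}}}

A callback could then put {{{@returns_items 1}}} in its docstring, and the checker would look the tag up in the registry.
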
== Checking spider contracts ==

To check the contracts of a single spider:

{{{
scrapy-ctl.py check example.com
}}}

Or to check all spiders:

{{{
scrapy-ctl.py check
}}}

No need to wait for the whole spider to run.