= SEP-017: Spider Contracts =

[[PageOutline(2-5,Contents)]]

||'''SEP:'''||17||
||'''Title:'''||Spider Contracts||
||'''Author:'''||Insophia Team||
||'''Created:'''||2010-06-10||
||'''Status:'''||Draft||

== Introduction ==

The motivation for Spider Contracts is to provide a lightweight mechanism for testing your spiders, so that the tests run quickly without having to wait for the whole spider to run. It is partially based on the [http://en.wikipedia.org/wiki/Design_by_contract Design by contract] approach (hence the name), where you define certain conditions that spider callbacks must meet, and you provide sample pages for testing.

== How it works ==

In the docstring of your spider callbacks, you write certain tags that define the spider contract: for example, the URL of a sample page for that callback, and what you expect to scrape from it.

Then you can run a command to check that the spider contracts are met.

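One plausible way for such a command to read the tags is to scan each callback docstring for lines starting with {{{@}}}. The SEP does not specify the parsing rules, so the following is only a sketch:

{{{
#!python
import re

def parse_contract_tags(callback):
    """Return a {tag_name: argument_string} dict from a callback docstring."""
    tags = {}
    for line in (callback.__doc__ or "").splitlines():
        m = re.match(r"\s*@(\w+)\s*(.*)", line)
        if m:
            tags[m.group(1)] = m.group(2).strip()
    return tags
}}}

Applied to the {{{parse_product}}} callback shown below, this would return the {{{url}}} and {{{scrapes}}} tags along with their arguments.
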
== Contract examples ==

=== Example URL for simple callback ===

The {{{parse_product}}} callback must return items containing the fields given in {{{@scrapes}}}.

{{{
#!python
from scrapy.spider import BaseSpider

class ProductSpider(BaseSpider):

    def parse_product(self, response):
        """
        @url http://www.example.com/store/product.php?id=123
        @scrapes name, price, description
        """
        # ... extract and return the product item here
}}}
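
The SEP does not spell out how the checker validates {{{@scrapes}}}. As a rough sketch, assuming the items have already been scraped from the sample page, the field check could look like this (the function name is made up for illustration):

{{{
#!python
def validate_scrapes(items, required_fields):
    """Return (item_index, missing_fields) pairs for items failing @scrapes."""
    failures = []
    for i, item in enumerate(items):
        missing = [f for f in required_fields if f not in item]
        if missing:
            failures.append((i, missing))
    return failures

# e.g. an item missing "description" fails the contract above
print(validate_scrapes([{"name": "Foo", "price": "9.99"}],
                       ["name", "price", "description"]))
}}}
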
=== Chained callbacks ===

The following spider contains two callbacks: one for logging into a site, and the other for scraping user profile info.

The contracts assert that the first callback returns a Request and the second one scrapes the {{{user, name, email}}} fields.

{{{
#!python
from scrapy.spider import BaseSpider

class UserProfileSpider(BaseSpider):

    def parse_login_page(self, response):
        """
        @url http://www.example.com/login.php
        @returns_request
        """
        # returns Request with callback=self.parse_profile_page

    def parse_profile_page(self, response):
        """
        @after parse_login_page
        @scrapes user, name, email
        """
        # ...
}}}
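
The SEP leaves the execution model for {{{@after}}} implicit. One plausible reading, as a sketch only (the {{{fetch}}} helper is hypothetical, standing in for however the checker downloads pages):

{{{
#!python
from scrapy.http import Request

def check_user_profile_contracts(spider, fetch):
    # @url + @returns_request: the login callback must return one
    # (and only one) Request
    response = fetch("http://www.example.com/login.php")
    results = list(spider.parse_login_page(response) or [])
    assert len(results) == 1 and isinstance(results[0], Request)

    # @after parse_login_page: call the profile callback with the
    # response generated by that Request, then apply its @scrapes check
    profile_response = fetch(results[0].url)
    for item in spider.parse_profile_page(profile_response) or []:
        for field in ("user", "name", "email"):
            assert field in item
}}}
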
== Tags reference ==

Note that tags can also be extended by users, meaning that you can have your own custom contract tags in your Scrapy project (see the sketch after the constraints below).

||{{{@url}}} || URL of a sample page parsed by the callback ||
||{{{@after}}} || the callback is called with the response generated by the specified callback ||
||{{{@scrapes}}} || list of fields that must be present in the item(s) scraped by the callback ||
||{{{@returns_request}}} || the callback must return one (and only one) Request ||


Some tag constraints:

* a callback cannot contain both {{{@url}}} and {{{@after}}}
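
The SEP does not define the extension API, so the registry below is purely hypothetical; it only illustrates what declaring a custom tag in your project might look like:

{{{
#!python
CONTRACT_TAGS = {}  # hypothetical registry: tag name -> checker function

def contract_tag(name):
    """Register a checker function for a custom contract tag."""
    def register(func):
        CONTRACT_TAGS[name] = func
        return func
    return register

@contract_tag("returns_items")
def check_returns_items(callback, response, args):
    # custom tag: the callback must return at least args[0] items
    items = list(callback(response) or [])
    assert len(items) >= int(args[0])
}}}

A callback could then put {{{@returns_items 1}}} in its docstring, and the checker would look the tag up in the registry.
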
== Checking spider contracts ==

To check the contracts of a single spider:

{{{
scrapy-ctl.py check example.com
}}}

Or to check all spiders:

{{{
scrapy-ctl.py check
}}}

No need to wait for the whole spider to run.