.. _topics-shell:

============
Scrapy shell
============

The Scrapy shell is an interactive shell where you can try and debug your
scraping code very quickly, without having to run the spider. It's meant to be
used for testing data extraction code, but you can actually use it for testing
any kind of code as it is also a regular Python shell.

The shell is used for testing XPath expressions, seeing how they work and what
data they extract from the web pages you're trying to scrape. It allows you to
interactively test your XPaths while you're writing your spider, without having
to run the spider to test every change.

Once you get familiarized with the Scrapy shell you'll see that it's an
invaluable tool for developing and debugging your spiders.

If you have `IPython`_ installed, the Scrapy shell will use it (instead of the
standard Python console). The `IPython`_ console is much more powerful and
provides smart auto-completion and colorized output, among other things.

We highly recommend you install `IPython`_, especially if you're working on
Unix systems (where `IPython`_ excels). See the `IPython installation guide`_
for more info.

.. _IPython: http://ipython.scipy.org/
.. _IPython installation guide: http://ipython.scipy.org/doc/rel-0.9.1/html/install/index.html

Launch the shell
================

To launch the shell type::

    scrapy-ctl.py shell <url>

Where ``<url>`` is the URL you want to scrape.
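
For example, to open the shell on the Scrapy homepage (the same page used in
the example session below)::

    scrapy-ctl.py shell http://scrapy.org
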
Using the shell
===============

The Scrapy shell is just a regular Python console (or `IPython`_ shell if you
have it available) which provides some additional functions available by
default (as shortcuts):

Built-in Shortcuts
------------------

* ``shelp()`` - print a help message with the list of available objects and shortcuts

* ``fetch(request_or_url)`` - fetch a new response from the given request or
  URL and update all related objects accordingly.

* ``view(response)`` - open the given response in your local web browser, for
  inspection. Note that this will generate a temporary file which won't be
  removed automatically.
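
For instance, once the shell is open, you could fetch a page and review it in
your browser using these shortcuts (a quick sketch; the output each command
prints is omitted here)::

    >>> fetch("http://scrapy.org")
    >>> view(response)
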
Built-in Objects
----------------

The Scrapy shell automatically creates some convenient objects from the
downloaded page, like the :class:`~scrapy.http.Response` object and the
:class:`~scrapy.selector.XPathSelector` objects (for both HTML and XML
content).

Those objects are:

* ``url`` - the URL being analyzed

* ``spider`` - the Spider which is known to handle the URL, or a
  :class:`~scrapy.spider.BaseSpider` object if no spider is found for
  the current URL

* ``request`` - a :class:`~scrapy.http.Request` object of the last fetched
  page. You can modify this request using :meth:`~scrapy.http.Request.replace`
  or fetch a new request (without leaving the shell) using the ``fetch``
  shortcut.

* ``response`` - a :class:`~scrapy.http.Response` object containing the last
  fetched page

* ``hxs`` - a :class:`~scrapy.selector.HtmlXPathSelector` object constructed
  with the last response fetched

* ``xxs`` - a :class:`~scrapy.selector.XmlXPathSelector` object constructed
  with the last response fetched
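
For instance, right after fetching a page you could inspect these objects
directly (a minimal sketch; the actual values depend on the page you fetched)::

    >>> url
    'http://scrapy.org'
    >>> hxs.select("//title/text()").extract()
    [u'...']

A full session along these lines is shown in the next section.
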
Example of shell session
========================

Here's an example of a typical shell session where we start by scraping the
http://scrapy.org page, and then proceed to scrape the http://slashdot.org
page. Finally, we modify the (Slashdot) request method to POST and re-fetch it,
getting an HTTP 405 (method not allowed) error. We end the session by typing
Ctrl-D (on Unix systems) or Ctrl-Z (on Windows).

Keep in mind that the data extracted here may not be the same when you try it,
as those pages are not static and could have changed by the time you test this.
The only purpose of this example is to get you familiarized with how the Scrapy
shell works.

First, we launch the shell::

    python scrapy-ctl.py shell http://scrapy.org --nolog

Then, the shell fetches the URL (using the Scrapy downloader) and prints the
list of available objects and some help::

    Fetching <http://scrapy.org>...
    Available objects
    =================

    xxs : <XmlXPathSelector (http://scrapy.org) xpath=None>
    url : http://scrapy.org
    request : <http://scrapy.org>
    spider : <scrapy.spider.models.BaseSpider object at 0x2bed9d0>
    hxs : <HtmlXPathSelector (http://scrapy.org) xpath=None>
    item : Item()
    response : <http://scrapy.org>

    Available shortcuts
    ===================

    shelp() : Prints this help.
    fetch(req_or_url) : Fetch a new request or URL and update objects
    view(response) : View response in a browser

    Python 2.6.2 (release26-maint, Apr 19 2009, 01:58:18)
    Type "help", "copyright", "credits" or "license" for more information.

    >>>

After that, we can start playing with the objects::

    >>> hxs.select("//h2/text()").extract()[0]
    u'Welcome to Scrapy'
    >>> fetch("http://slashdot.org")
    Fetching <http://slashdot.org>...
    Done - use shelp() to see available objects
    >>> hxs.select("//h2/text()").extract()
    [u'News for nerds, stuff that matters']
    >>> request = request.replace(method="POST")
    >>> fetch(request)
    Fetching <POST http://slashdot.org>...
    2009-04-03 00:57:39-0300 [scrapybot] ERROR: Downloading <http://slashdot.org> from <None>: 405 Method Not Allowed
    >>>