.. _topics-shell:

============
Scrapy shell
============

The Scrapy shell is an interactive shell where you can try and debug your
scraping code very quickly, without having to run the spider. It's meant to be
used for testing data extraction code, but you can actually use it for testing
any kind of code as it is also a regular Python shell.

The shell is used for testing XPath expressions, seeing how they work and what
data they extract from the web pages you're trying to scrape. It allows you to
interactively test your XPaths while you're writing your spider, without having
to run the spider to test every change.

Once you get familiarized with the Scrapy shell you'll see that it's an
invaluable tool for developing and debugging your spiders.

If you have `IPython`_ installed, the Scrapy shell will use it (instead of the
standard Python console). The `IPython`_ console is much more powerful and
provides smart auto-completion and colorized output, among other things.

We highly recommend you install `IPython`_, especially if you're working on
Unix systems (where `IPython`_ excels). See the `IPython installation guide`_
for more info.

.. _IPython: http://ipython.scipy.org/
.. _IPython installation guide: http://ipython.scipy.org/doc/rel-0.9.1/html/install/index.html

Launch the shell
================

To launch the shell type::

    scrapy-ctl.py shell <url>

Where ``<url>`` is the URL you want to scrape.
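
For example, to open the shell on the Scrapy homepage (the same page used in
the example session below)::

    scrapy-ctl.py shell http://scrapy.org
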
Using the shell
===============

The Scrapy shell is just a regular Python console (or `IPython`_ shell if you
have it available) which provides some additional functions available by
default (as shortcuts):

Built-in Shortcuts
------------------

* ``shelp()`` - print a help message with the list of available objects and shortcuts

* ``fetch(request_or_url)`` - fetch a new response from the given request or
  URL and update all related objects accordingly.

* ``view(response)`` - open the given response in your local web browser, for
  inspection. Note that this will generate a temporary file which won't be
  removed automatically.
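
For instance, once the shell is open, you could fetch a page and review it in
your browser using these shortcuts (a quick sketch; the output each command
prints is omitted here)::

    >>> fetch("http://scrapy.org")
    >>> view(response)
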
Built-in Objects
----------------

The Scrapy shell automatically creates some convenient objects from the
downloaded page, like the :class:`~scrapy.http.Response` object and the
:class:`~scrapy.selector.XPathSelector` objects (for both HTML and XML
content).

Those objects are:

* ``url`` - the URL being analyzed

* ``spider`` - the Spider which is known to handle the URL, or a
  :class:`~scrapy.spider.BaseSpider` object if no spider is found for
  the current URL

* ``request`` - a :class:`~scrapy.http.Request` object of the last fetched
  page. You can modify this request using :meth:`~scrapy.http.Request.replace`
  or fetch a new request (without leaving the shell) using the ``fetch``
  shortcut.

* ``response`` - a :class:`~scrapy.http.Response` object containing the last
  fetched page

* ``hxs`` - a :class:`~scrapy.selector.HtmlXPathSelector` object constructed
  with the last response fetched

* ``xxs`` - a :class:`~scrapy.selector.XmlXPathSelector` object constructed
  with the last response fetched
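
For instance, right after fetching a page you could inspect these objects
directly (a minimal sketch; the actual values depend on the page you fetched)::

    >>> url
    'http://scrapy.org'
    >>> hxs.select("//title/text()").extract()
    [u'...']

A full session along these lines is shown in the next section.
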
Example of shell session
========================

Here's an example of a typical shell session where we start by scraping the
http://scrapy.org page, and then proceed to scrape the http://slashdot.org
page. Finally, we modify the (Slashdot) request method to POST and re-fetch it,
getting an HTTP 405 (method not allowed) error. We end the session by typing
Ctrl-D (on Unix systems) or Ctrl-Z (on Windows).

Keep in mind that the data extracted here may not be the same when you try it,
as those pages are not static and could have changed by the time you test this.
The only purpose of this example is to get you familiarized with how the Scrapy
shell works.

First, we launch the shell::

    python scrapy-ctl.py shell http://scrapy.org --nolog

Then, the shell fetches the URL (using the Scrapy downloader) and prints the
list of available objects and some help::

    Fetching <http://scrapy.org>...
    Available objects
    =================

    xxs : <XmlXPathSelector (http://scrapy.org) xpath=None>
    url : http://scrapy.org
    request : <http://scrapy.org>
    spider : <scrapy.spider.models.BaseSpider object at 0x2bed9d0>
    hxs : <HtmlXPathSelector (http://scrapy.org) xpath=None>
    item : Item()
    response : <http://scrapy.org>

    Available shortcuts
    ===================

    shelp() : Prints this help.
    fetch(req_or_url) : Fetch a new request or URL and update objects
    view(response) : View response in a browser

    Python 2.6.2 (release26-maint, Apr 19 2009, 01:58:18)
    Type "help", "copyright", "credits" or "license" for more information.

    >>>

After that, we can start playing with the objects::

    >>> hxs.select("//h2/text()").extract()[0]
    u'Welcome to Scrapy'
    >>> fetch("http://slashdot.org")
    Fetching <http://slashdot.org>...
    Done - use shelp() to see available objects
    >>> hxs.select("//h2/text()").extract()
    [u'News for nerds, stuff that matters']
    >>> request = request.replace(method="POST")
    >>> fetch(request)
    Fetching <POST http://slashdot.org>...
    2009-04-03 00:57:39-0300 [scrapybot] ERROR: Downloading <http://slashdot.org> from <None>: 405 Method Not Allowed
    >>>