.. _intro-tutorial:

===============
Scrapy Tutorial
===============

In this tutorial, we'll assume that Scrapy is already installed on your system.
If that's not the case, see :ref:`intro-install`.
We are going to scrape `quotes.toscrape.com <http://quotes.toscrape.com/>`_, a website
that lists quotes from famous authors.

This tutorial will walk you through these tasks:

1. Creating a new Scrapy project
2. Writing a :ref:`spider <topics-spiders>` to crawl a site and extract data
3. Exporting the scraped data using the command line
4. Changing the spider to recursively follow links
5. Using spider arguments

Scrapy is written in Python_. If you're new to the language you might want to
start by getting an idea of what the language is like, to get the most out of
Scrapy.
If you're already familiar with other languages, and want to learn Python
quickly, we recommend reading through `Dive Into Python 3`_. Alternatively,
you can follow the `Python Tutorial`_.

If you're new to programming and want to start with Python, you may find the
online book `Learn Python The Hard Way`_ useful. You can also take a look at
`this list of Python resources for non-programmers`_.

.. _Python: https://www.python.org/
.. _this list of Python resources for non-programmers: https://wiki.python.org/moin/BeginnersGuide/NonProgrammers
.. _Dive Into Python 3: http://www.diveintopython3.net
.. _Python Tutorial: https://docs.python.org/3/tutorial
.. _Learn Python The Hard Way: http://learnpythonthehardway.org/book/

Creating a project
==================

Before you start scraping, you will have to set up a new Scrapy project. Enter a
directory where you'd like to store your code and run::

    scrapy startproject tutorial

This will create a ``tutorial`` directory with the following contents::

    tutorial/
        scrapy.cfg            # deploy configuration file
        tutorial/             # project's Python module, you'll import your code from here
            __init__.py
            items.py          # project items definition file
            pipelines.py      # project pipelines file
            settings.py       # project settings file
            spiders/          # a directory where you'll later put your spiders
                __init__.py

Our first Spider
================
Spiders are classes that you define and that Scrapy uses to scrape information
from a website (or a group of websites). They must subclass
:class:`scrapy.Spider` and define the initial requests to make, optionally how
to follow links in the pages, and how to parse the downloaded page content to
extract data.
This is the code for our first Spider. Save it in a file named
``quotes_spider.py`` under the ``tutorial/spiders`` directory in your project::

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)

As you can see, our Spider subclasses :class:`scrapy.Spider <scrapy.spiders.Spider>`
and defines some attributes and methods:

* :attr:`~scrapy.spiders.Spider.name`: identifies the Spider. It must be
  unique within a project, that is, you can't set the same name for different
  Spiders.

* :meth:`~scrapy.spiders.Spider.start_requests`: must return an iterable of
  Requests (you can return a list of requests or write a generator function)
  which the Spider will begin to crawl from. Subsequent requests will be
  generated successively from these initial requests.

* :meth:`~scrapy.spiders.Spider.parse`: a method that will be called to handle
  the response downloaded for each of the requests made. The response parameter
  is an instance of :class:`~scrapy.http.TextResponse` that holds
  the page content and has further helpful methods to handle it.

The :meth:`~scrapy.spiders.Spider.parse` method usually parses the response, extracting
the scraped data as dicts and also finding new URLs to
follow and creating new requests (:class:`~scrapy.http.Request`) from them.
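
For illustration only, a ``parse`` callback that does both of those things
might look roughly like the following sketch (the CSS selectors here are
placeholders, not taken from any real page)::

    def parse(self, response):
        # yield one dict per item of interest found in the page
        for entry in response.css('div.entry'):
            yield {'title': entry.css('a::text').extract_first()}

        # ...and yield new requests to keep the crawl going
        next_url = response.css('a.next::attr(href)').extract_first()
        if next_url is not None:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
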
How to run our spider
---------------------
To put our spider to work, go to the project's top level directory and run::

    scrapy crawl quotes

This command runs the spider with name ``quotes`` that we've just added, which
will send some requests to the ``quotes.toscrape.com`` domain. You will get an
output similar to this::

    ... (omitted for brevity)
    2016-09-20 14:48:00 [scrapy] INFO: Spider opened
    2016-09-20 14:48:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-09-20 14:48:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2016-09-20 14:48:00 [scrapy] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
    2016-09-20 14:48:00 [scrapy] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
    2016-09-20 14:48:01 [quotes] DEBUG: Saved file quotes-1.html
    2016-09-20 14:48:01 [scrapy] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
    2016-09-20 14:48:01 [quotes] DEBUG: Saved file quotes-2.html
    2016-09-20 14:48:01 [scrapy] INFO: Closing spider (finished)
    ...

Now, check the files in the current directory. You should notice that two new
files have been created: *quotes-1.html* and *quotes-2.html*, with the content
for the respective URLs, as our ``parse`` method instructs.

.. note:: If you are wondering why we haven't parsed the HTML yet, hold
   on, we will cover that soon.

What just happened under the hood?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Scrapy schedules the :class:`scrapy.Request <scrapy.http.Request>` objects
returned by the ``start_requests`` method of the Spider. Upon receiving a
response for each one, it instantiates :class:`~scrapy.http.Response` objects
and calls the callback method associated with the request (in this case, the
``parse`` method) passing the response as argument.

A shortcut to the start_requests method
---------------------------------------

Instead of implementing a :meth:`~scrapy.spiders.Spider.start_requests` method
that generates :class:`scrapy.Request <scrapy.http.Request>` objects from URLs,
you can just define a :attr:`~scrapy.spiders.Spider.start_urls` class attribute
with a list of URLs. This list will then be used by the default implementation
of :meth:`~scrapy.spiders.Spider.start_requests` to create the initial requests
for your spider::

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)

The :meth:`~scrapy.spiders.Spider.parse` method will be called to handle each
of the requests for those URLs, even though we haven't explicitly told Scrapy
to do so. This happens because :meth:`~scrapy.spiders.Spider.parse` is Scrapy's
default callback method, which is called for requests without an explicitly
assigned callback.

Extracting data
---------------

The best way to learn how to extract data with Scrapy is by trying selectors
in the :ref:`Scrapy shell <topics-shell>`. Run::

    scrapy shell 'http://quotes.toscrape.com/page/1/'

.. note::

   Remember to always enclose URLs in quotes when running the Scrapy shell from
   the command line, otherwise URLs containing arguments (i.e. the ``&``
   character) will not work.

You will see something like::

    [ ... Scrapy log here ... ]
    2016-09-19 12:09:27 [scrapy] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
    [s]   item       {}
    [s]   request    <GET http://quotes.toscrape.com/page/1/>
    [s]   response   <200 http://quotes.toscrape.com/page/1/>
    [s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
    [s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser
    >>>

Using the shell, you can try selecting elements using `CSS`_ with the response
object::

    >>> response.css('title')
    [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The result of running ``response.css('title')`` is a list-like object called
:class:`~scrapy.selector.SelectorList`, which represents a list of
:class:`~scrapy.selector.Selector` objects that wrap around XML/HTML elements
and allow you to run further queries to fine-grain the selection or extract the
data.
To extract the text from the title above, you can do::

    >>> response.css('title::text').extract()
    ['Quotes to Scrape']

There are two things to note here: one is that we've added ``::text`` to the
CSS query, to mean we want to select only the text elements directly inside
the ``<title>`` element. If we don't specify ``::text``, we'd get the full title
element, including its tags::

    >>> response.css('title').extract()
    ['<title>Quotes to Scrape</title>']

The other thing is that the result of calling ``.extract()`` is a list, because
we're dealing with an instance of :class:`~scrapy.selector.SelectorList`. When
you know you just want the first result, as in this case, you can do::

    >>> response.css('title::text').extract_first()
    'Quotes to Scrape'

As an alternative, you could've written::

    >>> response.css('title::text')[0].extract()
    'Quotes to Scrape'

However, using ``.extract_first()`` avoids an ``IndexError`` and returns
``None`` when it doesn't find any element matching the selection.
There's a lesson here: for most scraping code, you want it to be resilient to
errors due to things not being found on a page, so that even if some parts fail
to be scraped, you can at least get **some** data.
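
For example, with a selector that matches nothing on the page (the made-up
element name ``noelement`` below), the two approaches behave quite
differently::

    >>> response.css('noelement::text').extract_first() is None
    True
    >>> response.css('noelement::text')[0].extract()
    Traceback (most recent call last):
        ...
    IndexError: list index out of range
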
Besides the :meth:`~scrapy.selector.Selector.extract` and
:meth:`~scrapy.selector.SelectorList.extract_first` methods, you can also use
the :meth:`~scrapy.selector.Selector.re` method to extract using `regular
expressions`_::

    >>> response.css('title::text').re(r'Quotes.*')
    ['Quotes to Scrape']
    >>> response.css('title::text').re(r'Q\w+')
    ['Quotes']
    >>> response.css('title::text').re(r'(\w+) to (\w+)')
    ['Quotes', 'Scrape']

In order to find the proper CSS selectors to use, you might find it useful to
open the response page from the shell in your web browser using
``view(response)``. You can use your browser's developer tools or extensions
like Firebug (see the sections about :ref:`topics-firebug` and
:ref:`topics-firefox`).

`Selector Gadget`_ is also a nice tool to quickly find the CSS selector for
visually selected elements, which works in many browsers.

.. _regular expressions: https://docs.python.org/3/library/re.html
.. _Selector Gadget: http://selectorgadget.com/

XPath: a brief intro
^^^^^^^^^^^^^^^^^^^^

Besides `CSS`_, Scrapy selectors also support using `XPath`_ expressions::

    >>> response.xpath('//title')
    [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
    >>> response.xpath('//title/text()').extract_first()
    'Quotes to Scrape'

XPath expressions are very powerful, and are the foundation of Scrapy
Selectors. In fact, CSS selectors are converted to XPath under the hood. You
can see that if you look closely at the text representation of the selector
objects in the shell.

While perhaps not as popular as CSS selectors, XPath expressions offer more
power because, besides navigating the structure, they can also look at the
content. Using XPath, you're able to select things like: *select the link
that contains the text "Next Page"*. This makes XPath very fitting to the task
of scraping, and we encourage you to learn XPath even if you already know how
to construct CSS selectors: it will make scraping much easier.
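
As a quick illustration of that kind of query (assuming the pagination markup
shown later in this tutorial, where the next-page link is an ``<a>`` element
whose text contains "Next"), you could try something like this in the shell::

    >>> response.xpath('//a[contains(text(), "Next")]/@href').extract_first()
    '/page/2/'
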
We won't cover much of XPath here, but you can read more about :ref:`using XPath
with Scrapy Selectors here <topics-selectors>`. To learn more about XPath, we
recommend `this tutorial to learn XPath through examples
<http://zvon.org/comp/r/tut-XPath_1.html>`_, and `this tutorial to learn "how
to think in XPath" <http://plasmasturm.org/log/xpath101/>`_.

.. _XPath: https://www.w3.org/TR/xpath
.. _CSS: https://www.w3.org/TR/selectors


Extracting quotes and authors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now that you know a bit about selection and extraction, let's complete our
spider by writing the code to extract the quotes from the web page.
Each quote in http://quotes.toscrape.com is represented by HTML elements that look
like this:

.. code-block:: html

    <div class="quote">
        <span class="text">“The world as we have created it is a process of our
        thinking. It cannot be changed without changing our thinking.”</span>
        <span>
            by <small class="author">Albert Einstein</small>
            <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <a class="tag" href="/tag/change/page/1/">change</a>
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            <a class="tag" href="/tag/world/page/1/">world</a>
        </div>
    </div>

Let's open up scrapy shell and play a bit to find out how to extract the data
we want::

    $ scrapy shell 'http://quotes.toscrape.com'

We get a list of selectors for the quote HTML elements with::
>>> response.css("div.quote")
Each of the selectors returned by the query above allows us to run further
queries over their sub-elements. Let's assign the first selector to a
variable, so that we can run our CSS selectors directly on a particular quote::

    >>> quote = response.css("div.quote")[0]

Now, let's extract ``title``, ``author`` and the ``tags`` from that quote
using the ``quote`` object we just created::

    >>> title = quote.css("span.text::text").extract_first()
    >>> title
    '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
    >>> author = quote.css("small.author::text").extract_first()
    >>> author
    'Albert Einstein'

Given that the tags are a list of strings, we can use the ``.extract()`` method
to get all of them::

    >>> tags = quote.css("div.tags a.tag::text").extract()
    >>> tags
    ['change', 'deep-thoughts', 'thinking', 'world']

Having figured out how to extract each bit, we can now iterate over all the
quote elements and put them together into a Python dictionary::

    >>> for quote in response.css("div.quote"):
    ...     text = quote.css("span.text::text").extract_first()
    ...     author = quote.css("small.author::text").extract_first()
    ...     tags = quote.css("div.tags a.tag::text").extract()
    ...     print(dict(text=text, author=author, tags=tags))
    {'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
    {'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity
    >>>

Extracting data in our spider
------------------------------
Let's get back to our spider. Until now, it doesn't extract any data in
particular, just saves the whole HTML page to a local file. Let's integrate the
extraction logic above into our spider.
A Scrapy spider typically generates many dictionaries containing the data
extracted from the page. To do that, we use the ``yield`` Python keyword
in the callback, as you can see below::

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('span small::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }

If you run this spider, it will output the extracted data with the log::

    2016-09-19 18:57:19 [scrapy] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
    2016-09-19 18:57:19 [scrapy] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

.. _storing-data:

Storing the scraped data
========================

The simplest way to store the scraped data is by using :ref:`Feed exports
<topics-feed-exports>`, with the following command::

    scrapy crawl quotes -o quotes.json

That will generate a ``quotes.json`` file containing all scraped items,
serialized in `JSON`_.

For historic reasons, Scrapy appends to a given file instead of overwriting
its contents. If you run this command twice without removing the file
before the second time, you'll end up with a broken JSON file.

You can also use other formats, like `JSON Lines`_::

    scrapy crawl quotes -o quotes.jl

The `JSON Lines`_ format is useful because it's stream-like: you can easily
append new records to it, and it doesn't have the same problem as JSON when you
run the command twice. Also, as each record is a separate line, you can process
big files without having to fit everything in memory; there are tools like
`JQ`_ to help do that at the command line.
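
For instance, because each line is a standalone JSON document, you can process
a file like ``quotes.jl`` one record at a time with nothing more than the
standard library (a small sketch, assuming the export command above has
already been run)::

    import json

    with open('quotes.jl') as f:
        for line in f:
            item = json.loads(line)   # one scraped item per line
            print(item['author'])
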
In small projects (like the one in this tutorial), that should be enough.
However, if you want to perform more complex things with the scraped items, you
can write an :ref:`Item Pipeline <topics-item-pipeline>`. A placeholder file
for Item Pipelines has been set up for you when the project is created, in
``tutorial/pipelines.py``. Though you don't need to implement any item
pipelines if you just want to store the scraped items.
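
To give a rough idea of what goes in that file, an item pipeline is simply a
class with a ``process_item(self, item, spider)`` method that returns the item
(or drops it). The sketch below, for example, would drop quotes that are
missing an author; treat it only as an illustration of the shape of a
pipeline, and note that it would also have to be enabled through the
``ITEM_PIPELINES`` setting::

    from scrapy.exceptions import DropItem


    class RequireAuthorPipeline(object):
        """Hypothetical pipeline: discard items without an 'author' field."""

        def process_item(self, item, spider):
            if not item.get('author'):
                raise DropItem('Missing author in %s' % item)
            return item
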
.. _JSON Lines: http://jsonlines.org
.. _JQ: https://stedolan.github.io/jq

Following links
===============

Let's say, instead of just scraping the stuff from the first two pages
from http://quotes.toscrape.com, you want quotes from all the pages in the website.
Now that you know how to extract data from pages, let's see how to follow links
from them.

The first thing to do is extract the link to the page we want to follow.
Examining our page, we can see there is a link to the next page with the
following markup:

.. code-block:: html

    <ul class="pager">
        <li class="next">
            <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
        </li>
    </ul>

We can try extracting it in the shell::

    >>> response.css('li.next a').extract_first()
    '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gets the anchor element, but we want the attribute ``href``. For that,
Scrapy supports a CSS extension that lets you select the attribute contents,
like this::

    >>> response.css('li.next a::attr(href)').extract_first()
    '/page/2/'

Let's now see our spider, modified to recursively follow the link to the next
page, extracting data from it::

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('span small::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }

            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

Now, after extracting the data, the ``parse()`` method looks for the link to
the next page, builds a full absolute URL using the
:meth:`~scrapy.http.Response.urljoin` method (since the links can be
relative) and yields a new request to the next page, registering itself as
callback to handle the data extraction for the next page and to keep the
crawling going through all the pages.
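
You can see what :meth:`~scrapy.http.Response.urljoin` does by trying it in
the shell session we opened earlier on http://quotes.toscrape.com (just a
quick check, the spider doesn't need this step)::

    >>> response.urljoin('/page/2/')
    'http://quotes.toscrape.com/page/2/'
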
What you see here is Scrapy's mechanism of following links: when you yield
a Request in a callback method, Scrapy will schedule that request to be sent
and register a callback method to be executed when that request finishes.
Using this, you can build complex crawlers that follow links according to rules
you define, and extract different kinds of data depending on the page it's
visiting.
In our example, it creates a sort of loop, following all the links to the next page
until it doesn't find one -- handy for crawling blogs, forums and other sites with
pagination.

More examples and patterns
--------------------------

Here is another spider that illustrates callbacks and following links,
this time for scraping author information::

    import scrapy


    class AuthorSpider(scrapy.Spider):
        name = 'author'

        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # follow links to author pages
            for href in response.css('.author + a::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_author)

            # follow pagination links
            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

        def parse_author(self, response):
            def extract_with_css(query):
                return response.css(query).extract_first().strip()

            yield {
                'name': extract_with_css('h3.author-title::text'),
                'birthdate': extract_with_css('.author-born-date::text'),
                'bio': extract_with_css('.author-description::text'),
            }

This spider will start from the main page; it will follow all the links to the
author pages, calling the ``parse_author`` callback for each of them, and also
the pagination links with the ``parse`` callback as we saw before.

The ``parse_author`` callback defines a helper function to extract and clean up
the data from a CSS query and yields the Python dict with the author data.

Another interesting thing this spider demonstrates is that, even if there are
many quotes from the same author, we don't need to worry about visiting the
same author page multiple times. By default, Scrapy filters out duplicated
requests to URLs already visited, avoiding the problem of hitting servers too
much because of a programming mistake. This can be configured by the setting
:setting:`DUPEFILTER_CLASS`.
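
If you ever did want to revisit a URL on purpose, ``scrapy.Request`` accepts a
``dont_filter=True`` argument that exempts that single request from the
duplicate filter; shown here only as an illustration, the tutorial's spiders
don't need it::

    yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)
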
Hopefully by now you have a good understanding of how to use the mechanism
of following links and callbacks with Scrapy.

As yet another example spider that leverages the mechanism of following links,
check out the :class:`~scrapy.spiders.CrawlSpider` class, a generic
spider that implements a small rules engine that you can write your
crawlers on top of.
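
To give a flavour of what that looks like (a minimal sketch, not something
this tutorial depends on), a :class:`~scrapy.spiders.CrawlSpider` declares its
link-following behaviour as a set of rules instead of hand-written request
yielding::

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class QuotesCrawlSpider(CrawlSpider):
        """Sketch: follow pagination links automatically via a Rule."""
        name = 'quotes_crawl'
        start_urls = ['http://quotes.toscrape.com/']

        rules = (
            # follow every "/page/N/" link and parse each page it finds
            Rule(LinkExtractor(allow=r'/page/\d+/'), callback='parse_page',
                 follow=True),
        )

        def parse_page(self, response):
            for quote in response.css('div.quote'):
                yield {'text': quote.css('span.text::text').extract_first()}
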
Also, a common pattern is to build an item with data from more than one page,
using a :ref:`trick to pass additional data to the callbacks
<topics-request-response-ref-request-callback-arguments>`.
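
One common way to do that (sketched below with a hypothetical
``MultiPageSpider``; the selectors and field names are only placeholders) is
to stash the partially built item in the request's ``meta`` dict and read it
back in the next callback::

    import scrapy


    class MultiPageSpider(scrapy.Spider):
        """Sketch: build one item from data found on two different pages."""
        name = 'multipage'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # collect part of the item here...
            item = {'top_text': response.css('span.text::text').extract_first()}
            # ...then pass it along to the callback for the next page
            request = scrapy.Request(response.urljoin('/page/2/'),
                                     callback=self.parse_more)
            request.meta['item'] = item
            yield request

        def parse_more(self, response):
            item = response.meta['item']
            item['more_text'] = response.css('span.text::text').extract_first()
            yield item
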
Using spider arguments
======================
You can provide command line arguments to your spiders by using the ``-a``
option when running them::

    scrapy crawl quotes -o quotes-humor.json -a tag=humor

These arguments are passed to the Spider's ``__init__`` method and become
spider attributes by default.
In this example, the value provided for the ``tag`` argument will be available
via ``self.tag``. You can use this to make your spider fetch only quotes
with a specific tag, building the URL based on the argument::

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            url = 'http://quotes.toscrape.com/'
            tag = getattr(self, 'tag', None)
            if tag is not None:
                url = url + 'tag/' + tag
            yield scrapy.Request(url, self.parse)

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('span small::text').extract_first(),
                }

            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, self.parse)

If you pass the ``tag=humor`` argument to this spider, you'll notice that it
will only visit URLs from the ``humor`` tag, such as
``http://quotes.toscrape.com/tag/humor``.
You can :ref:`learn more about handling spider arguments here <spiderargs>`.

Next steps
==========

This tutorial covered only the basics of Scrapy, but there are a lot of other
features not mentioned here. Check the :ref:`topics-whatelse` section in the
:ref:`intro-overview` chapter for a quick overview of the most important ones.

You can continue from the section :ref:`section-basics` to learn more about the
command-line tool, spiders, selectors and other things the tutorial hasn't
covered, like modeling the scraped data. If you prefer to play with an example
project, check the :ref:`intro-examples` section.

.. _JSON: https://en.wikipedia.org/wiki/JSON