more reviewing and editing, minor restructure, syntax fixes
Commit 125b691102 (parent 8975371a57)
@@ -14,7 +16,9 @@ This tutorial will walk you through these tasks:

1. Creating a new Scrapy project
2. Writing a :ref:`spider <topics-spiders>` to crawl a site and extract data
- 3. Exporting the scraped data using command line
+ 3. Exporting the scraped data using the command line
+ 4. Changing spider to recursively follow links
+ 5. Using spider arguments

Scrapy is written in Python_. If you're new to the language you might want to
start by getting an idea of what the language is like, to get the most out of
@@ -43,7 +45,7 @@ This will create a ``tutorial`` directory with the following contents::

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

-         items.py          # project items file
+         items.py          # project items definition file

        pipelines.py      # project pipelines file
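The command that produces this layout is outside the hunk; for context, and assuming the project name ``tutorial`` used throughout this tutorial, it would be something like::

    scrapy startproject tutorial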
@@ -109,7 +111,8 @@ and defines some attributes and methods:

How to run our spider
---------------------

- To put our spider to work, go to the project's top level directory and run::
+ To put our spider to work, go to the project's top level directory (``cd
+ tutorial``) and run::

    scrapy crawl quotes
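The spider named ``quotes`` that this command runs is defined earlier in the tutorial and is not visible in this hunk. As a rough sketch reconstructed for context (not taken verbatim from this commit), it might look like::

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            for url in urls:
                # Each Request is scheduled by Scrapy; parse() handles the response.
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Derive a filename from the URL and save the raw HTML to disk.
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)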
@@ -141,6 +144,7 @@ for the respective URLs, as our ``parse`` method instructs.

What just happened under the hood?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Scrapy schedules the :class:`scrapy.Request <scrapy.http.Request>` objects
returned by the ``start_requests`` method of the Spider. Upon receiving
a response for each one, it instantiates :class:`scrapy.http.Response`
@@ -173,11 +177,11 @@ for your spider::

            with open(filename, 'wb') as f:
                f.write(response.body)

- The :meth:`~scrapy.spiders.Spider.parse` method will be called to handle
- each of the requests for those URLs, even though we haven't explicitely told
- Scrapy to do so. This happens because :meth:`~scrapy.spiders.Spider.parse`
- is Scrapy's default callback method that is called for any request that have
- been generated with no callback explicitely assigned to handle it.
+ The :meth:`~scrapy.spiders.Spider.parse` method will be called to handle each
+ of the requests for those URLs, even though we haven't explicitly told Scrapy
+ to do so. This happens because :meth:`~scrapy.spiders.Spider.parse` is Scrapy's
+ default callback method, which is called for requests without an explicitly
+ assigned callback.
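To make the contrast concrete, here is a hypothetical spider (the spider name and callback name below are illustrative, not from the tutorial) that assigns a callback explicitly instead of relying on the default ``parse``::

    import scrapy


    class ExplicitCallbackSpider(scrapy.Spider):
        name = "callback-demo"  # hypothetical name, for illustration only

        def start_requests(self):
            # Because callback= is given explicitly, parse_titles() handles
            # the response instead of the default parse() method.
            yield scrapy.Request('http://quotes.toscrape.com/page/1/',
                                 callback=self.parse_titles)

        def parse_titles(self, response):
            self.log('Page title: %s' % response.css('title::text').extract_first())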
Extracting data
@@ -224,10 +228,14 @@ To extract the text from the title above, you can do::

There are two things to note here: one is that we've added ``::text`` to the
CSS query, to mean that we want to select the text from inside the title element.
+ If we don't specify ``::text``, we'd get the HTML tags::

- The other is that the result of calling ``.extract()`` is a list, because we're
- dealing with an instance :class:`~scrapy.selector.SelectorList`. When you know
- you just want the first result, as in this case, you can do::

+     >>> response.css('title').extract()
+     [u'<title>Quotes to Scrape</title>']

+ The other thing is that the result of calling ``.extract()`` is a list, because
+ we're dealing with an instance of :class:`~scrapy.selector.SelectorList`. When
+ you know you just want the first result, as in this case, you can do::

    >>> response.css('title::text').extract_first()
    u'Quotes to Scrape'
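A related point, not part of this hunk but useful when editing this passage: ``.extract_first()`` also accepts a ``default`` value for the case where the selector matches nothing (the selector below is deliberately one that matches no element)::

    >>> response.css('img.does-not-exist::text').extract_first(default='not-found')
    'not-found'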
@@ -284,22 +292,24 @@ that contains the text "Next Page"**. This makes XPath very fitting to the task
of scraping, and we encourage you to learn XPath even if you already know how to
construct CSS selectors: it will make scraping much easier.

- We won't cover much of XPath here. To learn more about XPath, we recommend `this tutorial to learn
- XPath through examples <http://zvon.org/comp/r/tut-XPath_1.html>`_, and `this
- tutorial to learn "how to think in XPath"
- <http://plasmasturm.org/log/xpath101/>`_.
+ We won't cover much of XPath here. To learn more about XPath, we recommend
+ `this tutorial to learn XPath through examples
+ <http://zvon.org/comp/r/tut-XPath_1.html>`_, and `this tutorial to learn "how
+ to think in XPath" <http://plasmasturm.org/log/xpath101/>`_.

.. _XPath: https://www.w3.org/TR/xpath
.. _CSS: https://www.w3.org/TR/selectors
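For comparison with the CSS examples earlier in this section, the same title query can be written with XPath (reconstructed for context; the exact snippet sits outside this hunk)::

    >>> response.xpath('//title/text()').extract_first()
    u'Quotes to Scrape'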
- Extraction wrap-up
- ^^^^^^^^^^^^^^^^^^
+ Extracting quotes and authors
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now that you know a bit about selection and extraction, let's complete our
spider by writing the code to extract the quotes from the webpage.

- Each quote in http://quotes.toscrape.com is represented by HTML code that looks
- like this::
+ Each quote in http://quotes.toscrape.com is represented by HTML elements that look
+ like this:

+ .. code-block:: html

    <div class="quote">
        <span class="text">“The world as we have created it is a process of our
@@ -322,12 +332,12 @@ we want::

    $ scrapy shell http://quotes.toscrape.com

- We get a list of selectors to the quotes using::
+ We get a list of selectors for the quote HTML elements with::

    >>> response.css("div.quote")

Each of the selectors returned by the query above allows us to run further
- queries over the quotes itselves. Let's assign the first selector to a
+ queries over their sub-elements. Let's assign the first selector to a
variable, so that we can run our CSS selectors directly on a particular quote::

    >>> quote = response.css("div.quote")[0]
@@ -342,33 +352,33 @@ using the ``quote`` object we just created::

    >>> author
    'Albert Einstein'

- Given that the tags is a list of strings, we can use the ``.extract()`` method
+ Given that the tags are a list of strings, we can use the ``.extract()`` method
to get all of them::

    >>> tags = quote.css("div.tags a.tag ::text").extract()
    >>> tags
    ['change', 'deep-thoughts', 'thinking', 'world']

- Now, we can iterate over all the quotes in the page and use the CSS selectors
- we defined to extract data::
+ Having figured out how to extract each bit, we can now iterate over all the
+ quote elements and put them together into a Python dictionary::

    >>> for quote in response.css("div.quote"):
    ...     text = quote.css("span.text ::text").extract_first()
    ...     author = quote.css("small.author ::text").extract_first()
    ...     tags = quote.css("div.tags a.tag ::text").extract()
-     ...     print("{} - {} - {}".format(text, author, tags))
+     ...     print(dict(text=text, author=author, tags=tags))
Extracting data in our spider
------------------------------

- Until now, the spider we built doesn't extract any data in particular. I just
- saves the whole HTML page to a local file. Now, let's integrate the extraction
- logic above in our spider.
+ Let's get back to our spider. Until now, it doesn't extract any data in
+ particular, just saves the whole HTML page to a local file. Let's integrate the
+ extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data
- extracted from the page. To do that, we use the ``yield`` Python keyword, as
- you can see below::
+ extracted from the page. To do that, we use the ``yield`` Python keyword
+ in the callback, as you can see below::

    import scrapy
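The hunk is cut off right after ``import scrapy``, so the revised spider body is not visible here. Based on the selectors and the dictionary shown above, a plausible sketch of the complete callback (a reconstruction, not necessarily the exact code from this commit) is::

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        def parse(self, response):
            # Yield one dictionary per quote, using the CSS selectors
            # worked out in the shell session above.
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }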
@@ -395,7 +405,38 @@ If you run this spider, it will output the extracted data with the log::

    2016-09-19 18:57:19 [scrapy] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

+ :ref:`Later in the tutorial <storing-data>`, we will see how to save this data to a file.

+ .. _storing-data:

+ Storing the scraped data
+ ========================

+ The simplest way to store the scraped data is by using :ref:`Feed exports
+ <topics-feed-exports>`, with the following command::

+     scrapy crawl quotes -o items.json

+ That will generate an ``items.json`` file containing all scraped items,
+ serialized in `JSON`_.

+ You could've also used other formats, like `JSON Lines`_::

+     scrapy crawl quotes -o items.jl

+ The `JSON Lines`_ format is useful because it's stream-like: you can easily
+ append new records to it. As each record is a separate line, you can also
+ process big files without having to fit everything in memory; there are tools
+ like `JQ`_ to help do that at the command line.

+ In small projects (like the one in this tutorial), that should be enough.
+ However, if you want to perform more complex things with the scraped items, you
+ can write an :ref:`Item Pipeline <topics-item-pipeline>`. As with Items, a
+ placeholder file for Item Pipelines has been set up for you when the project is
+ created, in ``tutorial/pipelines.py``. Though you don't need to implement any item
+ pipelines if you just want to store the scraped items.

+ .. _JSON Lines: http://jsonlines.org
+ .. _JQ: https://stedolan.github.io/jq
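To illustrate the stream-like property described above, here is a small sketch (the file name assumes the ``items.jl`` export, and the keys assume the ``text``/``author``/``tags`` dictionaries from this tutorial) that processes the export one record at a time::

    import json

    with open('items.jl') as f:
        for line in f:
            # Each line is an independent JSON document, so the file never
            # has to be loaded into memory as a whole.
            item = json.loads(line)
            print('%s: %d tags' % (item['author'], len(item['tags'])))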
Following links
@@ -511,17 +552,20 @@ much because of a programming mistake. This can be configured by the setting
spider that implements a small rules engine that you can use to write your
crawlers on top of it.

- Adding a spider argument
- ========================
+ Using spider arguments
+ ======================

You can provide command line arguments to your spiders by using the ``-a``
option when running them::

    scrapy crawl quotes -o items.json -a tag=humor

+ These arguments are passed to the Spider's ``__init__`` method and become
+ spider attributes by default.

In this example, the value provided for the ``tag`` argument will be available
- via a spider attribute. Using this, you could make your spider get only quotes
- tagged with a specific tag, building the URL based on the argument::
+ via ``self.tag``. You can use this to make your spider fetch only quotes
+ with a specific tag, building the URL based on the argument::

    import scrapy
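The hunk again stops at ``import scrapy``; a sketch of how the ``tag`` argument can drive the start URL (a reconstruction consistent with the description above, not necessarily the exact committed code)::

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            url = 'http://quotes.toscrape.com/'
            # -a tag=humor becomes self.tag; fall back to all quotes if unset.
            tag = getattr(self, 'tag', None)
            if tag is not None:
                url = url + 'tag/' + tag
            yield scrapy.Request(url, self.parse)

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                }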
@@ -553,25 +597,7 @@ If you pass the ``tag=humor`` argument to this spider, you'll notice that it
will only visit URLs from the ``humor`` tag, such as
``http://quotes.toscrape.com/tag/humor``.

- .. _storing-data:

- Storing the scraped data
- ========================

- The simplest way to store the scraped data is by using :ref:`Feed exports
- <topics-feed-exports>`, with the following command::

-     scrapy crawl quotes -o items.json

- That will generate an ``items.json`` file containing all scraped items,
- serialized in `JSON`_.

- In small projects (like the one in this tutorial), that should be enough.
- However, if you want to perform more complex things with the scraped items, you
- can write an :ref:`Item Pipeline <topics-item-pipeline>`. As with Items, a
- placeholder file for Item Pipelines has been set up for you when the project is
- created, in ``tutorial/pipelines.py``. Though you don't need to implement any item
- pipelines if you just want to store the scraped items.
+ You can :ref:`learn more about handling spider arguments here <spiderargs>`.

Next steps
==========
@@ -580,9 +606,10 @@ This tutorial covered only the basics of Scrapy, but there's a lot of other
features not mentioned here. Check the :ref:`topics-whatelse` section in
:ref:`intro-overview` chapter for a quick overview of the most important ones.

- Then, we recommend you continue by playing with an example project (see
- :ref:`intro-examples`), and then continue with the section
- :ref:`section-basics`.
+ You can continue from the section :ref:`section-basics` to learn more about the
+ command-line tool, spiders and other things the tutorial hasn't covered, like
+ modeling the scraped data. If you prefer to play with an example project, check
+ the :ref:`intro-examples` section.

.. _JSON: https://en.wikipedia.org/wiki/JSON
.. _dirbot: https://github.com/scrapy/dirbot