
Completing the data extraction section

This commit is contained in:
Valdir Stumm Junior 2016-09-19 19:19:31 -03:00
parent 2a409d1d95
commit fee07835f2


@ -106,8 +106,8 @@ and defines some attributes and methods:
the scraped data as dicts and also finding new URLs to
follow and creating new requests (:class:`~scrapy.http.Request`) from them.
How to run your spider
----------------------
How to run our spider
---------------------
To put our spider to work, go to the project's top level directory and run::
@ -148,12 +148,14 @@ objects and calls the ``parse`` callback method passing the response as
argument.
Simplifying your spider
-----------------------
Instead of defining the :meth:`~scrapy.spiders.Spider.start_requests` method
generating :class:`scrapy.Request <scrapy.http.Request>`
objects from URLs, you can just put those URLs in the
:attr:`~scrapy.spiders.Spider.start_urls` attribute::
A shortcut to the start_requests method
---------------------------------------
Instead of implementing a :meth:`~scrapy.spiders.Spider.start_requests` method
that generates :class:`scrapy.Request <scrapy.http.Request>` objects from URLs,
you can just define a :attr:`~scrapy.spiders.Spider.start_urls` class attribute
with a list of URLs. This list will then be used by the default implementation
of :meth:`~scrapy.spiders.Spider.start_requests` to create the initial requests
for your spider::
import scrapy
@ -174,7 +176,8 @@ objects from URLs, you can just put those URLs in the
The :meth:`~scrapy.spiders.Spider.parse` method will be called to handle
each of the requests for those URLs, even though we haven't explicitly told
Scrapy to do so. This happens because :meth:`~scrapy.spiders.Spider.parse`
is Scrapy's default callback method.
is Scrapy's default callback method, which is called for any request that has
been generated without an explicitly assigned callback to handle it.
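
To make that contrast concrete, here is a small illustrative sketch (the
``ExampleSpider`` and ``parse_page`` names are made up, not part of the project
we are building): requests created from ``start_urls`` fall back to ``parse``,
while a request we create ourselves can name its own callback::

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ['http://quotes.toscrape.com/page/1/']

        def parse(self, response):
            # Called for the requests built from start_urls: no callback was
            # assigned to them, so Scrapy falls back to the default, parse().
            self.logger.info("Default callback handled %s", response.url)
            # For requests we create ourselves we can pick a different callback.
            yield scrapy.Request('http://quotes.toscrape.com/page/2/',
                                 callback=self.parse_page)

        def parse_page(self, response):
            # Called only because we explicitly assigned it above.
            self.logger.info("Explicit callback handled %s", response.url)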
Extracting data
@ -293,7 +296,104 @@ Extraction wrap-up
Now that you know a bit about selection and extraction, let's complete our
spider by writing the code to extract the quotes from the webpage.
TODO: show how to extract quotes and integrate spider code here.
Each quote in http://quotes.toscrape.com is represented by HTML code that looks
like this::
<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
Let's open up scrapy shell and play a bit to find out how to extract the data
we want::
$ scrapy shell http://quotes.toscrape.com
We get a list of selectors for the quotes by running::
>>> response.css("div.quote")
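
What comes back is a list-like object of selectors, so, for instance, you can
check how many quote blocks were matched; at the time of writing the site shows
ten quotes per page, so you should see something like::

>>> len(response.css("div.quote"))
10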
Each of the selectors returned by the query above allows us to run further
queries over the quote it selects. Let's assign the first selector to a
variable, so that we can run our CSS selectors directly on a particular quote::
>>> quote = response.css("div.quote")[0]
Now, let's extract the ``title``, ``author`` and ``tags`` from that quote
using the ``quote`` object we just created::
>>> title = quote.css("span.text ::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author ::text").extract_first()
>>> author
'Albert Einstein'
Given that the tags are a list of strings, we can use the ``.extract()`` method
to get all of them::
>>> tags = quote.css("div.tags a.tag ::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']
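
It is also worth knowing that ``.extract_first()`` does not raise an error when
nothing matches; it simply returns ``None``. The ``span.nonexistent`` selector
below is just a made-up example of a query that matches nothing::

>>> quote.css("span.nonexistent ::text").extract_first() is None
True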
Now, we can iterate over all the quotes on the page and use the CSS selectors
we defined to extract the data we want::
>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text ::text").extract_first()
...     author = quote.css("small.author ::text").extract_first()
...     tags = quote.css("div.tags a.tag ::text").extract()
...     print("{} - {} - {}".format(text, author, tags))
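
If you prefer XPath, the same loop can be written with the ``.xpath()`` method
instead; this is only an equivalent sketch, assuming the class attributes are
exactly as in the HTML snippet above, and the rest of the tutorial keeps using
CSS selectors::

>>> for quote in response.xpath('//div[@class="quote"]'):
...     text = quote.xpath('.//span[@class="text"]/text()').extract_first()
...     author = quote.xpath('.//small[@class="author"]/text()').extract_first()
...     tags = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract()
...     print("{} - {} - {}".format(text, author, tags))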
Extracting data in our spider
------------------------------
So far, our spider doesn't extract any data in particular; it just saves the
whole HTML page to a local file. Let's now integrate the extraction logic above
into our spider.
A Scrapy spider typically generates many dictionaries containing the data
extracted from the page. To do that, we use the ``yield`` Python keyword, as
you can see below::
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css("div.tags a.tag ::text").extract(),
            }
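
You can run it the same way as before, from the project's top level directory;
the ``quotes`` argument comes from the spider's ``name`` attribute::

$ scrapy crawl quotes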
When you run this spider, it will output the extracted data along with the log::
2016-09-19 18:57:19 [scrapy] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
:ref:`Later in the tutorial <storing-data>`, we will see how to save this data to a file.
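
As a very quick preview of what that section covers, the ``crawl`` command
accepts an ``-o`` option that dumps the scraped items straight to a file; the
details and the supported formats are explained there::

$ scrapy crawl quotes -o quotes.json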
Following links
@ -450,6 +550,8 @@ If you pass the ``tag=humor`` argument to this spider, you'll notice that it
will only visit URLs from the ``humor`` tag, such as
``http://quotes.toscrape.com/tag/humor``.
.. _storing-data:
Storing the scraped data
========================