1
0
mirror of https://github.com/scrapy/scrapy.git synced 2025-02-25 09:03:57 +00:00
scrapy/docs/experimental/images.rst

264 lines
8.4 KiB
ReStructuredText
Raw Normal View History

2009-08-18 09:35:32 -03:00
.. _topics-images:
2009-09-01 21:07:47 -03:00
==================
Downloading Images
==================
.. currentmodule:: scrapy.contrib.pipeline.images
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
Scrapy provides an :doc:`item pipeline </topics/item-pipeline>` for downloading
images attached to a particular item. For example, when you scrape products and
also want to download their images locally.
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
This pipeline, called the Images Pipeline and implemented in the
:class:`ImagesPipeline` class, provides a convenient way for
downloading and storing images locally with some additional features:
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
* Convert all downloaded images to a common format (JPG) and mode (RGB)
* Avoid re-downloading images which were downloaded recently
* Thumbnail generation
* Check images width/height to make sure they meet a minimum constraint
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
This pipeline also keeps an internal queue of those images which are currently
being scheduled for download, and connects those items that arrive containing
the same image, to that queue. This avoids downloading the same image more than
once when it's shared by several items.
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
Using the Images Pipeline
=========================
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
The typical workflow, when using the :class:`ImagesPipeline` goes like
this:
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
1. In a Spider, you scrape an item and put the URLs of its images into a
pre-defined field, for example ``image_urls``.
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
2. The item is returned from the spider and goes to the item pipeline.
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
3. When the item reaches the :class:`ImagesPipeline`, the URLs in the
``image_urls`` attribute are scheduled for download using the standard
Scrapy scheduler and downloader (which means the scheduler and downloader
middlewares are reused), but higher priority to process them before other
pages to scrape. The item remains "locked" at that particular pipeline stage
until the images have finish downloading (or fail for some reason).
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
4. When the images finish downloading (or fail for some reason) the images gets
another field populated with the data of the images downloaded, for example,
``images``. This attribute is a list of dictionaries containing information
about the image downloaded, such as the downloaded path, and the original
scraped url. This images in the list of the ``images`` field retains the
same order of the original ``image_urls`` field, which is useful if you
decide to use the first image in the list as the primary image.
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
.. setting:: IMAGES_DIR
2009-08-18 09:35:32 -03:00
The first thing we need to do is tell the pipeline where to store the
2009-09-01 21:07:47 -03:00
downloaded images, by setting :setting:`IMAGES_DIR`::
2009-08-18 09:35:32 -03:00
IMAGES_DIR = '/path/to/valid/dir'
Then, as seen on the workflow, the pipeline will get the URLs of the images to
download from the item. In order to do this, you must override the
2009-09-01 21:07:47 -03:00
:meth:`~ImagesPipeline.get_media_requests` method and return a Request for each
image URL::
2009-08-18 09:35:32 -03:00
def get_media_requests(self, item, info):
for image_url in item['image_urls']:
2009-09-01 21:07:47 -03:00
yield Request(image_url)
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
Those requests will be processed by the pipeline, and they have finished
downloading the results will be sent to the
:meth:`~ImagesPipeline.item_completed` method, as a list of dictionaries. Each
dictionary will contain status and information about the download, and the list
of dictionaries will retain the original order of the requests returned from
the :meth:`~ImagesPipeline.get_media_requests` method::
2009-08-18 09:35:32 -03:00
results = [(True, 'path#checksum'), ..., (False, Failure)]
2009-09-01 21:07:47 -03:00
There is one additional method: :meth:`~ImagesPipeline.item_completed` which
must return the output value that will be sent to further item pipeline stages,
so you must return (or drop) the item as in any pipeline.
2009-08-18 09:35:32 -03:00
We will override it to store the resulting image paths (passed in results) back
in the item::
2009-09-01 21:07:47 -03:00
# XXX: improve this example and add a condition for dropping images
2009-08-18 09:35:32 -03:00
def item_completed(self, results, item, info):
item['image_paths'] = [result.split('#')[0] for succes, result in results if succes]
return item
So, the complete example of our pipeline looks like this::
from scrapy.contrib.pipeline.images import ImagesPipeline
2009-09-01 21:07:47 -03:00
# XXX: improve this example and add a condition for dropping images
2009-08-18 09:35:32 -03:00
class MyImagesPipeline(ImagesPipeline):
def get_media_requests(self, item, info):
for image_url in item['image_urls']:
2009-09-01 21:07:47 -03:00
yield Request(image_url)
2009-08-18 09:35:32 -03:00
def item_completed(self, results, item, info):
item['image_paths'] = [result.split('#')[0] for succes, result in results if succes]
return item
.. _topics-images-expiration:
Image expiration
-----------------
2009-09-01 21:07:47 -03:00
.. setting:: IMAGES_EXPIRES
The Image Pipeline avoids downloading images that were downloaded recently. To
adjust this delay use the :setting:`IMAGES_EXPIRES` setting, which specifies
the delay in days::
# 90 days of delay for image expiration
IMAGES_EXPIRES = 90
2009-08-18 09:35:32 -03:00
.. _topics-images-thumbnails:
2009-09-01 21:07:47 -03:00
Thumbnail generation
--------------------
The Images Pipeline can automatically create thumbnails of the downloaded
images.
In order use this feature you must set the :attr:`~ImagesPipeline.THUMBS`
to a tuple of ``(size_name, (width, height))`` tuples.
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
The `Python Imaging Library`_ is used for thumbnailing, so you need that
library.
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
.. _Python Imaging Library: http://www.pythonware.com/products/pil/
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
Here are some examples examples.
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
Using numeric names::
2009-08-18 09:35:32 -03:00
THUMBS = (
('50', (50, 50)),
('110', (110, 110)),
)
2009-09-01 21:07:47 -03:00
Using textual names::
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
THUMBS = (
('small', (50, 50)),
('big', (270, 270)),
)
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
When you use this feature, the Images Pipeline will create thumbnails of the
each specified size with this format::
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
IMAGES_DIR/thumbs/<image_id>/<size_name>.jpg
Where:
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
* ``<image_id>`` is the `SHA1 hash`_ of the image url
* and ``<size_name>`` is the one specified in ``THUMBS`` attribute
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
.. _SHA1 hash: http://en.wikipedia.org/wiki/SHA_hash_functions
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
Example with previous THUMB attribute::
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
IMAGES_DIR/thumbs/63bbfea82b8880ed33cdb762aa11fab722a90a24/50.jpg
IMAGES_DIR/thumbs/63bbfea82b8880ed33cdb762aa11fab722a90a24/110.jpg
IMAGES_DIR/thumbs/63bbfea82b8880ed33cdb762aa11fab722a90a24/270.jpg
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
.. _topics-images-size:
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
Checking image size
-------------------
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
.. setting:: IMAGES_MIN_HEIGHT
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
.. setting:: IMAGES_MIN_WIDTH
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
You can drop images which are too small, by specifying the minimum allowed size
in the :setting:`IMAGES_MIN_HEIGHT` and :setting:`IMAGES_MIN_WIDTH` settings.
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
For example::
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
.. _ref-images:
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
API Reference
=============
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
.. module:: scrapy.contrib.pipeline.images
:synopsis: Images Pipeline
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
ImagesPipeline
--------------
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
.. class:: ImagesPipeline
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
A pipeline to download images attached to items, for example product images.
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
To enable this pipeline you must set :setting:`IMAGES_DIR` to a valid
directory that will be used for storing the downloaded images.
2009-08-18 09:35:32 -03:00
.. method:: store_image(key, image, buf, info)
2009-09-01 21:07:47 -03:00
2009-08-18 09:35:32 -03:00
Override this method with specific code to persist an image.
This method is used to persist the full image and any defined
thumbnail, one a time.
Return value is ignored.
.. method:: stat_key(key, info)
2009-09-01 21:07:47 -03:00
2009-08-18 09:35:32 -03:00
Override this method with specific code to stat an image.
This method should return and dictionary with two parameters:
* ``last_modified``: the last modification time in seconds since the epoch
* ``checksum``: the md5sum of the content of the stored image if found
If an exception is raised or ``last_modified`` is ``None``, then the image
will be re-downloaded.
If the difference in days between last_modified and now is greater than
:setting:`IMAGES_EXPIRES` settings, then the image will be re-downloaded
The checksum value is appended to returned image path after a hash sign
(#), if ``checksum`` is ``None``, then nothing is appended including the
hash sign.
.. method:: get_media_requests(item, info)
2009-09-01 21:07:47 -03:00
Return a list of Request objects to download images for this item.
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
Must return ``None`` or an iterable.
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
By default it returns ``None`` (no images to download).
2009-08-18 09:35:32 -03:00
.. method:: item_completed(results, item, info)
2009-09-01 21:07:47 -03:00
Method called when all image requests for a single item have been
downloaded (or failed).
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
The output of this method is used as the output of the Image Pipeline
stage.
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
This method typically returns the item itself or raises a
:exc:`~scrapy.core.exceptions.DropItem` exception.
2009-08-18 09:35:32 -03:00
2009-09-01 21:07:47 -03:00
By default, it returns the item.
.. attribute:: THUMBS
Thumbnail generation configuration, see :ref:`topics-images-thumbnails`.
2009-08-18 09:35:32 -03:00