.. _topics-images:

==================
Downloading Images
==================

.. currentmodule:: scrapy.contrib.pipeline.images

Scrapy provides an :doc:`item pipeline </topics/item-pipeline>` for downloading
images attached to a particular item, for example, when you scrape products and
also want to download their images locally.

This pipeline, called the Images Pipeline and implemented in the
:class:`ImagesPipeline` class, provides a convenient way to download and store
images locally, with some additional features:

* Converting all downloaded images to a common format (JPG) and mode (RGB)
* Avoiding re-downloading images that were downloaded recently
* Generating thumbnails
* Checking image width and height to make sure they meet a minimum constraint

This pipeline also keeps an internal queue of the images that are currently
scheduled for download, and connects any arriving item that contains the same
image to that queue. This avoids downloading the same image more than once when
it's shared by several items.

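The idea of sharing one download between several items can be sketched in plain
Python. This is only an illustration under stated assumptions, not the
pipeline's actual implementation: the ``PendingDownloads`` class, its method
names and the example URLs are all hypothetical.

```python
from collections import defaultdict


class PendingDownloads:
    """Hypothetical sketch: track in-flight image URLs and the items waiting on them."""

    def __init__(self):
        # Maps an image URL to the list of items waiting for that download.
        self.waiting = defaultdict(list)

    def request(self, url, item):
        """Register interest in url; return True only if a new download must start."""
        first_request = url not in self.waiting
        self.waiting[url].append(item)
        return first_request

    def finish(self, url):
        """Mark url as done and return every item that was waiting for it."""
        return self.waiting.pop(url, [])


queue = PendingDownloads()
started = [
    queue.request('http://example.com/a.jpg', 'item-1'),
    queue.request('http://example.com/a.jpg', 'item-2'),  # same image, no new download
    queue.request('http://example.com/b.jpg', 'item-3'),
]
print(started)
print(queue.finish('http://example.com/a.jpg'))
```

Only the first request for a given URL reports that a download should start;
later items simply join the list of waiters for that URL.
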

Using the Images Pipeline
=========================

The typical workflow when using the :class:`ImagesPipeline` goes like this:

1. In a Spider, you scrape an item and put the URLs of its images into a
   pre-defined field, for example ``image_urls``.

2. The item is returned from the spider and goes to the item pipeline.

3. When the item reaches the :class:`ImagesPipeline`, the URLs in the
   ``image_urls`` attribute are scheduled for download using the standard
   Scrapy scheduler and downloader (which means the scheduler and downloader
   middlewares are reused), but with a higher priority, so they are processed
   before other pages are scraped. The item remains "locked" at that particular
   pipeline stage until the images have finished downloading (or fail for some
   reason).

4. When the images finish downloading (or fail for some reason), the item gets
   another field populated with the data of the downloaded images, for example
   ``images``. This attribute is a list of dictionaries containing information
   about each downloaded image, such as the downloaded path and the original
   scraped URL. The images in the ``images`` field retain the order of the
   original ``image_urls`` field, which is useful if you decide to use the
   first image in the list as the primary image.

.. setting:: IMAGES_DIR

The first thing we need to do is tell the pipeline where to store the
downloaded images, by setting :setting:`IMAGES_DIR`::

    IMAGES_DIR = '/path/to/valid/dir'

Then, as seen in the workflow, the pipeline will get the URLs of the images to
download from the item. In order to do this, you must override the
:meth:`~ImagesPipeline.get_media_requests` method and return a Request for each
image URL::

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

Those requests will be processed by the pipeline and, when they have finished
downloading, the results will be sent to the
:meth:`~ImagesPipeline.item_completed` method as a list of two-element tuples.
Each tuple contains the status of the download and information about it, and
the list retains the original order of the requests returned from the
:meth:`~ImagesPipeline.get_media_requests` method::

    results = [(True, 'path#checksum'), ..., (False, Failure)]

The :meth:`~ImagesPipeline.item_completed` method must return the output value
that will be sent to further item pipeline stages, so you must return (or drop)
the item, as in any pipeline.

We will override it to store the resulting image paths (passed in ``results``)
back in the item::

    # XXX: improve this example and add a condition for dropping images
    def item_completed(self, results, item, info):
        item['image_paths'] = [result.split('#')[0] for success, result in results if success]
        return item

So, the complete example of our pipeline looks like this::

    from scrapy.contrib.pipeline.images import ImagesPipeline
    from scrapy.http import Request

    # XXX: improve this example and add a condition for dropping images
    class MyImagesPipeline(ImagesPipeline):

        def get_media_requests(self, item, info):
            # Schedule one download request per image URL found in the item.
            for image_url in item['image_urls']:
                yield Request(image_url)

        def item_completed(self, results, item, info):
            # Keep only successful downloads and strip the '#checksum' suffix.
            item['image_paths'] = [result.split('#')[0] for success, result in results if success]
            return item

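The result handling used in ``item_completed`` can be tried outside Scrapy. The
following is a minimal sketch that assumes the ``(success, result)`` tuple
format shown earlier; the ``Failure`` class is a local stand-in, and the
``extract_image_paths`` helper and the example paths are hypothetical.

```python
class Failure(Exception):
    """Local stand-in for the failure object found in unsuccessful results."""


def extract_image_paths(results):
    """Return the stored paths of the successful downloads, in request order."""
    paths = []
    for success, result in results:
        if success:
            # Drop the optional '#checksum' suffix from the returned path.
            paths.append(result.split('#')[0])
    return paths


results = [
    (True, 'full/a.jpg#0123abcd'),        # hypothetical path with a checksum
    (False, Failure('download failed')),  # failed download, skipped
    (True, 'full/b.jpg'),                 # hypothetical path without a checksum
]
image_paths = extract_image_paths(results)
print(image_paths)
```

An empty list returned by such a helper would be a natural condition for
dropping the item instead of returning it.
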

.. _topics-images-expiration:

Image expiration
----------------

.. setting:: IMAGES_EXPIRES

The Images Pipeline avoids downloading images that were downloaded recently. To
adjust this delay, use the :setting:`IMAGES_EXPIRES` setting, which specifies
the delay in days::

    # 90 days of delay for image expiration
    IMAGES_EXPIRES = 90

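The expiration rule can be sketched as a comparison between the age of the
stored image, in days, and the configured delay. This is an illustration only,
not the pipeline's code: the ``is_expired`` helper and the fixed timestamp are
assumptions made for the example.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def is_expired(last_modified, expires_days, now=None):
    """Return True if an image last stored at last_modified should be re-downloaded.

    last_modified and now are timestamps in seconds since the epoch.
    """
    if now is None:
        now = time.time()
    age_days = (now - last_modified) / SECONDS_PER_DAY
    return age_days > expires_days


IMAGES_EXPIRES = 90
now = 1000000000  # fixed "current time" so the example is reproducible
recent = is_expired(now - 10 * SECONDS_PER_DAY, IMAGES_EXPIRES, now)   # 10 days old
stale = is_expired(now - 120 * SECONDS_PER_DAY, IMAGES_EXPIRES, now)   # 120 days old
print(recent, stale)
```
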

.. _topics-images-thumbnails:

Thumbnail generation
--------------------

The Images Pipeline can automatically create thumbnails of the downloaded
images.

In order to use this feature, you must set the :attr:`~ImagesPipeline.THUMBS`
attribute to a tuple of ``(size_name, (width, height))`` tuples.

The `Python Imaging Library`_ is used for thumbnailing, so you need that
library installed.

.. _Python Imaging Library: http://www.pythonware.com/products/pil/

Here are some examples.

Using numeric names::

    THUMBS = (
        ('50', (50, 50)),
        ('110', (110, 110)),
    )

Using textual names::

    THUMBS = (
        ('small', (50, 50)),
        ('big', (270, 270)),
    )

When you use this feature, the Images Pipeline will create thumbnails of each
specified size with this format::

    IMAGES_DIR/thumbs/<image_id>/<size_name>.jpg

Where:

* ``<image_id>`` is the `SHA1 hash`_ of the image URL
* ``<size_name>`` is the one specified in the ``THUMBS`` attribute

.. _SHA1 hash: http://en.wikipedia.org/wiki/SHA_hash_functions

Example of thumbnail paths for the numeric size names ``50``, ``110`` and
``270``::

    IMAGES_DIR/thumbs/63bbfea82b8880ed33cdb762aa11fab722a90a24/50.jpg
    IMAGES_DIR/thumbs/63bbfea82b8880ed33cdb762aa11fab722a90a24/110.jpg
    IMAGES_DIR/thumbs/63bbfea82b8880ed33cdb762aa11fab722a90a24/270.jpg

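The path format above can be reproduced with the standard library. This is a
rough sketch: the ``thumb_path`` helper and the example URL are hypothetical,
and it assumes the URL is hashed as UTF-8 bytes exactly as given, which may
differ from what the pipeline does internally (for example, if the URL is
normalized first).

```python
import hashlib


def thumb_path(images_dir, image_url, size_name):
    """Build IMAGES_DIR/thumbs/<image_id>/<size_name>.jpg for one thumbnail."""
    # Assumption: <image_id> is the SHA1 hex digest of the URL's UTF-8 bytes.
    image_id = hashlib.sha1(image_url.encode('utf-8')).hexdigest()
    return '%s/thumbs/%s/%s.jpg' % (images_dir, image_id, size_name)


THUMBS = (
    ('small', (50, 50)),
    ('big', (270, 270)),
)

# One path per configured size name, all sharing the same <image_id> directory.
paths = [
    thumb_path('/path/to/valid/dir', 'http://example.com/image.jpg', size_name)
    for size_name, _size in THUMBS
]
for path in paths:
    print(path)
```
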

.. _topics-images-size:

Checking image size
-------------------

.. setting:: IMAGES_MIN_HEIGHT

.. setting:: IMAGES_MIN_WIDTH

You can drop images that are too small by specifying the minimum allowed size
in the :setting:`IMAGES_MIN_HEIGHT` and :setting:`IMAGES_MIN_WIDTH` settings.

For example::

    IMAGES_MIN_HEIGHT = 110
    IMAGES_MIN_WIDTH = 110

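The size constraint amounts to comparing each dimension against its minimum. A
minimal sketch follows, assuming that an image exactly at the minimum size is
accepted (the source does not say whether the comparison is inclusive); the
``is_large_enough`` helper is hypothetical.

```python
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110


def is_large_enough(width, height, min_width=0, min_height=0):
    """Return True if both dimensions meet their minimum (inclusive, by assumption)."""
    return width >= min_width and height >= min_height


# (width, height) pairs for three hypothetical images.
sizes = [(640, 480), (110, 110), (300, 90)]
kept = [
    size for size in sizes
    if is_large_enough(size[0], size[1], IMAGES_MIN_WIDTH, IMAGES_MIN_HEIGHT)
]
print(kept)
```
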

.. _ref-images:

API Reference
=============

.. module:: scrapy.contrib.pipeline.images
   :synopsis: Images Pipeline

ImagesPipeline
--------------

.. class:: ImagesPipeline

    A pipeline to download images attached to items, for example product images.

    To enable this pipeline you must set :setting:`IMAGES_DIR` to a valid
    directory that will be used for storing the downloaded images.

    .. method:: store_image(key, image, buf, info)

        Override this method with specific code to persist an image.

        This method is used to persist the full image and any defined
        thumbnails, one at a time.

        The return value is ignored.

    .. method:: stat_key(key, info)

        Override this method with specific code to stat an image.

        This method should return a dictionary with two keys:

        * ``last_modified``: the last modification time in seconds since the epoch
        * ``checksum``: the md5sum of the content of the stored image, if found

        If an exception is raised or ``last_modified`` is ``None``, then the image
        will be re-downloaded.

        If the difference in days between ``last_modified`` and now is greater than
        the :setting:`IMAGES_EXPIRES` setting, then the image will be re-downloaded.

        The checksum value is appended to the returned image path after a hash sign
        (``#``). If ``checksum`` is ``None``, nothing is appended, not even the
        hash sign.

    .. method:: get_media_requests(item, info)

        Return a list of Request objects to download images for this item.

        Must return ``None`` or an iterable.

        By default it returns ``None`` (no images to download).

    .. method:: item_completed(results, item, info)

        Method called when all image requests for a single item have been
        downloaded (or failed).

        The output of this method is used as the output of the Images Pipeline
        stage.

        This method typically returns the item itself or raises a
        :exc:`~scrapy.core.exceptions.DropItem` exception.

        By default, it returns the item.

    .. attribute:: THUMBS

        Thumbnail generation configuration, see :ref:`topics-images-thumbnails`.

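The dictionary described for ``stat_key`` and the ``path#checksum`` convention
can be illustrated for a file on local disk. This is a sketch under
assumptions, not the pipeline's storage code: the ``stat_file`` and
``path_with_checksum`` helpers, the temporary file and the ``full/example.jpg``
path are all hypothetical.

```python
import hashlib
import os
import tempfile


def stat_file(path):
    """Return a dictionary with the two keys described for stat_key."""
    with open(path, 'rb') as f:
        checksum = hashlib.md5(f.read()).hexdigest()
    return {
        'last_modified': os.path.getmtime(path),  # seconds since the epoch
        'checksum': checksum,                     # md5 of the stored content
    }


def path_with_checksum(path, checksum):
    """Append '#checksum' to path, or nothing at all when checksum is None."""
    if checksum is None:
        return path
    return '%s#%s' % (path, checksum)


# Write a small throwaway file to stand in for a stored image.
with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as tmp:
    tmp.write(b'fake image bytes')
info = stat_file(tmp.name)
os.remove(tmp.name)

print(path_with_checksum('full/example.jpg', info['checksum']))
print(path_with_checksum('full/example.jpg', None))
```
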