.. _topics-images: .. module:: scrapy.contrib.pipeline.images =============== Handling Images =============== In Scrapy, the recommended way of handling image downloads is using the :class:`ImagesPipeline`. This pipeline provides convenient mechanisms to download and store images and also the following features: * Image format normalization (JPG) * Image expiration * Thumbnail creation * Image size checking Using a ImagesPipeline ======================= The typical workflow of working with a :class:`ImagesPipeline` goes like this: 1. In a Spider, you obtain the URLs of the images to be downloaded and store them in an Item. 2. An :class:`ImagesPipeline` process the Item, downloads the images and stores back their resulting paths in the processed Item We assume that if you're here you know how to handle the first part of the workflow (if not, please refer to the tutorial), so let's focus on the second part, using a :class:`ImagesPipeline`. :class:`ImagesPipeline` is a descendant of BaseImagesPipeline which in turn is a descendant of :class:`~scrapy.contrib.pipeline.MediaPipeline`, all of this classes provide overrideable methods, hooks and settings to customize their behaviour. So, for using the :class:`ImagesPipeline` you subclass it, override some methods with custom code and set some required settings. The first thing we need to do is tell the pipeline where to store the downloaded images, so set :setting:`IMAGES_DIR` to a valid directory name that will be used for this purpose:: IMAGES_DIR = '/path/to/valid/dir' Then, as seen on the workflow, the pipeline will get the URLs of the images to download from the item. In order to do this, you must override the :meth:`~scrapy.contrib.pipeline.MediaPipeline.get_media_requests` method and return a Request for each image URL:: def get_media_requests(self, item, info): for image_url in item['image_urls']: yield Request(image_url) Those requests will be processed by the pipeline, downloaded an when completed the processed results will be sent to the :meth:`~scrapy.contrib.pipeline.MediaPipeline.item_completed` method. The results will be a list of tuples, in wich each tuple indicates the sucess of the downloading process and the stored image path concatenated with the checksum of the image :: results = [(True, 'path#checksum'), ..., (False, Failure)] The :meth:`~scrapy.contrib.pipeline.MediaPipeline.item_completed` is also in charge of returning the output value to be used as the output of the pipeline stage, so we must return (or drop) the item as in any pipeline. We will override it to store the resulting image paths (passed in results) back in the item:: def item_completed(self, results, item, info): item['image_paths'] = [result.split('#')[0] for succes, result in results if succes] return item .. note:: This is a simplification of the actual process, it will be described with more detail in upcoming sections. So, the complete example of our pipeline looks like this:: from scrapy.contrib.pipeline.images import ImagesPipeline class MyImagesPipeline(ImagesPipeline): def get_media_requests(self, item, info): for image_url in item['image_urls']: yield Request(image_url) def item_completed(self, results, item, info): item['image_paths'] = [result.split('#')[0] for succes, result in results if succes] return item This is the most basic use of :class:`ImagesPipeline`, see upcoming sections for more details. .. _topics-images-expiration: Image expiration ----------------- XXX .. _topics-images-thumbnails: Creating thumbnails ------------------- As mentioned in the features, :class:`ImagesPipeline` can create thumbnails of the processed images. In order use this feature you must set the :attr:`~BaseImagesPipeline.THUMBS` to a tuple of tuples, in wich each sub-tuple is a pair of thumb_id string and a compatible python image library size (another tuple). See ``thumbnail`` method at http://www.pythonware.com/library/pil/handbook/image.htm. Example:: THUMBS = ( ('50', (50, 50)), ('110', (110, 110)), ('270', (270, 270)) ) When you use this feature, :class:`ImagesPipeline` will create thumbnails of the specified sizes in ``IMAGES_DIR/thumbs//.jpg``, where ```` is the ``sha1`` digest of the url of the image and ```` is the thumb_id string specified in THUMBS attribute. Example with previous THUMB attribute:: IMAGES_DIR/thumbs/image_sha1_digest/50.jpg IMAGES_DIR/thumbs/image_sha1_digest/110.jpg IMAGES_DIR/thumbs/image_sha1_digest/270.jpg .. _topics-images-size: Checking image size ------------------- You can skip the processing of an image if its size is less than a specified one. To use this set :setting:`IMAGES_MIN_HEIGHT` and/or :setting:`IMAGES_MIN_WIDTH` to your likings:: IMAGES_MIN_HEIGHT = 270 IMAGES_MIN_WIDTH = 270 .. _ref-images: Reference ========= ImagesPipeline -------------- .. class:: ImagesPipeline :class:`BaseImagesPipeline` descendant with filesystem support as image's store backend In order to enable this pipeline you must set :setting:`IMAGES_DIR` to a valid dirname that will be used for storing images. BaseImagesPipeline ------------------ .. class:: BaseImagesPipeline :class:`~scrapy.contrib.pipeline.MediaPipeline` descendant that implements image downloading and thumbnail generation logic. This pipeline tries to minimize network transfers and image processing, doing stat of the images and determining if image is new, uptodate or expired. `'new'` images are those that pipeline never processed and needs to be downloaded from supplier site the first time. `'uptodate'` images are the ones that the pipeline processed and are still valid images. `'expired'` images are those that pipeline already processed but the last modification was made long time ago, so a reprocessing is recommended to refresh it in case of change. :setting:`IMAGES_EXPIRES` setting controls the maximun days since an image was modified to consider it `uptodate`. Downloaded images are skipped if sizes aren't greater than :setting:`IMAGES_MIN_WIDTH` and :setting:`IMAGES_MIN_HEIGHT` limit. A proper log messages will be printed. .. attribute:: THUMBS Thumbnail generation configuration, see :ref:`topics-images-thumbnails` .. method:: store_image(key, image, buf, info) Override this method with specific code to persist an image. This method is used to persist the full image and any defined thumbnail, one a time. Return value is ignored. .. method:: stat_key(key, info) Override this method with specific code to stat an image. This method should return and dictionary with two parameters: * ``last_modified``: the last modification time in seconds since the epoch * ``checksum``: the md5sum of the content of the stored image if found If an exception is raised or ``last_modified`` is ``None``, then the image will be re-downloaded. If the difference in days between last_modified and now is greater than :setting:`IMAGES_EXPIRES` settings, then the image will be re-downloaded The checksum value is appended to returned image path after a hash sign (#), if ``checksum`` is ``None``, then nothing is appended including the hash sign. .. module:: scrapy.contrib.pipeline MediaPipeline ------------- .. class:: MediaPipeline Generic pipeline that handles the media associated with an item. .. method:: download(request, info) Defines how to request the download of media. Default gives high priority to media requests and use scheduler, shouldn't be necessary to override. This methods is called only if result for request isn't cached, request fingerprint is used as cache key. .. method:: media_to_download(request, info) Ongoing request hook pre-cache. This method is called every time a media is requested for download, and only once for the same request because return value is cached as media result. Returning a non-None value implies: * the return value is cached and piped into :meth:`item_media_downloaded` or :meth:`item_media_failed` * prevents downloading, this means calling :meth:`download` method. * :meth:`media_downloaded` or :meth:`media_failed` isn't called. .. method:: get_media_requests(item, info) Return a list of Request objects to download for this item. Should return ``None`` or an iterable. Defaults return ``None`` (no media to download) .. method:: media_downloaded(response, request, info) Method called on success download of media request Return value is cached and used as input for :meth:`item_media_downloaded` method. Default implementation returns ``None``. WARNING: returning the response object can eat your memory. .. method:: media_failed(failure, request, info) Method called when media request failed due to any kind of download error. Return value is cached and used as input for :meth:`item_media_failed` method. Default implementation returns same Failure object. .. method:: item_media_downloaded(result, item, request, info) Method to handle result of requested media for item. ``result`` is the return value of :meth:`media_downloaded` hook, or the non-Failure instance returned by :meth:`media_failed` hook. Return value of this method isn't important and is recommended to return ``None``. .. method:: item_media_failed(failure, item, request, info) Method to handle failed result of requested media for item. result is the returned Failure instance of :meth:`media_failed` hook, or Failure instance of an exception raised by :meth:`media_downloaded` hook. Return value of this method isn't important and is recommended to return ``None``. .. method:: item_completed(results, item, info) Method called when all media requests for a single item has returned a result or failure. The return value of this method is used as output of pipeline stage. :meth:`item_completed` can return item itself or raise :exc:`~scrapy.core.exceptions.DropItem` exception. Default returns item