mirror of
https://github.com/scrapy/scrapy.git
synced 2025-03-01 10:36:43 +00:00
334 lines
10 KiB
ReStructuredText
334 lines
10 KiB
ReStructuredText
.. _topics-images:
|
|
|
|
.. module:: scrapy.contrib.pipeline.images
|
|
|
|
===============
|
|
Handling Images
|
|
===============
|
|
|
|
In Scrapy, the recommended way of handling image downloads is using the
|
|
:class:`ImagesPipeline`.
|
|
|
|
This pipeline provides convenient mechanisms to download and store images and
|
|
also the following features:
|
|
|
|
* Image format normalization (JPG)
|
|
* Image expiration
|
|
* Thumbnail creation
|
|
* Image size checking
|
|
|
|
|
|
Using a ImagesPipeline
|
|
=======================
|
|
|
|
The typical workflow of working with a :class:`ImagesPipeline` goes like this:
|
|
|
|
1. In a Spider, you obtain the URLs of the images to be downloaded and store
|
|
them in an Item.
|
|
|
|
2. An :class:`ImagesPipeline` process the Item, downloads the images and stores
|
|
back their resulting paths in the processed Item
|
|
|
|
We assume that if you're here you know how to handle the first part of the
|
|
workflow (if not, please refer to the tutorial), so let's focus on the second
|
|
part, using a :class:`ImagesPipeline`.
|
|
|
|
:class:`ImagesPipeline` is a descendant of BaseImagesPipeline which in turn is
|
|
a descendant of :class:`~scrapy.contrib.pipeline.MediaPipeline`, all of this classes provide
|
|
overrideable methods, hooks and settings to customize their behaviour.
|
|
|
|
So, for using the :class:`ImagesPipeline` you subclass it, override some
|
|
methods with custom code and set some required settings.
|
|
|
|
The first thing we need to do is tell the pipeline where to store the
|
|
downloaded images, so set :setting:`IMAGES_DIR` to a valid directory name that
|
|
will be used for this purpose::
|
|
|
|
IMAGES_DIR = '/path/to/valid/dir'
|
|
|
|
Then, as seen on the workflow, the pipeline will get the URLs of the images to
|
|
download from the item. In order to do this, you must override the
|
|
:meth:`~scrapy.contrib.pipeline.MediaPipeline.get_media_requests` method and
|
|
return a Request for each image URL::
|
|
|
|
def get_media_requests(self, item, info):
|
|
for image_url in item['image_urls']:
|
|
yield Request(image_url)
|
|
|
|
Those requests will be processed by the pipeline, downloaded an when completed
|
|
the processed results will be sent to the
|
|
:meth:`~scrapy.contrib.pipeline.MediaPipeline.item_completed` method.
|
|
|
|
The results will be a list of tuples, in wich each tuple indicates the sucess
|
|
of the downloading process and the stored image path concatenated with the
|
|
checksum of the image ::
|
|
|
|
results = [(True, 'path#checksum'), ..., (False, Failure)]
|
|
|
|
The :meth:`~scrapy.contrib.pipeline.MediaPipeline.item_completed` is also in
|
|
charge of returning the output value to be used as the output of the pipeline
|
|
stage, so we must return (or drop) the item as in any pipeline.
|
|
|
|
We will override it to store the resulting image paths (passed in results) back
|
|
in the item::
|
|
|
|
def item_completed(self, results, item, info):
|
|
item['image_paths'] = [result.split('#')[0] for succes, result in results if succes]
|
|
|
|
return item
|
|
|
|
.. note:: This is a simplification of the actual process, it will be described
|
|
with more detail in upcoming sections.
|
|
|
|
So, the complete example of our pipeline looks like this::
|
|
|
|
from scrapy.contrib.pipeline.images import ImagesPipeline
|
|
|
|
class MyImagesPipeline(ImagesPipeline):
|
|
|
|
def get_media_requests(self, item, info):
|
|
for image_url in item['image_urls']:
|
|
yield Request(image_url)
|
|
|
|
def item_completed(self, results, item, info):
|
|
item['image_paths'] = [result.split('#')[0] for succes, result in results if succes]
|
|
|
|
return item
|
|
|
|
This is the most basic use of :class:`ImagesPipeline`, see upcoming sections for more details.
|
|
|
|
|
|
.. _topics-images-expiration:
|
|
|
|
Image expiration
|
|
-----------------
|
|
|
|
XXX
|
|
|
|
.. _topics-images-thumbnails:
|
|
|
|
Creating thumbnails
|
|
-------------------
|
|
|
|
As mentioned in the features, :class:`ImagesPipeline` can create thumbnails of
|
|
the processed images.
|
|
|
|
In order use this feature you must set the :attr:`~BaseImagesPipeline.THUMBS` to
|
|
a tuple of tuples, in wich each sub-tuple is a pair of thumb_id string and a
|
|
compatible python image library size (another tuple).
|
|
|
|
See ``thumbnail`` method at http://www.pythonware.com/library/pil/handbook/image.htm.
|
|
|
|
Example::
|
|
|
|
THUMBS = (
|
|
('50', (50, 50)),
|
|
('110', (110, 110)),
|
|
('270', (270, 270))
|
|
)
|
|
|
|
|
|
When you use this feature, :class:`ImagesPipeline` will create thumbnails of
|
|
the specified sizes in ``IMAGES_DIR/thumbs/<image_id>/<thumb_id>.jpg``, where
|
|
``<image_id>`` is the ``sha1`` digest of the url of the image and
|
|
``<thumb_id>`` is the thumb_id string specified in THUMBS attribute.
|
|
|
|
Example with previous THUMB attribute::
|
|
|
|
IMAGES_DIR/thumbs/image_sha1_digest/50.jpg
|
|
IMAGES_DIR/thumbs/image_sha1_digest/110.jpg
|
|
IMAGES_DIR/thumbs/image_sha1_digest/270.jpg
|
|
|
|
|
|
.. _topics-images-size:
|
|
|
|
Checking image size
|
|
-------------------
|
|
|
|
You can skip the processing of an image if its size is less than a specified
|
|
one. To use this set :setting:`IMAGES_MIN_HEIGHT` and/or
|
|
:setting:`IMAGES_MIN_WIDTH` to your likings::
|
|
|
|
IMAGES_MIN_HEIGHT = 270
|
|
IMAGES_MIN_WIDTH = 270
|
|
|
|
|
|
.. _ref-images:
|
|
|
|
Reference
|
|
=========
|
|
|
|
ImagesPipeline
|
|
--------------
|
|
|
|
.. class:: ImagesPipeline
|
|
|
|
:class:`BaseImagesPipeline` descendant with filesystem support as
|
|
image's store backend
|
|
|
|
In order to enable this pipeline you must set :setting:`IMAGES_DIR` to a
|
|
valid dirname that will be used for storing images.
|
|
|
|
|
|
BaseImagesPipeline
|
|
------------------
|
|
|
|
.. class:: BaseImagesPipeline
|
|
|
|
:class:`~scrapy.contrib.pipeline.MediaPipeline` descendant that implements
|
|
image downloading and thumbnail generation logic.
|
|
|
|
This pipeline tries to minimize network transfers and image processing,
|
|
doing stat of the images and determining if image is new, uptodate or
|
|
expired.
|
|
|
|
`'new'` images are those that pipeline never processed and needs to be
|
|
downloaded from supplier site the first time.
|
|
|
|
`'uptodate'` images are the ones that the pipeline processed and are still
|
|
valid images.
|
|
|
|
`'expired'` images are those that pipeline already processed but the last
|
|
modification was made long time ago, so a reprocessing is recommended to
|
|
refresh it in case of change.
|
|
|
|
:setting:`IMAGES_EXPIRES` setting controls the maximun days since an image
|
|
was modified to consider it `uptodate`.
|
|
|
|
Downloaded images are skipped if sizes aren't greater than
|
|
:setting:`IMAGES_MIN_WIDTH` and :setting:`IMAGES_MIN_HEIGHT` limit. A proper
|
|
log messages will be printed.
|
|
|
|
.. attribute:: THUMBS
|
|
|
|
Thumbnail generation configuration, see :ref:`topics-images-thumbnails`
|
|
|
|
.. method:: store_image(key, image, buf, info)
|
|
|
|
Override this method with specific code to persist an image.
|
|
|
|
This method is used to persist the full image and any defined
|
|
thumbnail, one a time.
|
|
|
|
Return value is ignored.
|
|
|
|
|
|
.. method:: stat_key(key, info)
|
|
|
|
Override this method with specific code to stat an image.
|
|
|
|
This method should return and dictionary with two parameters:
|
|
|
|
* ``last_modified``: the last modification time in seconds since the epoch
|
|
* ``checksum``: the md5sum of the content of the stored image if found
|
|
|
|
If an exception is raised or ``last_modified`` is ``None``, then the image
|
|
will be re-downloaded.
|
|
|
|
If the difference in days between last_modified and now is greater than
|
|
:setting:`IMAGES_EXPIRES` settings, then the image will be re-downloaded
|
|
|
|
The checksum value is appended to returned image path after a hash sign
|
|
(#), if ``checksum`` is ``None``, then nothing is appended including the
|
|
hash sign.
|
|
|
|
|
|
.. module:: scrapy.contrib.pipeline
|
|
|
|
MediaPipeline
|
|
-------------
|
|
|
|
.. class:: MediaPipeline
|
|
|
|
Generic pipeline that handles the media associated with an item.
|
|
|
|
.. method:: download(request, info)
|
|
|
|
Defines how to request the download of media.
|
|
|
|
Default gives high priority to media requests and use scheduler, shouldn't
|
|
be necessary to override.
|
|
|
|
This methods is called only if result for request isn't cached, request
|
|
fingerprint is used as cache key.
|
|
|
|
|
|
.. method:: media_to_download(request, info)
|
|
|
|
Ongoing request hook pre-cache.
|
|
|
|
This method is called every time a media is requested for download, and only
|
|
once for the same request because return value is cached as media result.
|
|
|
|
Returning a non-None value implies:
|
|
|
|
* the return value is cached and piped into :meth:`item_media_downloaded`
|
|
or :meth:`item_media_failed`
|
|
* prevents downloading, this means calling :meth:`download` method.
|
|
* :meth:`media_downloaded` or :meth:`media_failed` isn't called.
|
|
|
|
|
|
.. method:: get_media_requests(item, info)
|
|
|
|
Return a list of Request objects to download for this item.
|
|
|
|
Should return ``None`` or an iterable.
|
|
|
|
Defaults return ``None`` (no media to download)
|
|
|
|
|
|
.. method:: media_downloaded(response, request, info)
|
|
|
|
Method called on success download of media request
|
|
|
|
Return value is cached and used as input for
|
|
:meth:`item_media_downloaded` method. Default implementation returns
|
|
``None``.
|
|
|
|
WARNING: returning the response object can eat your memory.
|
|
|
|
|
|
.. method:: media_failed(failure, request, info)
|
|
|
|
Method called when media request failed due to any kind of download error.
|
|
|
|
Return value is cached and used as input for :meth:`item_media_failed` method.
|
|
|
|
Default implementation returns same Failure object.
|
|
|
|
|
|
.. method:: item_media_downloaded(result, item, request, info)
|
|
|
|
Method to handle result of requested media for item.
|
|
|
|
``result`` is the return value of :meth:`media_downloaded` hook, or the
|
|
non-Failure instance returned by :meth:`media_failed` hook.
|
|
|
|
Return value of this method isn't important and is recommended to return
|
|
``None``.
|
|
|
|
|
|
.. method:: item_media_failed(failure, item, request, info)
|
|
|
|
Method to handle failed result of requested media for item.
|
|
|
|
result is the returned Failure instance of :meth:`media_failed` hook, or Failure
|
|
instance of an exception raised by :meth:`media_downloaded` hook.
|
|
|
|
Return value of this method isn't important and is recommended to return
|
|
``None``.
|
|
|
|
|
|
.. method:: item_completed(results, item, info)
|
|
|
|
Method called when all media requests for a single item has returned a result
|
|
or failure.
|
|
|
|
The return value of this method is used as output of pipeline stage.
|
|
|
|
:meth:`item_completed` can return item itself or raise
|
|
:exc:`~scrapy.core.exceptions.DropItem` exception.
|
|
|
|
Default returns item
|
|
|