Mirror of https://github.com/scrapy/scrapy.git
Move lists closer to their introducing paragraph
commit 1773eaf5dc (parent 904a50138b)
docs/_static/custom.css (vendored, new file, +10 lines)
@@ -0,0 +1,10 @@
/* Move lists closer to their introducing paragraph */
.rst-content .section ol p, .rst-content .section ul p {
    margin-bottom: 0px;
}
.rst-content p + ol, .rst-content p + ul {
    margin-top: -18px; /* Compensates margin-top: 24px of p */
}
.rst-content dl p + ol, .rst-content dl p + ul {
    margin-top: -6px; /* Compensates margin-top: 12px of p */
}
@@ -122,7 +122,6 @@ html_theme = 'sphinx_rtd_theme'
import sphinx_rtd_theme
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]


# The style sheet to use for HTML and HTML Help pages. A file of that name
# must exist either in Sphinx' static/ path, or in one of the custom paths
# given in html_static_path.
@@ -183,6 +182,10 @@ html_copy_source = True
# Output file base name for HTML help builder.
htmlhelp_basename = 'Scrapydoc'

html_css_files = [
    'custom.css',
]


# Options for LaTeX output
# ------------------------

@@ -64,10 +64,10 @@ NotConfigured
This exception can be raised by some components to indicate that they will
remain disabled. Those components include:

* Extensions
* Item pipelines
* Downloader middlewares
* Spider middlewares
- Extensions
- Item pipelines
- Downloader middlewares
- Spider middlewares

The exception must be raised in the component's ``__init__`` method.

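To make the last line above concrete, here is a minimal sketch of a component that disables itself; the class and the ``MYEXT_ENABLED`` setting are made-up names for illustration, not part of Scrapy:

    from scrapy.exceptions import NotConfigured

    class MyExtension:
        def __init__(self, settings):
            # Raising NotConfigured in __init__ tells Scrapy to keep this
            # component disabled instead of treating it as an error.
            if not settings.getbool("MYEXT_ENABLED"):
                raise NotConfigured("MYEXT_ENABLED is off")
            self.settings = settings

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings)
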
@@ -21,10 +21,10 @@ Serialization formats
For serializing the scraped data, the feed exports use the :ref:`Item exporters
<topics-exporters>`. These formats are supported out of the box:

* :ref:`topics-feed-format-json`
* :ref:`topics-feed-format-jsonlines`
* :ref:`topics-feed-format-csv`
* :ref:`topics-feed-format-xml`
- :ref:`topics-feed-format-json`
- :ref:`topics-feed-format-jsonlines`
- :ref:`topics-feed-format-csv`
- :ref:`topics-feed-format-xml`

But you can also extend the supported format through the
:setting:`FEED_EXPORTERS` setting.
@@ -34,54 +34,58 @@ But you can also extend the supported format through the
JSON
----

* Value for the ``format`` key in the :setting:`FEEDS` setting: ``json``
* Exporter used: :class:`~scrapy.exporters.JsonItemExporter`
* See :ref:`this warning <json-with-large-data>` if you're using JSON with
  large feeds.
- Value for the ``format`` key in the :setting:`FEEDS` setting: ``json``

- Exporter used: :class:`~scrapy.exporters.JsonItemExporter`

- See :ref:`this warning <json-with-large-data>` if you're using JSON with
  large feeds.

.. _topics-feed-format-jsonlines:

JSON lines
----------

* Value for the ``format`` key in the :setting:`FEEDS` setting: ``jsonlines``
* Exporter used: :class:`~scrapy.exporters.JsonLinesItemExporter`
- Value for the ``format`` key in the :setting:`FEEDS` setting: ``jsonlines``
- Exporter used: :class:`~scrapy.exporters.JsonLinesItemExporter`

.. _topics-feed-format-csv:

CSV
---

* Value for the ``format`` key in the :setting:`FEEDS` setting: ``csv``
* Exporter used: :class:`~scrapy.exporters.CsvItemExporter`
* To specify columns to export and their order use
  :setting:`FEED_EXPORT_FIELDS`. Other feed exporters can also use this
  option, but it is important for CSV because unlike many other export
  formats CSV uses a fixed header.
- Value for the ``format`` key in the :setting:`FEEDS` setting: ``csv``

- Exporter used: :class:`~scrapy.exporters.CsvItemExporter`

- To specify columns to export and their order use
  :setting:`FEED_EXPORT_FIELDS`. Other feed exporters can also use this
  option, but it is important for CSV because unlike many other export
  formats CSV uses a fixed header.

.. _topics-feed-format-xml:

XML
---

* Value for the ``format`` key in the :setting:`FEEDS` setting: ``xml``
* Exporter used: :class:`~scrapy.exporters.XmlItemExporter`
- Value for the ``format`` key in the :setting:`FEEDS` setting: ``xml``
- Exporter used: :class:`~scrapy.exporters.XmlItemExporter`

.. _topics-feed-format-pickle:

Pickle
------

* Value for the ``format`` key in the :setting:`FEEDS` setting: ``pickle``
* Exporter used: :class:`~scrapy.exporters.PickleItemExporter`
- Value for the ``format`` key in the :setting:`FEEDS` setting: ``pickle``
- Exporter used: :class:`~scrapy.exporters.PickleItemExporter`

.. _topics-feed-format-marshal:

Marshal
-------

* Value for the ``format`` key in the :setting:`FEEDS` setting: ``marshal``
* Exporter used: :class:`~scrapy.exporters.MarshalItemExporter`
- Value for the ``format`` key in the :setting:`FEEDS` setting: ``marshal``
- Exporter used: :class:`~scrapy.exporters.MarshalItemExporter`


.. _topics-feed-storage:

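A sketch of how the ``format`` key described above is used in practice; the file names are illustrative, while the keys are the Scrapy settings the text refers to:

    # settings.py
    FEEDS = {
        "items.json": {"format": "json"},
        "items.csv": {"format": "csv"},
    }
    # Fixes the CSV column order; other exporters may use it too.
    FEED_EXPORT_FIELDS = ["name", "price"]
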
@@ -95,11 +99,11 @@ storage backend types which are defined by the URI scheme.

The storage backends supported out of the box are:

* :ref:`topics-feed-storage-fs`
* :ref:`topics-feed-storage-ftp`
* :ref:`topics-feed-storage-s3` (requires botocore_)
* :ref:`topics-feed-storage-gcs` (requires `google-cloud-storage`_)
* :ref:`topics-feed-storage-stdout`
- :ref:`topics-feed-storage-fs`
- :ref:`topics-feed-storage-ftp`
- :ref:`topics-feed-storage-s3` (requires botocore_)
- :ref:`topics-feed-storage-gcs` (requires `google-cloud-storage`_)
- :ref:`topics-feed-storage-stdout`

Some storage backends may be unavailable if the required external libraries are
not available. For example, the S3 backend is only available if the botocore_
@@ -114,8 +118,8 @@ Storage URI parameters
The storage URI can also contain parameters that get replaced when the feed is
being created. These parameters are:

* ``%(time)s`` - gets replaced by a timestamp when the feed is being created
* ``%(name)s`` - gets replaced by the spider name
- ``%(time)s`` - gets replaced by a timestamp when the feed is being created
- ``%(name)s`` - gets replaced by the spider name

Any other named parameter gets replaced by the spider attribute of the same
name. For example, ``%(site_id)s`` would get replaced by the ``spider.site_id``
@@ -123,13 +127,13 @@ attribute the moment the feed is being created.

Here are some examples to illustrate:

* Store in FTP using one directory per spider:
- Store in FTP using one directory per spider:

  * ``ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json``
  - ``ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json``

* Store in S3 using one directory per spider:
- Store in S3 using one directory per spider:

  * ``s3://mybucket/scraping/feeds/%(name)s/%(time)s.json``
  - ``s3://mybucket/scraping/feeds/%(name)s/%(time)s.json``


.. _topics-feed-storage-backends:

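A sketch of how a custom spider attribute such as ``site_id`` ends up in the feed URI described above; the spider name, attribute value and path are made up:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        site_id = 42  # any spider attribute can be referenced as %(site_id)s
        custom_settings = {
            "FEEDS": {
                "exports/%(site_id)s/%(name)s-%(time)s.json": {"format": "json"},
            },
        }
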
@@ -144,9 +148,9 @@ Local filesystem

The feeds are stored in the local filesystem.

* URI scheme: ``file``
* Example URI: ``file:///tmp/export.csv``
* Required external libraries: none
- URI scheme: ``file``
- Example URI: ``file:///tmp/export.csv``
- Required external libraries: none

Note that for the local filesystem storage (only) you can omit the scheme if
you specify an absolute path like ``/tmp/export.csv``. This only works on Unix
@@ -159,9 +163,9 @@ FTP

The feeds are stored on an FTP server.

* URI scheme: ``ftp``
* Example URI: ``ftp://user:pass@ftp.example.com/path/to/export.csv``
* Required external libraries: none
- URI scheme: ``ftp``
- Example URI: ``ftp://user:pass@ftp.example.com/path/to/export.csv``
- Required external libraries: none

FTP supports two different connection modes: `active or passive
<https://stackoverflow.com/a/1699163>`_. Scrapy uses the passive connection
@@ -178,23 +182,25 @@ S3

The feeds are stored on `Amazon S3`_.

* URI scheme: ``s3``
* Example URIs:
- URI scheme: ``s3``

  * ``s3://mybucket/path/to/export.csv``
  * ``s3://aws_key:aws_secret@mybucket/path/to/export.csv``
- Example URIs:

* Required external libraries: `botocore`_ >= 1.4.87
  - ``s3://mybucket/path/to/export.csv``

  - ``s3://aws_key:aws_secret@mybucket/path/to/export.csv``

- Required external libraries: `botocore`_ >= 1.4.87

The AWS credentials can be passed as user/password in the URI, or they can be
passed through the following settings:

* :setting:`AWS_ACCESS_KEY_ID`
* :setting:`AWS_SECRET_ACCESS_KEY`
- :setting:`AWS_ACCESS_KEY_ID`
- :setting:`AWS_SECRET_ACCESS_KEY`

You can also define a custom ACL for exported feeds using this setting:

* :setting:`FEED_STORAGE_S3_ACL`
- :setting:`FEED_STORAGE_S3_ACL`

This storage backend uses :ref:`delayed file delivery <delayed-file-delivery>`.

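If you prefer settings over URI credentials, a minimal sketch with placeholder values; the setting names are the ones listed above:

    # settings.py
    AWS_ACCESS_KEY_ID = "AKIA..."      # placeholder
    AWS_SECRET_ACCESS_KEY = "..."      # placeholder
    FEED_STORAGE_S3_ACL = "private"    # optional custom ACL
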
@@ -208,19 +214,20 @@ Google Cloud Storage (GCS)

The feeds are stored on `Google Cloud Storage`_.

* URI scheme: ``gs``
* Example URIs:
- URI scheme: ``gs``

  * ``gs://mybucket/path/to/export.csv``
- Example URIs:

* Required external libraries: `google-cloud-storage`_.
  - ``gs://mybucket/path/to/export.csv``

- Required external libraries: `google-cloud-storage`_.

For more information about authentication, please refer to `Google Cloud documentation <https://cloud.google.com/docs/authentication/production>`_.

You can set a *Project ID* and *Access Control List (ACL)* through the following settings:

* :setting:`FEED_STORAGE_GCS_ACL`
* :setting:`GCS_PROJECT_ID`
- :setting:`FEED_STORAGE_GCS_ACL`
- :setting:`GCS_PROJECT_ID`

This storage backend uses :ref:`delayed file delivery <delayed-file-delivery>`.

@@ -234,9 +241,9 @@ Standard output

The feeds are written to the standard output of the Scrapy process.

* URI scheme: ``stdout``
* Example URI: ``stdout:``
* Required external libraries: none
- URI scheme: ``stdout``
- Example URI: ``stdout:``
- Required external libraries: none


.. _delayed-file-delivery:
@@ -264,16 +271,16 @@ Settings

These are the settings used for configuring the feed exports:

* :setting:`FEEDS` (mandatory)
* :setting:`FEED_EXPORT_ENCODING`
* :setting:`FEED_STORE_EMPTY`
* :setting:`FEED_EXPORT_FIELDS`
* :setting:`FEED_EXPORT_INDENT`
* :setting:`FEED_STORAGES`
* :setting:`FEED_STORAGE_FTP_ACTIVE`
* :setting:`FEED_STORAGE_S3_ACL`
* :setting:`FEED_EXPORTERS`
* :setting:`FEED_EXPORT_BATCH_ITEM_COUNT`
- :setting:`FEEDS` (mandatory)
- :setting:`FEED_EXPORT_ENCODING`
- :setting:`FEED_STORE_EMPTY`
- :setting:`FEED_EXPORT_FIELDS`
- :setting:`FEED_EXPORT_INDENT`
- :setting:`FEED_STORAGES`
- :setting:`FEED_STORAGE_FTP_ACTIVE`
- :setting:`FEED_STORAGE_S3_ACL`
- :setting:`FEED_EXPORTERS`
- :setting:`FEED_EXPORT_BATCH_ITEM_COUNT`

.. currentmodule:: scrapy.extensions.feedexport

@@ -8,14 +8,14 @@ When you're scraping web pages, the most common task you need to perform is
to extract data from the HTML source. There are several libraries available to
achieve this, such as:

* `BeautifulSoup`_ is a very popular web scraping library among Python
  programmers which constructs a Python object based on the structure of the
  HTML code and also deals with bad markup reasonably well, but it has one
  drawback: it's slow.
- `BeautifulSoup`_ is a very popular web scraping library among Python
  programmers which constructs a Python object based on the structure of the
  HTML code and also deals with bad markup reasonably well, but it has one
  drawback: it's slow.

* `lxml`_ is an XML parsing library (which also parses HTML) with a pythonic
  API based on :mod:`~xml.etree.ElementTree`. (lxml is not part of the Python standard
  library.)
- `lxml`_ is an XML parsing library (which also parses HTML) with a pythonic
  API based on :mod:`~xml.etree.ElementTree`. (lxml is not part of the Python
  standard library.)

Scrapy comes with its own mechanism for extracting data. They're called
selectors because they "select" certain parts of the HTML document specified

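A quick sketch of those selectors in action; the expressions and the variable name are illustrative:

    # inside a spider callback
    title = response.css("title::text").get()
    # equivalent XPath form
    title = response.xpath("//title/text()").get()
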
@@ -95,20 +95,21 @@ convenience.
Available Shortcuts
-------------------

* ``shelp()`` - print a help with the list of available objects and shortcuts
- ``shelp()`` - print a help with the list of available objects and
  shortcuts

* ``fetch(url[, redirect=True])`` - fetch a new response from the given
  URL and update all related objects accordingly. You can optionally ask for
  HTTP 3xx redirections to not be followed by passing ``redirect=False``
- ``fetch(url[, redirect=True])`` - fetch a new response from the given URL
  and update all related objects accordingly. You can optionally ask for HTTP
  3xx redirections to not be followed by passing ``redirect=False``

* ``fetch(request)`` - fetch a new response from the given request and
  update all related objects accordingly.
- ``fetch(request)`` - fetch a new response from the given request and update
  all related objects accordingly.

* ``view(response)`` - open the given response in your local web browser, for
  inspection. This will add a `\<base\> tag`_ to the response body in order
  for external links (such as images and style sheets) to display properly.
  Note, however, that this will create a temporary file on your computer,
  which won't be removed automatically.
- ``view(response)`` - open the given response in your local web browser, for
  inspection. This will add a `\<base\> tag`_ to the response body in order
  for external links (such as images and style sheets) to display properly.
  Note, however, that this will create a temporary file on your computer,
  which won't be removed automatically.

.. _<base> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base

@@ -122,21 +123,21 @@ content).

Those objects are:

* ``crawler`` - the current :class:`~scrapy.crawler.Crawler` object.
- ``crawler`` - the current :class:`~scrapy.crawler.Crawler` object.

* ``spider`` - the Spider which is known to handle the URL, or a
  :class:`~scrapy.spiders.Spider` object if there is no spider found for
  the current URL
- ``spider`` - the Spider which is known to handle the URL, or a
  :class:`~scrapy.spiders.Spider` object if there is no spider found for the
  current URL

* ``request`` - a :class:`~scrapy.http.Request` object of the last fetched
  page. You can modify this request using :meth:`~scrapy.http.Request.replace`
  or fetch a new request (without leaving the shell) using the ``fetch``
  shortcut.
- ``request`` - a :class:`~scrapy.http.Request` object of the last fetched
  page. You can modify this request using
  :meth:`~scrapy.http.Request.replace` or fetch a new request (without
  leaving the shell) using the ``fetch`` shortcut.

* ``response`` - a :class:`~scrapy.http.Response` object containing the last
  fetched page
- ``response`` - a :class:`~scrapy.http.Response` object containing the last
  fetched page

* ``settings`` - the current :ref:`Scrapy settings <topics-settings>`
- ``settings`` - the current :ref:`Scrapy settings <topics-settings>`

Example of shell session
========================
