mirror of https://github.com/scrapy/scrapy.git synced 2025-03-15 13:12:09 +00:00

Move lists closer to their introducing paragraph

Adrián Chaves 2021-02-03 11:43:18 +01:00
parent 904a50138b
commit 1773eaf5dc
6 changed files with 121 additions and 100 deletions

docs/_static/custom.css (new file)

@@ -0,0 +1,10 @@
/* Move lists closer to their introducing paragraph */
.rst-content .section ol p, .rst-content .section ul p {
margin-bottom: 0px;
}
.rst-content p + ol, .rst-content p + ul {
margin-top: -18px; /* Compensates margin-top: 24px of p */
}
.rst-content dl p + ol, .rst-content dl p + ul {
margin-top: -6px; /* Compensates margin-top: 12px of p */
}

docs/conf.py

@@ -122,7 +122,6 @@ html_theme = 'sphinx_rtd_theme'
import sphinx_rtd_theme
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
# The style sheet to use for HTML and HTML Help pages. A file of that name
# must exist either in Sphinx' static/ path, or in one of the custom paths
# given in html_static_path.
@@ -183,6 +182,10 @@ html_copy_source = True
# Output file base name for HTML help builder.
htmlhelp_basename = 'Scrapydoc'
html_css_files = [
'custom.css',
]
# Options for LaTeX output
# ------------------------
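For context, a minimal sketch of how the new stylesheet is expected to be picked up by Sphinx; the ``html_static_path`` line is an assumption about the surrounding conf.py and is not part of this diff:

    # docs/conf.py (sketch, not the full file)
    html_static_path = ['_static']   # assumed: makes docs/_static/custom.css available
    html_css_files = [
        'custom.css',                # resolved relative to html_static_path entries
    ]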

docs/topics/exceptions.rst

@@ -64,10 +64,10 @@ NotConfigured
This exception can be raised by some components to indicate that they will
remain disabled. Those components include:
* Extensions
* Item pipelines
* Downloader middlewares
* Spider middlewares
- Extensions
- Item pipelines
- Downloader middlewares
- Spider middlewares
The exception must be raised in the component's ``__init__`` method.
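To illustrate the rule documented above, a hedged sketch of a hypothetical item pipeline that disables itself; the pipeline class and setting names are made up, only ``scrapy.exceptions.NotConfigured`` and the ``from_crawler`` pattern are real:

    from scrapy.exceptions import NotConfigured

    class MyStoragePipeline:  # hypothetical component, not part of Scrapy
        def __init__(self, endpoint):
            if not endpoint:
                # Raising NotConfigured here tells Scrapy to disable the component.
                raise NotConfigured("MYSTORAGE_ENDPOINT is not set")
            self.endpoint = endpoint

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings.get("MYSTORAGE_ENDPOINT"))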

docs/topics/feed-exports.rst

@@ -21,10 +21,10 @@ Serialization formats
For serializing the scraped data, the feed exports use the :ref:`Item exporters
<topics-exporters>`. These formats are supported out of the box:
* :ref:`topics-feed-format-json`
* :ref:`topics-feed-format-jsonlines`
* :ref:`topics-feed-format-csv`
* :ref:`topics-feed-format-xml`
- :ref:`topics-feed-format-json`
- :ref:`topics-feed-format-jsonlines`
- :ref:`topics-feed-format-csv`
- :ref:`topics-feed-format-xml`
But you can also extend the supported format through the
:setting:`FEED_EXPORTERS` setting.
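For reference, a minimal sketch of selecting built-in formats through the :setting:`FEEDS` setting; the output file names are placeholders:

    # settings.py (sketch)
    FEEDS = {
        "items.json": {"format": "json"},
        "items.csv": {"format": "csv"},
    }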
@@ -34,54 +34,58 @@ But you can also extend the supported format through the
JSON
----
* Value for the ``format`` key in the :setting:`FEEDS` setting: ``json``
* Exporter used: :class:`~scrapy.exporters.JsonItemExporter`
* See :ref:`this warning <json-with-large-data>` if you're using JSON with
large feeds.
- Value for the ``format`` key in the :setting:`FEEDS` setting: ``json``
- Exporter used: :class:`~scrapy.exporters.JsonItemExporter`
- See :ref:`this warning <json-with-large-data>` if you're using JSON with
large feeds.
.. _topics-feed-format-jsonlines:
JSON lines
----------
* Value for the ``format`` key in the :setting:`FEEDS` setting: ``jsonlines``
* Exporter used: :class:`~scrapy.exporters.JsonLinesItemExporter`
- Value for the ``format`` key in the :setting:`FEEDS` setting: ``jsonlines``
- Exporter used: :class:`~scrapy.exporters.JsonLinesItemExporter`
.. _topics-feed-format-csv:
CSV
---
* Value for the ``format`` key in the :setting:`FEEDS` setting: ``csv``
* Exporter used: :class:`~scrapy.exporters.CsvItemExporter`
* To specify columns to export and their order use
:setting:`FEED_EXPORT_FIELDS`. Other feed exporters can also use this
option, but it is important for CSV because unlike many other export
formats CSV uses a fixed header.
- Value for the ``format`` key in the :setting:`FEEDS` setting: ``csv``
- Exporter used: :class:`~scrapy.exporters.CsvItemExporter`
- To specify columns to export and their order use
:setting:`FEED_EXPORT_FIELDS`. Other feed exporters can also use this
option, but it is important for CSV because unlike many other export
formats CSV uses a fixed header.
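A hedged sketch of the CSV-specific advice above, fixing the exported columns and their order; the field names are placeholders:

    # settings.py (sketch)
    FEEDS = {
        "items.csv": {"format": "csv"},
    }
    FEED_EXPORT_FIELDS = ["name", "price", "url"]  # column order of the CSV header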
.. _topics-feed-format-xml:
XML
---
* Value for the ``format`` key in the :setting:`FEEDS` setting: ``xml``
* Exporter used: :class:`~scrapy.exporters.XmlItemExporter`
- Value for the ``format`` key in the :setting:`FEEDS` setting: ``xml``
- Exporter used: :class:`~scrapy.exporters.XmlItemExporter`
.. _topics-feed-format-pickle:
Pickle
------
* Value for the ``format`` key in the :setting:`FEEDS` setting: ``pickle``
* Exporter used: :class:`~scrapy.exporters.PickleItemExporter`
- Value for the ``format`` key in the :setting:`FEEDS` setting: ``pickle``
- Exporter used: :class:`~scrapy.exporters.PickleItemExporter`
.. _topics-feed-format-marshal:
Marshal
-------
* Value for the ``format`` key in the :setting:`FEEDS` setting: ``marshal``
* Exporter used: :class:`~scrapy.exporters.MarshalItemExporter`
- Value for the ``format`` key in the :setting:`FEEDS` setting: ``marshal``
- Exporter used: :class:`~scrapy.exporters.MarshalItemExporter`
.. _topics-feed-storage:
@@ -95,11 +99,11 @@ storage backend types which are defined by the URI scheme.
The storage backends supported out of the box are:
* :ref:`topics-feed-storage-fs`
* :ref:`topics-feed-storage-ftp`
* :ref:`topics-feed-storage-s3` (requires botocore_)
* :ref:`topics-feed-storage-gcs` (requires `google-cloud-storage`_)
* :ref:`topics-feed-storage-stdout`
- :ref:`topics-feed-storage-fs`
- :ref:`topics-feed-storage-ftp`
- :ref:`topics-feed-storage-s3` (requires botocore_)
- :ref:`topics-feed-storage-gcs` (requires `google-cloud-storage`_)
- :ref:`topics-feed-storage-stdout`
Some storage backends may be unavailable if the required external libraries are
not available. For example, the S3 backend is only available if the botocore_
@@ -114,8 +118,8 @@ Storage URI parameters
The storage URI can also contain parameters that get replaced when the feed is
being created. These parameters are:
* ``%(time)s`` - gets replaced by a timestamp when the feed is being created
* ``%(name)s`` - gets replaced by the spider name
- ``%(time)s`` - gets replaced by a timestamp when the feed is being created
- ``%(name)s`` - gets replaced by the spider name
Any other named parameter gets replaced by the spider attribute of the same
name. For example, ``%(site_id)s`` would get replaced by the ``spider.site_id``
@@ -123,13 +127,13 @@ attribute the moment the feed is being created.
Here are some examples to illustrate:
* Store in FTP using one directory per spider:
- Store in FTP using one directory per spider:
* ``ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json``
- ``ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json``
* Store in S3 using one directory per spider:
- Store in S3 using one directory per spider:
* ``s3://mybucket/scraping/feeds/%(name)s/%(time)s.json``
- ``s3://mybucket/scraping/feeds/%(name)s/%(time)s.json``
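The same ``%(name)s`` and ``%(time)s`` placeholders expressed as a :setting:`FEEDS` entry, a sketch based on the S3 example above:

    # settings.py (sketch)
    FEEDS = {
        "s3://mybucket/scraping/feeds/%(name)s/%(time)s.json": {"format": "json"},
    }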
.. _topics-feed-storage-backends:
@@ -144,9 +148,9 @@ Local filesystem
The feeds are stored in the local filesystem.
* URI scheme: ``file``
* Example URI: ``file:///tmp/export.csv``
* Required external libraries: none
- URI scheme: ``file``
- Example URI: ``file:///tmp/export.csv``
- Required external libraries: none
Note that for the local filesystem storage (only) you can omit the scheme if
you specify an absolute path like ``/tmp/export.csv``. This only works on Unix
@@ -159,9 +163,9 @@ FTP
The feeds are stored on an FTP server.
* URI scheme: ``ftp``
* Example URI: ``ftp://user:pass@ftp.example.com/path/to/export.csv``
* Required external libraries: none
- URI scheme: ``ftp``
- Example URI: ``ftp://user:pass@ftp.example.com/path/to/export.csv``
- Required external libraries: none
FTP supports two different connection modes: `active or passive
<https://stackoverflow.com/a/1699163>`_. Scrapy uses the passive connection
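For the FTP backend described in the hunk above, a sketch of opting into active connection mode; the URI is the placeholder from the text:

    # settings.py (sketch)
    FEEDS = {
        "ftp://user:pass@ftp.example.com/path/to/export.csv": {"format": "csv"},
    }
    FEED_STORAGE_FTP_ACTIVE = True  # Scrapy defaults to passive FTP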
@@ -178,23 +182,25 @@ S3
The feeds are stored on `Amazon S3`_.
* URI scheme: ``s3``
* Example URIs:
- URI scheme: ``s3``
* ``s3://mybucket/path/to/export.csv``
* ``s3://aws_key:aws_secret@mybucket/path/to/export.csv``
- Example URIs:
* Required external libraries: `botocore`_ >= 1.4.87
- ``s3://mybucket/path/to/export.csv``
- ``s3://aws_key:aws_secret@mybucket/path/to/export.csv``
- Required external libraries: `botocore`_ >= 1.4.87
The AWS credentials can be passed as user/password in the URI, or they can be
passed through the following settings:
* :setting:`AWS_ACCESS_KEY_ID`
* :setting:`AWS_SECRET_ACCESS_KEY`
- :setting:`AWS_ACCESS_KEY_ID`
- :setting:`AWS_SECRET_ACCESS_KEY`
You can also define a custom ACL for exported feeds using this setting:
* :setting:`FEED_STORAGE_S3_ACL`
- :setting:`FEED_STORAGE_S3_ACL`
This storage backend uses :ref:`delayed file delivery <delayed-file-delivery>`.
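A sketch of the settings-based alternative to embedding AWS credentials in the URI; all values are placeholders:

    # settings.py (sketch)
    AWS_ACCESS_KEY_ID = "AKIA..."        # placeholder
    AWS_SECRET_ACCESS_KEY = "..."        # placeholder
    FEED_STORAGE_S3_ACL = "private"      # optional custom ACL
    FEEDS = {
        "s3://mybucket/path/to/export.csv": {"format": "csv"},
    }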
@@ -208,19 +214,20 @@ Google Cloud Storage (GCS)
The feeds are stored on `Google Cloud Storage`_.
* URI scheme: ``gs``
* Example URIs:
- URI scheme: ``gs``
* ``gs://mybucket/path/to/export.csv``
- Example URIs:
* Required external libraries: `google-cloud-storage`_.
- ``gs://mybucket/path/to/export.csv``
- Required external libraries: `google-cloud-storage`_.
For more information about authentication, please refer to `Google Cloud documentation <https://cloud.google.com/docs/authentication/production>`_.
You can set a *Project ID* and *Access Control List (ACL)* through the following settings:
* :setting:`FEED_STORAGE_GCS_ACL`
* :setting:`GCS_PROJECT_ID`
- :setting:`FEED_STORAGE_GCS_ACL`
- :setting:`GCS_PROJECT_ID`
This storage backend uses :ref:`delayed file delivery <delayed-file-delivery>`.
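Similarly for GCS, a sketch using the two settings listed above; the project ID and ACL values are placeholders:

    # settings.py (sketch)
    GCS_PROJECT_ID = "my-project-id"     # placeholder
    FEED_STORAGE_GCS_ACL = "publicRead"  # one of the predefined GCS ACL names
    FEEDS = {
        "gs://mybucket/path/to/export.csv": {"format": "csv"},
    }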
@@ -234,9 +241,9 @@ Standard output
The feeds are written to the standard output of the Scrapy process.
* URI scheme: ``stdout``
* Example URI: ``stdout:``
* Required external libraries: none
- URI scheme: ``stdout``
- Example URI: ``stdout:``
- Required external libraries: none
.. _delayed-file-delivery:
@@ -264,16 +271,16 @@ Settings
These are the settings used for configuring the feed exports:
* :setting:`FEEDS` (mandatory)
* :setting:`FEED_EXPORT_ENCODING`
* :setting:`FEED_STORE_EMPTY`
* :setting:`FEED_EXPORT_FIELDS`
* :setting:`FEED_EXPORT_INDENT`
* :setting:`FEED_STORAGES`
* :setting:`FEED_STORAGE_FTP_ACTIVE`
* :setting:`FEED_STORAGE_S3_ACL`
* :setting:`FEED_EXPORTERS`
* :setting:`FEED_EXPORT_BATCH_ITEM_COUNT`
- :setting:`FEEDS` (mandatory)
- :setting:`FEED_EXPORT_ENCODING`
- :setting:`FEED_STORE_EMPTY`
- :setting:`FEED_EXPORT_FIELDS`
- :setting:`FEED_EXPORT_INDENT`
- :setting:`FEED_STORAGES`
- :setting:`FEED_STORAGE_FTP_ACTIVE`
- :setting:`FEED_STORAGE_S3_ACL`
- :setting:`FEED_EXPORTERS`
- :setting:`FEED_EXPORT_BATCH_ITEM_COUNT`
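A hedged sketch combining a few of the settings listed above; the values are illustrative only:

    # settings.py (sketch)
    FEEDS = {
        "items.json": {"format": "json"},
    }
    FEED_EXPORT_ENCODING = "utf-8"
    FEED_EXPORT_INDENT = 2
    FEED_STORE_EMPTY = False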
.. currentmodule:: scrapy.extensions.feedexport

docs/topics/selectors.rst

@@ -8,14 +8,14 @@ When you're scraping web pages, the most common task you need to perform is
to extract data from the HTML source. There are several libraries available to
achieve this, such as:
* `BeautifulSoup`_ is a very popular web scraping library among Python
programmers which constructs a Python object based on the structure of the
HTML code and also deals with bad markup reasonably well, but it has one
drawback: it's slow.
- `BeautifulSoup`_ is a very popular web scraping library among Python
programmers which constructs a Python object based on the structure of the
HTML code and also deals with bad markup reasonably well, but it has one
drawback: it's slow.
* `lxml`_ is an XML parsing library (which also parses HTML) with a pythonic
API based on :mod:`~xml.etree.ElementTree`. (lxml is not part of the Python standard
library.)
- `lxml`_ is an XML parsing library (which also parses HTML) with a pythonic
API based on :mod:`~xml.etree.ElementTree`. (lxml is not part of the Python
standard library.)
Scrapy comes with its own mechanism for extracting data. They're called
selectors because they "select" certain parts of the HTML document specified
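A minimal sketch of the selectors introduced in the hunk above, using a made-up HTML snippet:

    from scrapy.selector import Selector

    html = "<html><body><span>good</span></body></html>"
    print(Selector(text=html).css("span::text").get())       # good
    print(Selector(text=html).xpath("//span/text()").get())  # good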

docs/topics/shell.rst

@@ -95,20 +95,21 @@ convenience.
Available Shortcuts
-------------------
* ``shelp()`` - print a help with the list of available objects and shortcuts
- ``shelp()`` - print a help with the list of available objects and
shortcuts
* ``fetch(url[, redirect=True])`` - fetch a new response from the given
URL and update all related objects accordingly. You can optionally ask for
HTTP 3xx redirections to not be followed by passing ``redirect=False``
- ``fetch(url[, redirect=True])`` - fetch a new response from the given URL
and update all related objects accordingly. You can optionally ask for HTTP
3xx redirections to not be followed by passing ``redirect=False``
* ``fetch(request)`` - fetch a new response from the given request and
update all related objects accordingly.
- ``fetch(request)`` - fetch a new response from the given request and update
all related objects accordingly.
* ``view(response)`` - open the given response in your local web browser, for
inspection. This will add a `\<base\> tag`_ to the response body in order
for external links (such as images and style sheets) to display properly.
Note, however, that this will create a temporary file on your computer,
which won't be removed automatically.
- ``view(response)`` - open the given response in your local web browser, for
inspection. This will add a `\<base\> tag`_ to the response body in order
for external links (such as images and style sheets) to display properly.
Note, however, that this will create a temporary file in your computer,
which won't be removed automatically.
.. _<base> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base
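A hedged sketch of these shortcuts in an interactive session (started, for example, with ``scrapy shell "https://example.com"``); the URL is a placeholder:

    >>> shelp()                                       # list objects and shortcuts
    >>> fetch("https://example.com", redirect=False)  # re-fetch, ignore 3xx redirects
    >>> fetch(request.replace(method="HEAD"))         # fetch a modified request
    >>> view(response)                                # open the response in a browser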
@@ -122,21 +123,21 @@ content).
Those objects are:
* ``crawler`` - the current :class:`~scrapy.crawler.Crawler` object.
- ``crawler`` - the current :class:`~scrapy.crawler.Crawler` object.
* ``spider`` - the Spider which is known to handle the URL, or a
:class:`~scrapy.spiders.Spider` object if there is no spider found for
the current URL
- ``spider`` - the Spider which is known to handle the URL, or a
:class:`~scrapy.spiders.Spider` object if there is no spider found for the
current URL
* ``request`` - a :class:`~scrapy.http.Request` object of the last fetched
page. You can modify this request using :meth:`~scrapy.http.Request.replace`
or fetch a new request (without leaving the shell) using the ``fetch``
shortcut.
- ``request`` - a :class:`~scrapy.http.Request` object of the last fetched
page. You can modify this request using
:meth:`~scrapy.http.Request.replace` or fetch a new request (without
leaving the shell) using the ``fetch`` shortcut.
* ``response`` - a :class:`~scrapy.http.Response` object containing the last
fetched page
- ``response`` - a :class:`~scrapy.http.Response` object containing the last
fetched page
* ``settings`` - the current :ref:`Scrapy settings <topics-settings>`
- ``settings`` - the current :ref:`Scrapy settings <topics-settings>`
Example of shell session
========================