mirror of https://github.com/scrapy/scrapy.git
synced 2025-02-23 08:03:53 +00:00

commit 0025d5a943 (parent 57f87b95d4)

    Fixed minor grammar issues.
@@ -144,7 +144,7 @@ I get "Filtered offsite request" messages. How can I fix them?
 Those messages (logged with ``DEBUG`` level) don't necessarily mean there is a
 problem, so you may not need to fix them.
 
-Those message are thrown by the Offsite Spider Middleware, which is a spider
+Those messages are thrown by the Offsite Spider Middleware, which is a spider
 middleware (enabled by default) whose purpose is to filter out requests to
 domains outside the ones covered by the spider.
 
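The filtering behaviour this hunk documents can be sketched as a small stand-alone function (a hypothetical simplification for illustration, not Scrapy's actual middleware code; the exact domain-matching rule shown is an assumption):

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    """Return True if the URL's host falls outside the allowed domains.

    Hypothetical simplification of the check the Offsite Spider Middleware
    performs against a spider's ``allowed_domains``.
    """
    host = urlparse(url).hostname or ""
    return not any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )
```

Requests for which a check like this returns ``True`` are what produce the "Filtered offsite request" debug messages.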
@@ -34,7 +34,7 @@ These are some common properties often found in broad crawls:
 
 As said above, Scrapy default settings are optimized for focused crawls, not
 broad crawls. However, due to its asynchronous architecture, Scrapy is very
-well suited for performing fast broad crawls. This page summarize some things
+well suited for performing fast broad crawls. This page summarizes some things
 you need to keep in mind when using Scrapy for doing broad crawls, along with
 concrete suggestions of Scrapy settings to tune in order to achieve an
 efficient broad crawl.
@@ -46,7 +46,7 @@ Concurrency is the number of requests that are processed in parallel. There is
 a global limit and a per-domain limit.
 
 The default global concurrency limit in Scrapy is not suitable for crawling
 many different domains in parallel, so you will want to increase it. How much
 to increase it will depend on how much CPU you crawler will have available. A
 good starting point is ``100``, but the best way to find out is by doing some
 trials and identifying at what concurrency your Scrapy process gets CPU
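The advice in this hunk translates to a settings tweak along these lines (values are illustrative starting points, per the text's suggestion of ``100``; ``CONCURRENT_REQUESTS`` and ``CONCURRENT_REQUESTS_PER_DOMAIN`` are Scrapy's global and per-domain concurrency settings):

```python
# settings.py (sketch) -- starting points only; find the right values by
# trial, watching at what concurrency your process becomes CPU-bound.
CONCURRENT_REQUESTS = 100           # global limit; the text's starting point
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain limit; tune for your crawl
```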
@@ -17,7 +17,7 @@ Extensions use the :ref:`Scrapy settings <topics-settings>` to manage their
 settings, just like any other Scrapy code.
 
 It is customary for extensions to prefix their settings with their own name, to
-avoid collision with existing (and future) extensions. For example, an
+avoid collision with existing (and future) extensions. For example, a
 hypothetic extension to handle `Google Sitemaps`_ would use settings like
 `GOOGLESITEMAP_ENABLED`, `GOOGLESITEMAP_DEPTH`, and so on.
 
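The naming convention could look like this in practice (a hypothetical skeleton: the ``GOOGLESITEMAP_*`` names come from the hunk, but the class, its defaults, and the plain-mapping settings access are illustrative assumptions, not a real Scrapy extension API):

```python
class GoogleSitemapExtension:
    """Hypothetical extension from the text. Every setting it owns carries
    the GOOGLESITEMAP_ prefix, so it cannot collide with settings owned by
    other (existing or future) extensions."""

    def __init__(self, settings):
        # settings is treated as a plain mapping here for illustration
        self.enabled = settings.get("GOOGLESITEMAP_ENABLED", False)
        self.depth = settings.get("GOOGLESITEMAP_DEPTH", 3)  # default made up
```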
@@ -95,7 +95,7 @@ contain a price::
 Write items to a JSON file
 --------------------------
 
-The following pipeline stores all scraped items (from all spiders) into a a
+The following pipeline stores all scraped items (from all spiders) into a
 single ``items.jl`` file, containing one item per line serialized in JSON
 format::
 
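The pipeline this hunk describes can be sketched roughly as follows (a stand-alone approximation: the configurable ``path`` parameter is an addition for illustration, since the docs' version hardcodes ``items.jl``):

```python
import json

class JsonWriterPipeline:
    """Sketch of the pipeline from the text: every scraped item, from every
    spider, goes into one file as a single line of JSON (the .jl format)."""

    def __init__(self, path="items.jl"):
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # one JSON object per line
        self.file.write(json.dumps(dict(item)) + "\n")
        return item  # pass the item on to later pipelines
```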
@@ -61,7 +61,7 @@ the example above.
 You can specify any kind of metadata for each field. There is no restriction on
 the values accepted by :class:`Field` objects. For this same
 reason, there is no reference list of all available metadata keys. Each key
-defined in :class:`Field` objects could be used by a different components, and
+defined in :class:`Field` objects could be used by a different component, and
 only those components know about it. You can also define and use any other
 :class:`Field` key in your project too, for your own needs. The main goal of
 :class:`Field` objects is to provide a way to define all field metadata in one
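Since a :class:`Field` is, at its core, a dict of arbitrary metadata, the freedom this hunk describes can be sketched without Scrapy at all (the ``Field`` stand-in below only mirrors the idea; ``serializer`` and ``authoritative`` are example keys, the latter entirely made up):

```python
class Field(dict):
    """Minimal stand-in for Scrapy's Field: a dict of arbitrary metadata."""

# Any key is accepted; each key only matters to whatever component reads it.
# 'authoritative' is a made-up key that only your own code would look at.
price = Field(serializer=float, authoritative=True)
```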
@@ -97,7 +97,7 @@ subclasses):
 A real example
 --------------
 
-Let's see a concrete example of an hypothetical case of memory leaks.
+Let's see a concrete example of a hypothetical case of memory leaks.
 Suppose we have some spider with a line similar to this one::
 
     return Request("http://www.somenastyspider.com/product.php?pid=%d" % product_id,
@@ -228,7 +228,7 @@ with varying degrees of sophistication. Getting around those measures can be
 difficult and tricky, and may sometimes require special infrastructure. Please
 consider contacting `commercial support`_ if in doubt.
 
-Here are some tips to keep in mind when dealing with these kind of sites:
+Here are some tips to keep in mind when dealing with these kinds of sites:
 
 * rotate your user agent from a pool of well-known ones from browsers (google
   around to get a list of them)
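The first tip could be prototyped as a small downloader-middleware sketch (hypothetical: the pool contents are placeholders rather than a researched browser list, and the bare ``process_request`` hook is a simplification of Scrapy's middleware interface):

```python
import random

# Placeholder pool; in practice, fill it from a researched list of real,
# well-known browser user-agent strings.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

class RotateUserAgentMiddleware:
    def process_request(self, request, spider):
        # pick a different well-known user agent for each outgoing request
        request.headers["User-Agent"] = random.choice(USER_AGENT_POOL)
```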
@@ -579,7 +579,7 @@ Built-in Selectors reference
 is used together with ``text``.
 
 If ``type`` is ``None`` and a ``response`` is passed, the selector type is
-inferred from the response type as follow:
+inferred from the response type as follows:
 
 * ``"html"`` for :class:`~scrapy.http.HtmlResponse` type
 * ``"xml"`` for :class:`~scrapy.http.XmlResponse` type
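The inference rule can be written out as a tiny table lookup (a hypothetical helper for illustration, not Scrapy's API; the fallback for other response classes is an assumption):

```python
def infer_selector_type(response_class_name):
    """Map a response class name to a selector type, per the rule in the hunk."""
    mapping = {
        "HtmlResponse": "html",
        "XmlResponse": "xml",
    }
    # assumed fallback for any other response type
    return mapping.get(response_class_name, "html")
```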
@@ -757,7 +757,7 @@ nodes can be accessed directly by their names::
     <Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
     ...
 
-If you wonder why the namespace removal procedure isn't called always by default
+If you wonder why the namespace removal procedure isn't always called by default
 instead of having to call it manually, this is because of two reasons, which, in order
 of relevance, are:
 