commit 0025d5a943
parent 57f87b95d4

    Fixed minor grammar issues.
@@ -144,7 +144,7 @@ I get "Filtered offsite request" messages. How can I fix them?
 Those messages (logged with ``DEBUG`` level) don't necessarily mean there is a
 problem, so you may not need to fix them.
 
-Those message are thrown by the Offsite Spider Middleware, which is a spider
+Those messages are thrown by the Offsite Spider Middleware, which is a spider
 middleware (enabled by default) whose purpose is to filter out requests to
 domains outside the ones covered by the spider.
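For context, a minimal sketch of how this filtering plays out (the spider name and domains here are hypothetical): requests outside ``allowed_domains`` are dropped unless explicitly exempted::

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        # The offsite middleware only lets through requests to these domains.
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # Dropped with a "Filtered offsite request" DEBUG message:
            yield scrapy.Request("http://elsewhere.example.org/page")
            # dont_filter=True exempts a single request from the filter:
            yield scrapy.Request("http://elsewhere.example.org/page",
                                 dont_filter=True)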
@@ -34,7 +34,7 @@ These are some common properties often found in broad crawls:
 
 As said above, Scrapy default settings are optimized for focused crawls, not
 broad crawls. However, due to its asynchronous architecture, Scrapy is very
-well suited for performing fast broad crawls. This page summarize some things
+well suited for performing fast broad crawls. This page summarizes some things
 you need to keep in mind when using Scrapy for doing broad crawls, along with
 concrete suggestions of Scrapy settings to tune in order to achieve an
 efficient broad crawl.
@@ -46,7 +46,7 @@ Concurrency is the number of requests that are processed in parallel. There is
 a global limit and a per-domain limit.
 
 The default global concurrency limit in Scrapy is not suitable for crawling
-many different domains in parallel, so you will want to increase it. How much
+many different domains in parallel, so you will want to increase it. How much
 to increase it will depend on how much CPU your crawler will have available. A
 good starting point is ``100``, but the best way to find out is by doing some
 trials and identifying at what concurrency your Scrapy process gets CPU
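As a hedged illustration of the tuning described here, the global limit can be raised in the project's ``settings.py`` (the values are starting points to measure against, not recommendations)::

    # settings.py
    # Starting point for a broad crawl; run trials and watch CPU usage.
    CONCURRENT_REQUESTS = 100
    # The per-domain limit matters less when crawling many different domains.
    CONCURRENT_REQUESTS_PER_DOMAIN = 8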
@@ -17,7 +17,7 @@ Extensions use the :ref:`Scrapy settings <topics-settings>` to manage their
 settings, just like any other Scrapy code.
 
 It is customary for extensions to prefix their settings with their own name, to
-avoid collision with existing (and future) extensions. For example, an
+avoid collision with existing (and future) extensions. For example, a
 hypothetical extension to handle `Google Sitemaps`_ would use settings like
 `GOOGLESITEMAP_ENABLED`, `GOOGLESITEMAP_DEPTH`, and so on.
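A brief sketch of that convention (the extension class is hypothetical; only the setting names come from the text above)::

    from scrapy.exceptions import NotConfigured

    class GoogleSitemapExtension(object):
        """Hypothetical extension reading its own prefixed settings."""

        def __init__(self, depth):
            self.depth = depth

        @classmethod
        def from_crawler(cls, crawler):
            # The GOOGLESITEMAP_ prefix keeps these keys from colliding
            # with the settings of other extensions.
            if not crawler.settings.getbool("GOOGLESITEMAP_ENABLED"):
                raise NotConfigured
            return cls(crawler.settings.getint("GOOGLESITEMAP_DEPTH", 3))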
@@ -143,7 +143,7 @@ Here is the code of such extension::
             self.items_scraped += 1
             if self.items_scraped % self.item_count == 0:
                 logger.info("scraped %d items", self.items_scraped)
 
 
 .. _topics-extensions-ref:
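The hunk only shows a fragment of that extension's item-counting handler. For completeness, an extension like it would typically be enabled through the ``EXTENSIONS`` setting (the module path below is hypothetical)::

    # settings.py
    EXTENSIONS = {
        "myproject.extensions.SpiderOpenCloseLogging": 500,
    }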
@@ -95,7 +95,7 @@ contain a price::
 Write items to a JSON file
 --------------------------
 
-The following pipeline stores all scraped items (from all spiders) into a a
+The following pipeline stores all scraped items (from all spiders) into a
 single ``items.jl`` file, containing one item per line serialized in JSON
 format::
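The pipeline itself sits just below this hunk; a hedged reconstruction of what such a pipeline looks like (close to, but not guaranteed to match, the documentation's actual example)::

    import json

    class JsonWriterPipeline(object):

        def open_spider(self, spider):
            # One shared file for all spiders, one JSON object per line.
            self.file = open("items.jl", "w")

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            self.file.write(json.dumps(dict(item)) + "\n")
            return item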
@@ -61,7 +61,7 @@ the example above.
 You can specify any kind of metadata for each field. There is no restriction on
 the values accepted by :class:`Field` objects. For this same
 reason, there is no reference list of all available metadata keys. Each key
-defined in :class:`Field` objects could be used by a different components, and
+defined in :class:`Field` objects could be used by a different component, and
 only those components know about it. You can also define and use any other
 :class:`Field` key in your project too, for your own needs. The main goal of
 :class:`Field` objects is to provide a way to define all field metadata in one
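For illustration, field metadata is just arbitrary keyword arguments on :class:`Field`; for example, the built-in item exporters consume a ``serializer`` key (the item and serializer below are hypothetical)::

    import scrapy

    def serialize_price(value):
        # A component (here: an item exporter) reads this metadata and
        # applies it when serializing the field.
        return "$ %s" % str(value)

    class Product(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field(serializer=serialize_price)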
@@ -97,7 +97,7 @@ subclasses):
 A real example
 --------------
 
-Let's see a concrete example of an hypothetical case of memory leaks.
+Let's see a concrete example of a hypothetical case of memory leaks.
 Suppose we have some spider with a line similar to this one::
 
     return Request("http://www.somenastyspider.com/product.php?pid=%d" % product_id,
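The request line is cut off mid-call by the hunk. As a hedged sketch of how such a line can leak memory (the ``meta`` usage and callback name are illustrative, not the documentation's exact code), keeping a reference to the response pins it in memory for as long as the request sits in the scheduler::

    def parse(self, response):
        product_id = 123  # illustrative value
        return scrapy.Request(
            "http://www.somenastyspider.com/product.php?pid=%d" % product_id,
            callback=self.parse_product,  # hypothetical callback
            # Referencing `response` here keeps the whole response object
            # (including its body) alive until this request is processed.
            meta={"referer": response},
        )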
@@ -228,7 +228,7 @@ with varying degrees of sophistication. Getting around those measures can be
 difficult and tricky, and may sometimes require special infrastructure. Please
 consider contacting `commercial support`_ if in doubt.
 
-Here are some tips to keep in mind when dealing with these kind of sites:
+Here are some tips to keep in mind when dealing with these kinds of sites:
 
 * rotate your user agent from a pool of well-known ones from browsers (google
   around to get a list of them)
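A minimal sketch of that first tip, assuming a hypothetical downloader middleware and placeholder user-agent strings::

    import random

    class RotateUserAgentMiddleware(object):
        """Hypothetical downloader middleware picking a random user agent."""

        # Fill this pool with real browser user-agent strings.
        user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; ...) ...",
            "Mozilla/5.0 (Macintosh; ...) ...",
        ]

        def process_request(self, request, spider):
            request.headers["User-Agent"] = random.choice(self.user_agents)

It would then be enabled through the ``DOWNLOADER_MIDDLEWARES`` setting of the project.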
@@ -579,7 +579,7 @@ Built-in Selectors reference
 is used together with ``text``.
 
 If ``type`` is ``None`` and a ``response`` is passed, the selector type is
-inferred from the response type as follow:
+inferred from the response type as follows:
 
 * ``"html"`` for :class:`~scrapy.http.HtmlResponse` type
 * ``"xml"`` for :class:`~scrapy.http.XmlResponse` type
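For example (a small sketch; the responses are constructed by hand here purely to show the inference)::

    from scrapy.http import HtmlResponse, XmlResponse
    from scrapy.selector import Selector

    html = HtmlResponse(url="http://example.com", body=b"<html><body/></html>")
    xml = XmlResponse(url="http://example.com", body=b"<root/>")

    Selector(response=html).type  # "html", inferred from HtmlResponse
    Selector(response=xml).type   # "xml", inferred from XmlResponse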
@@ -757,7 +757,7 @@ nodes can be accessed directly by their names::
     <Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
     ...
 
-If you wonder why the namespace removal procedure isn't called always by default
+If you wonder why the namespace removal procedure isn't always called by default
 instead of having to call it manually, this is because of two reasons, which, in order
 of relevance, are: