mirror of https://github.com/scrapy/scrapy.git, synced 2025-02-24 22:04:16 +00:00
AjaxCrawlableMiddleware in Broad Crawl docs
This commit is contained in:
parent 84a3a9daac
commit 943a0bd264

@@ -118,3 +118,26 @@ crawler to dedicate too many resources on any specific domain.
 To disable redirects use::
 
     REDIRECT_ENABLED = False
+
+Enable crawling of "Ajax Crawlable Pages"
+=========================================
+
+Some pages (up to 1%) declare themselves as `ajax crawlable`_. This means they
+provide a plain HTML version of content that is usually available only via AJAX.
+Pages can indicate it in two ways:
+
+1) by using ``#!`` in the URL - this is the default way;
+2) by using a special meta tag - this way is used on
+   "main", "index" website pages.
+
+Scrapy handles (1) automatically; to handle (2) enable
+:ref:`AjaxCrawlableMiddleware <ajaxcrawlable-middleware>`::
+
+    AJAXCRAWLABLE_ENABLED = True
+
+When doing broad crawls it's common to crawl a lot of "index" web pages;
+AjaxCrawlableMiddleware helps to crawl them correctly.
+It is turned OFF by default because it has some performance overhead,
+and enabling it for focused crawls doesn't make much sense.
+
+.. _ajax crawlable: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
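
For illustration only (this is not part of the commit, and not Scrapy's implementation; the helper names are invented): a small Python sketch of the two "ajax crawlable" signals described above, and of the ``_escaped_fragment_`` URL that the linked AJAX crawling scheme maps ``#!`` URLs to::

    import re

    # Signal (2): "index" pages can advertise an HTML snapshot with
    # <meta name="fragment" content="!"> (attribute order is assumed
    # fixed here to keep the sketch short).
    META_FRAGMENT_RE = re.compile(
        r'<meta[^>]+name=["\']fragment["\'][^>]+content=["\']!["\']',
        re.IGNORECASE)

    def is_ajax_crawlable(url, html=""):
        # Signal (1) is the ``#!`` marker in the URL; signal (2) is the meta tag.
        return "#!" in url or META_FRAGMENT_RE.search(html) is not None

    def escaped_fragment_url(url):
        # Map "http://example.com/#!key=value" to the plain-HTML equivalent
        # "http://example.com/?_escaped_fragment_=key=value" (the scheme also
        # URL-encodes the fragment; that step is skipped in this sketch).
        if "#!" not in url:
            return url
        base, _, fragment = url.partition("#!")
        separator = "&" if "?" in base else "?"
        return base + separator + "_escaped_fragment_=" + fragment
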
@@ -795,6 +795,7 @@ UserAgentMiddleware
 In order for a spider to override the default user agent, its `user_agent`
 attribute must be set.
 
+.. _ajaxcrawlable-middleware:
 
 AjaxCrawlableMiddleware
 -----------------------
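
As a side note on the context lines above: a minimal sketch of a spider overriding the default user agent through its ``user_agent`` attribute (the spider name, URL and agent string are made up)::

    import scrapy

    class ExampleSpider(scrapy.Spider):
        # Hypothetical spider: UserAgentMiddleware will use this attribute
        # for the spider's requests instead of the global USER_AGENT setting.
        name = "example"
        user_agent = "my-broad-crawl-bot/1.0 (+http://www.example.com/bot)"
        start_urls = ["http://www.example.com/"]

        def parse(self, response):
            # Placeholder callback for the sketch.
            self.logger.info("Fetched %s", response.url)
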
@@ -823,7 +824,7 @@ AjaxCrawlableMiddleware Settings
 AJAXCRAWLABLE_ENABLED
 ^^^^^^^^^^^^^^^^^^^^^
 
-.. versionadded:: 0.17
+.. versionadded:: 0.21
 
 Default: ``False``
 