mirror of https://github.com/scrapy/scrapy.git synced 2025-02-24 22:04:16 +00:00

AjaxCrawlableMiddleware in Broad Crawl docs

Mikhail Korobov 2013-12-19 01:01:26 +06:00
parent 84a3a9daac
commit 943a0bd264
2 changed files with 25 additions and 1 deletions


@@ -118,3 +118,26 @@ crawler to dedicate too many resources on any specific domain.
To disable redirects use::

    REDIRECT_ENABLED = False

Enable crawling of "Ajax Crawlable Pages"
=========================================

Some pages (up to 1%) declare themselves as `ajax crawlable`_. This means they
provide a plain HTML version of content that is usually available only via AJAX.
Pages can indicate this in two ways:

1) by using ``#!`` in the URL - this is the default way;
2) by using a special meta tag - this way is used on
   "main", "index" website pages.

Scrapy handles (1) automatically; to handle (2) enable
:ref:`AjaxCrawlableMiddleware <ajaxcrawlable-middleware>`::

    AJAXCRAWLABLE_ENABLED = True

When doing broad crawls it's common to crawl a lot of "index" web pages;
AjaxCrawlableMiddleware helps to crawl them correctly.
It is turned OFF by default because it has some performance overhead,
and enabling it for focused crawls doesn't make much sense.

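
As a rough illustration of the two signals described above, the sketch below
rewrites a ``#!`` URL into the ``_escaped_fragment_`` form used by the AJAX
crawling scheme and checks a page for the "fragment" meta tag. It is not
Scrapy's implementation: ``escape_ajax_url`` and ``declares_ajax_crawlable``
are hypothetical helper names, the regular expression only covers the most
common spelling of the tag, and the fragment is fully percent-encoded for
simplicity.

```python
import re
from urllib.parse import quote

# Assumption: matches only the usual form <meta name="fragment" content="!">.
META_FRAGMENT_RE = re.compile(
    r'<meta\s+name=["\']fragment["\']\s+content=["\']!["\']\s*/?>',
    re.IGNORECASE,
)


def escape_ajax_url(url):
    """Rewrite a '#!' URL into its '_escaped_fragment_' form (hypothetical helper)."""
    base, sep, fragment = url.partition("#!")
    if not sep:
        return url  # no '#!' marker, so the URL is left untouched
    joiner = "&" if "?" in base else "?"
    # Simplification: percent-encode every reserved character in the fragment.
    return f"{base}{joiner}_escaped_fragment_={quote(fragment, safe='')}"


def declares_ajax_crawlable(html):
    """Return True if the page carries the 'fragment' meta tag (case 2)."""
    return META_FRAGMENT_RE.search(html) is not None


if __name__ == "__main__":
    print(escape_ajax_url("http://example.com/page#!key=value"))
    index_html = '<html><head><meta name="fragment" content="!"></head></html>'
    if declares_ajax_crawlable(index_html):
        # A page using the meta tag is requested as if its URL ended in '#!'.
        print(escape_ajax_url("http://example.com/" + "#!"))
```

In this sketch the meta tag case reuses the same rewrite by appending an empty
``#!`` to the page URL, which yields an empty ``_escaped_fragment_`` parameter.
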
.. _ajax crawlable: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started


@@ -795,6 +795,7 @@ UserAgentMiddleware
In order for a spider to override the default user agent, its `user_agent`
attribute must be set.

.. _ajaxcrawlable-middleware:

AjaxCrawlableMiddleware
-----------------------
@@ -823,7 +824,7 @@ AjaxCrawlableMiddleware Settings
AJAXCRAWLABLE_ENABLED
^^^^^^^^^^^^^^^^^^^^^
.. versionadded:: 0.17
.. versionadded:: 0.21
Default: ``False``