diff --git a/docs/topics/broad-crawls.rst b/docs/topics/broad-crawls.rst
index 189620a0b..3264ebe23 100644
--- a/docs/topics/broad-crawls.rst
+++ b/docs/topics/broad-crawls.rst
@@ -118,3 +118,26 @@ crawler to dedicate too many resources on any specific domain.
 To disable redirects use::
 
     REDIRECT_ENABLED = False
+
+Enable crawling of "Ajax Crawlable Pages"
+=========================================
+
+Some pages (up to 1%) declare themselves as `ajax crawlable`_. This means they
+provide a plain HTML version of content that is usually available only via
+AJAX. Pages can indicate this in two ways:
+
+1) by using ``#!`` in the URL - this is the default way;
+2) by using a special meta tag - this way is used on
+   "main", "index" website pages.
+
+Scrapy handles (1) automatically; to handle (2) enable
+:ref:`AjaxCrawlableMiddleware <ajaxcrawlable-middleware>`::
+
+    AJAXCRAWLABLE_ENABLED = True
+
+When doing broad crawls it's common to crawl a lot of "index" web pages;
+AjaxCrawlableMiddleware helps to crawl them correctly.
+It is turned OFF by default because it has some performance overhead,
+and enabling it for focused crawls doesn't make much sense.
+
+.. _ajax crawlable: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
diff --git a/docs/topics/downloader-middleware.rst b/docs/topics/downloader-middleware.rst
index 541cff31c..82c01d1e6 100644
--- a/docs/topics/downloader-middleware.rst
+++ b/docs/topics/downloader-middleware.rst
@@ -795,6 +795,7 @@ UserAgentMiddleware
 In order for a spider to override the default user agent, its `user_agent`
 attribute must be set.
 
+.. _ajaxcrawlable-middleware:
 
 AjaxCrawlableMiddleware
 -----------------------
@@ -823,7 +824,7 @@ AjaxCrawlableMiddleware Settings
 AJAXCRAWLABLE_ENABLED
 ^^^^^^^^^^^^^^^^^^^^^
 
-.. versionadded:: 0.17
+.. versionadded:: 0.21
 
 Default: ``False``
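The ``#!`` convention the new section describes comes from the Google AJAX crawling scheme: a crawler maps a hash-bang URL to an ``_escaped_fragment_`` query parameter and fetches that URL to get the plain-HTML snapshot. A minimal sketch of that mapping (illustrative only; the helper name ``to_escaped_fragment`` is not part of Scrapy's API):

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_escaped_fragment(url):
    """Rewrite a #! (hash-bang) URL into its _escaped_fragment_ form,
    as defined by the Google AJAX crawling scheme."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        # Not declared AJAX-crawlable; leave the URL unchanged.
        return url
    state = parts.fragment[1:]
    extra = "_escaped_fragment_=" + quote(state, safe="")
    query = parts.query + "&" + extra if parts.query else extra
    # Drop the fragment; the state now travels in the query string.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(to_escaped_fragment("http://example.com/page#!key=value"))
# -> http://example.com/page?_escaped_fragment_=key%3Dvalue
```

This only covers case (1); for case (2), the meta-tag form, the middleware must download the page first to see the ``<meta name="fragment" content="!">`` declaration, which is where the performance overhead mentioned above comes from.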