AjaxCrawlableMiddleware in Broad Crawl docs

2025-02-24 22:04:16 +00:00 · 2013-12-19 01:01:26 +06:00 · 2013-12-19 01:01:26 +06:00 · 943a0bd264
commit 943a0bd264
parent 84a3a9daac
2 changed files with 25 additions and 1 deletions
--- a/docs/topics/broad-crawls.rst
+++ b/docs/topics/broad-crawls.rst
@@ -118,3 +118,26 @@ crawler to dedicate too many resources on any specific domain.
 To disable redirects use::

    REDIRECT_ENABLED = False
+
+Enable crawling of "Ajax Crawlable Pages"
+=========================================
+
+Some pages (up to 1%) declare themselves as `ajax crawlable`_. This means they
+provide a plain HTML version of content that is usually available only via AJAX.
+Pages can indicate this in two ways:
+
+1) by using ``#!`` in the URL - this is the default way;
+2) by using a special meta tag - this way is used on
+   "main" and "index" website pages.
+
+Scrapy handles (1) automatically; to handle (2) enable
+:ref:`AjaxCrawlableMiddleware <ajaxcrawlable-middleware>`::
+
+    AJAXCRAWLABLE_ENABLED = True
+
+When doing broad crawls it's common to crawl a lot of "index" web pages;
+AjaxCrawlableMiddleware helps to crawl them correctly.
+It is turned OFF by default because it has some performance overhead,
+and enabling it for focused crawls doesn't make much sense.
+
+.. _ajax crawlable: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
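
The two ways a page can declare itself ajax crawlable, as described in the added section above, can be illustrated with a minimal standalone sketch. It is not Scrapy's implementation: the helper names ``escaped_fragment_url`` and ``declares_ajax_crawlable`` are hypothetical, the meta tag form ``<meta name="fragment" content="!">`` and the ``_escaped_fragment_`` query parameter are taken from the linked AJAX crawling scheme, and percent-encoding the whole fragment is a simplifying assumption.

```python
import re
from urllib.parse import quote

# Hypothetical helpers for illustration only; not part of Scrapy's API.

# Assumed meta tag form: <meta name="fragment" content="!">
_META_FRAGMENT_RE = re.compile(
    r"""<meta\s+name=["']fragment["']\s+content=["']!["']""",
    re.IGNORECASE,
)


def escaped_fragment_url(url):
    """Build the URL a crawler would request for the plain HTML version.

    Way (1): ``#!state`` is moved into an ``_escaped_fragment_`` parameter.
    Way (2): for a URL without ``#!`` (the page used the meta tag instead),
    an empty ``_escaped_fragment_=`` parameter is appended.
    """
    base, sep, fragment = url.partition("#!")
    if not sep:
        # No "#!" present: drop any ordinary fragment and send an empty value.
        base = url.partition("#")[0]
        fragment = ""
    joiner = "&" if "?" in base else "?"
    # Simplification: percent-encode every reserved character in the fragment.
    return base + joiner + "_escaped_fragment_=" + quote(fragment, safe="")


def declares_ajax_crawlable(html):
    """Return True if the HTML contains the assumed ``fragment`` meta tag."""
    return _META_FRAGMENT_RE.search(html) is not None
```

A crawler following this sketch would rewrite ``#!`` URLs unconditionally, and would call ``escaped_fragment_url`` on a plain URL only after ``declares_ajax_crawlable`` returns true for the downloaded page. That second download is the kind of extra work that explains why the middleware is off by default.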
--- a/docs/topics/downloader-middleware.rst
+++ b/docs/topics/downloader-middleware.rst
@@ -795,6 +795,7 @@ UserAgentMiddleware
   In order for a spider to override the default user agent, its `user_agent`
   attribute must be set.

+.. _ajaxcrawlable-middleware:

 AjaxCrawlableMiddleware
 -----------------------
@@ -823,7 +824,7 @@ AjaxCrawlableMiddleware Settings
 AJAXCRAWLABLE_ENABLED
 ^^^^^^^^^^^^^^^^^^^^^

-.. versionadded:: 0.17
+.. versionadded:: 0.21

 Default: ``False``