.. _topics-practices:

================
Common Practices
================

This section documents common practices when using Scrapy. These are things
that cover many topics and don't often fall into any other specific section.

.. _run-from-script:

Run Scrapy from a script
========================

You can use the :ref:`API <topics-api>` to run Scrapy from a script, instead
of the typical way of running Scrapy via ``scrapy crawl``.

Remember that Scrapy is built on top of the Twisted asynchronous networking
library, so you need to run it inside the Twisted reactor.

Note that you will also have to shut down the Twisted reactor yourself after
the spider is finished. This can be achieved by adding callbacks to the
deferred returned by the :meth:`CrawlerRunner.crawl
<scrapy.crawler.CrawlerRunner.crawl>` method.

What follows is a working example of how to do that, using the `testspiders`_
project.

::

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    configure_logging(settings)
    runner = CrawlerRunner(settings)

    # 'followall' is the name of one of the spiders of the project.
    d = runner.crawl('followall', domain='scrapinghub.com')
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished

Running spiders outside projects is not much different. You have to create a
generic :class:`~scrapy.settings.Settings` object and populate it as needed
(see :ref:`topics-settings-ref` for the available settings), instead of using
the configuration returned by ``get_project_settings``.

Spiders can still be referenced by their name if :setting:`SPIDER_MODULES` is
set with the modules where Scrapy should look for spiders. Otherwise, passing
the spider class as first argument to the :meth:`CrawlerRunner.crawl
<scrapy.crawler.CrawlerRunner.crawl>` method is enough.

::

    from twisted.internet import reactor
    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging


    class MySpider(scrapy.Spider):
        # Your spider definition
        ...


    configure_logging()
    runner = CrawlerRunner({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    d = runner.crawl(MySpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished

.. seealso:: `Twisted Reactor Overview`_.

.. _run-multiple-spiders:

Running multiple spiders in the same process
============================================

By default, Scrapy runs a single spider per process when you run ``scrapy
crawl``. However, Scrapy supports running multiple spiders per process using
the :ref:`internal API <topics-api>`.

Here is an example that runs multiple spiders simultaneously, using the
`testspiders`_ project:

::

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    configure_logging(settings)
    runner = CrawlerRunner(settings)

    dfs = set()
    for domain in ['scrapinghub.com', 'insophia.com']:
        d = runner.crawl('followall', domain=domain)
        dfs.add(d)

    defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until all crawling jobs are finished

Same example but running the spiders sequentially by chaining the deferreds:

::

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    configure_logging(settings)
    runner = CrawlerRunner(settings)

    @defer.inlineCallbacks
    def crawl():
        for domain in ['scrapinghub.com', 'insophia.com']:
            yield runner.crawl('followall', domain=domain)
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until the last crawl call is finished

.. seealso:: :ref:`run-from-script`.

.. _distributed-crawls:

Distributed crawls
==================

Scrapy doesn't provide any built-in facility for running crawls in a
distributed (multi-server) manner. However, there are some ways to distribute
crawls, which vary depending on how you plan to distribute them.

If you have many spiders, the obvious way to distribute the load is to set up
many Scrapyd instances and distribute spider runs among those.
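
For example, a small scheduling script can distribute spider runs round-robin
over a pool of Scrapyd instances through Scrapyd's ``schedule.json`` endpoint
(the same endpoint the ``curl`` commands further below use). This is only a
sketch, not something Scrapy ships: the host names, the spider names and the
use of the third-party ``requests`` library are assumptions.

::

    import requests

    # Hypothetical pool of Scrapyd hosts; adjust to your own deployment.
    SCRAPYD_HOSTS = [
        'http://scrapy1.mycompany.com:6800',
        'http://scrapy2.mycompany.com:6800',
        'http://scrapy3.mycompany.com:6800',
    ]

    # Hypothetical spiders already deployed under the 'myproject' project.
    SPIDERS = ['spider1', 'spider2', 'spider3', 'spider4', 'spider5']

    # Schedule each spider on the next Scrapyd instance, round-robin.
    for i, spider in enumerate(SPIDERS):
        host = SCRAPYD_HOSTS[i % len(SCRAPYD_HOSTS)]
        response = requests.post(
            '%s/schedule.json' % host,
            data={'project': 'myproject', 'spider': spider},
        )
        print(response.json())  # e.g. {"status": "ok", "jobid": "..."}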

If you instead want to run a single (big) spider through many machines, what
you usually do is partition the URLs to crawl and send them to each separate
spider. Here is a concrete example:

First, you prepare the list of URLs to crawl and put them into separate
files/URLs::

    http://somedomain.com/urls-to-crawl/spider1/part1.list
    http://somedomain.com/urls-to-crawl/spider1/part2.list
    http://somedomain.com/urls-to-crawl/spider1/part3.list

Then you fire a spider run on 3 different Scrapyd servers. The spider would
receive a (spider) argument ``part`` with the number of the partition to
crawl::

    curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
    curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
    curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
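
The spider itself then only needs to turn the ``part`` argument into its slice
of start URLs. A minimal sketch of such a spider follows; the spider name,
callback names and URL pattern are assumptions matching the file layout above,
not something Scrapy prescribes.

::

    import scrapy


    class PartitionedSpider(scrapy.Spider):
        # Hypothetical name; it matches the 'spider1' used above only for
        # illustration.
        name = 'spider1'

        def start_requests(self):
            # Scrapy turns the 'part' spider argument into self.part (a
            # string, e.g. '1'). Fetch this partition's URL list first.
            list_url = ('http://somedomain.com/urls-to-crawl/spider1/part%s.list'
                        % self.part)
            yield scrapy.Request(list_url, callback=self.parse_url_list)

        def parse_url_list(self, response):
            # Crawl every non-empty line of the fetched list.
            for url in response.text.splitlines():
                if url.strip():
                    yield scrapy.Request(url.strip(), callback=self.parse)

        def parse(self, response):
            # Your usual parsing logic goes here.
            pass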

.. _bans:

Avoiding getting banned
=======================

Some websites implement certain measures to prevent bots from crawling them,
with varying degrees of sophistication. Getting around those measures can be
difficult and tricky, and may sometimes require special infrastructure. Please
consider contacting `commercial support`_ if in doubt.

Here are some tips to keep in mind when dealing with these kinds of sites:

* rotate your user agent from a pool of well-known ones from browsers (google
  around to get a list of them); see the sketch after this list
* disable cookies (see :setting:`COOKIES_ENABLED`) as some sites may use
  cookies to spot bot behaviour
* use download delays (2 or higher). See :setting:`DOWNLOAD_DELAY` setting.
* if possible, use `Google cache`_ to fetch pages, instead of hitting the sites
  directly
* use a pool of rotating IPs. For example, the free `Tor project`_ or paid
  services like `ProxyMesh`_
* use a highly distributed downloader that circumvents bans internally, so you
  can just focus on parsing clean pages. One example of such downloaders is
  `Crawlera`_
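
The first three tips can be wired up with a few settings plus a small
downloader middleware that rotates the ``User-Agent`` header. What follows is
only a sketch: the ``myproject.middlewares`` module path and the hand-picked
``USER_AGENTS`` list are assumptions (the truncated agent strings are
placeholders), so adapt names and values to your own project.

::

    # settings.py (excerpt)
    COOKIES_ENABLED = False   # some sites use cookies to spot bot behaviour
    DOWNLOAD_DELAY = 2        # wait at least 2 seconds between requests
    DOWNLOADER_MIDDLEWARES = {
        # Hypothetical module path; the middleware is defined below. Order 400
        # puts it before the built-in UserAgentMiddleware, which only fills in
        # the header when it is missing.
        'myproject.middlewares.RotateUserAgentMiddleware': 400,
    }

    # myproject/middlewares.py (excerpt)
    import random

    # Hypothetical pool of well-known browser user agents; fill in your own.
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 ...',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/600.1.4 ...',
    ]

    class RotateUserAgentMiddleware(object):
        """Downloader middleware that picks a random user agent per request."""

        def process_request(self, request, spider):
            request.headers['User-Agent'] = random.choice(USER_AGENTS)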

If you are still unable to prevent your bot from getting banned, consider
contacting `commercial support`_.

.. _Tor project: https://www.torproject.org/
.. _commercial support: http://scrapy.org/support/
.. _ProxyMesh: http://proxymesh.com/
.. _Google cache: http://www.googleguide.com/cached_pages.html
.. _testspiders: https://github.com/scrapinghub/testspiders
.. _Twisted Reactor Overview: http://twistedmatrix.com/documents/current/core/howto/reactor-basics.html
.. _Crawlera: http://crawlera.com