.. _faq:

Frequently Asked Questions
==========================

How does Scrapy compare to BeautifulSoup or lxml?
-------------------------------------------------

`BeautifulSoup`_ and `lxml`_ are libraries for parsing HTML and XML. Scrapy is
an application framework for writing web spiders that crawl web sites and
extract data from them.

Scrapy provides a built-in mechanism for extracting data (called
:ref:`selectors <topics-selectors>`) but you can easily use `BeautifulSoup`_
(or `lxml`_) instead, if you feel more comfortable working with them. After
all, they're just parsing libraries which can be imported and used from any
Python code.
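
For example, here's a minimal sketch (the spider name, URL and fields are just
placeholders) of a callback that parses the response with `BeautifulSoup`_
instead of Scrapy selectors::

    from scrapy.spider import BaseSpider
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

    class SoupSpider(BaseSpider):
        name = 'soupspider'
        start_urls = ['http://www.example.com/']

        def parse(self, response):
            # feed the raw body to BeautifulSoup instead of using selectors
            soup = BeautifulSoup(response.body)
            for link in soup.findAll('a'):
                self.log('found link: %s' % link.get('href'))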

In other words, comparing `BeautifulSoup`_ (or `lxml`_) to Scrapy is like
comparing `jinja2`_ to `Django`_.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _lxml: http://codespeak.net/lxml/
.. _jinja2: http://jinja.pocoo.org/2/
.. _Django: http://www.djangoproject.com

.. _faq-python-versions:

What Python versions does Scrapy support?
-----------------------------------------

Scrapy runs in Python 2.6 and 2.7.

Does Scrapy work with Python 3.0?
---------------------------------

No, and there are no plans to port Scrapy to Python 3.0 yet. At the moment,
Scrapy works with Python 2.6 and 2.7.

.. seealso:: :ref:`faq-python-versions`.

Did Scrapy "steal" X from Django?
---------------------------------

Probably, but we don't like that word. We think Django_ is a great open source
project and an example to follow, so we've used it as an inspiration for
Scrapy.

We believe that, if something is already done well, there's no need to reinvent
it. This concept, besides being one of the foundations for open source and free
software, not only applies to software but also to documentation, procedures,
policies, etc. So, instead of going through each problem ourselves, we choose
to copy ideas from those projects that have already solved them properly, and
focus on the real problems we need to solve.

We'd be proud if Scrapy serves as an inspiration for other projects. Feel free
to steal from us!

Does Scrapy work with HTTP proxies?
-----------------------------------

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP
Proxy downloader middleware. See
:class:`~scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware`.
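
For example, assuming the middleware is enabled, a proxy can be set for a
single request through the ``proxy`` key of the request meta (the proxy
address below is a placeholder)::

    from scrapy.http import Request

    # route this request through an HTTP proxy
    request = Request('http://www.example.com',
                      meta={'proxy': 'http://localhost:8123'})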

Scrapy crashes with: ImportError: No module named win32api
----------------------------------------------------------

You need to install `pywin32`_ because of `this Twisted bug`_.

.. _pywin32: http://sourceforge.net/projects/pywin32/
.. _this Twisted bug: http://twistedmatrix.com/trac/ticket/3707

How can I simulate a user login in my spider?
---------------------------------------------

See :ref:`topics-request-response-ref-request-userlogin`.
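
In short, it usually comes down to submitting the login form with
:class:`~scrapy.http.FormRequest`. A rough sketch, assuming a typical login
form (the URL, field names and failure string are site-specific)::

    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest

    class LoginSpider(BaseSpider):
        name = 'loginspider'
        start_urls = ['http://www.example.com/users/login.php']

        def parse(self, response):
            # fill and submit the login form found in this page
            return [FormRequest.from_response(response,
                        formdata={'username': 'john', 'password': 'secret'},
                        callback=self.after_login)]

        def after_login(self, response):
            # check that the login succeeded before going on
            if "authentication failed" in response.body:
                self.log("Login failed")
                return
            # continue scraping with an authenticated session...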

Does Scrapy crawl in breadth-first or depth-first order?
--------------------------------------------------------

By default, Scrapy uses a `LIFO`_ queue for storing pending requests, which
basically means that it crawls in `DFO order`_. This order is more convenient
in most cases. If you do want to crawl in true `BFO order`_, you can do it by
setting the following settings::

    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

My Scrapy crawler has memory leaks. What can I do?
--------------------------------------------------

See :ref:`topics-leaks`.

Also, Python has a built-in memory leak issue which is described in
:ref:`topics-leaks-without-leaks`.

How can I make Scrapy consume less memory?
------------------------------------------

See previous question.

Can I use Basic HTTP Authentication in my spiders?
--------------------------------------------------

Yes, see :class:`~scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware`.
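
A quick sketch, assuming the middleware is enabled (the credentials are
placeholders)::

    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):
        name = 'myspider'
        # picked up by HttpAuthMiddleware for every request of this spider
        http_user = 'someuser'
        http_pass = 'somepass'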

Why does Scrapy download pages in English instead of my native language?
------------------------------------------------------------------------

Try changing the default `Accept-Language`_ request header by overriding the
:setting:`DEFAULT_REQUEST_HEADERS` setting.
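
For example, to prefer Spanish content (the language code is just an
illustration)::

    # in settings.py
    DEFAULT_REQUEST_HEADERS = {
        'Accept-Language': 'es',
    }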

.. _Accept-Language: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4

Where can I find some example Scrapy projects?
----------------------------------------------

See :ref:`intro-examples`.

Can I run a spider without creating a project?
----------------------------------------------

Yes. You can use the :command:`runspider` command. For example, if you have a
spider written in a ``my_spider.py`` file you can run it with::

    scrapy runspider my_spider.py
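
where ``my_spider.py`` contains a self-contained spider; a minimal sketch
(the name and URL are placeholders)::

    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):
        name = 'my_spider'
        start_urls = ['http://www.example.com']

        def parse(self, response):
            self.log('visited: %s' % response.url)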

See the :command:`runspider` command for more info.

I get "Filtered offsite request" messages. How can I fix them?
--------------------------------------------------------------

Those messages (logged with ``DEBUG`` level) don't necessarily mean there is a
problem, so you may not need to fix them.

Those messages are thrown by the Offsite Spider Middleware, which is a spider
middleware (enabled by default) whose purpose is to filter out requests to
domains outside the ones covered by the spider.
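
The domains considered "on-site" are taken from the spider's
``allowed_domains`` attribute; for example (the domains are illustrative)::

    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):
        name = 'myspider'
        # requests to any other domain will be filtered out (and logged)
        allowed_domains = ['example.com', 'example.org']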

For more info see:
:class:`~scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware`.

What is the recommended way to deploy a Scrapy crawler in production?
---------------------------------------------------------------------

See :ref:`topics-scrapyd`.

Can I use JSON for large exports?
---------------------------------

It'll depend on how large your output is. See :ref:`this warning
<json-with-large-data>` in the :class:`~scrapy.contrib.exporter.JsonItemExporter`
documentation.

Can I return (Twisted) deferreds from signal handlers?
------------------------------------------------------

Some signals support returning deferreds from their handlers, others don't. See
the :ref:`topics-signals-ref` to find out which ones.
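
For example, a sketch of a handler for the ``spider_closed`` signal (which
does support deferreds) that delays shutdown until some asynchronous cleanup
finishes::

    from twisted.internet import defer, reactor
    from scrapy.xlib.pydispatch import dispatcher
    from scrapy import signals

    def spider_closed(spider):
        # the crawler waits for this deferred to fire before finishing
        # the shutdown sequence
        d = defer.Deferred()
        reactor.callLater(2, d.callback, None)  # stands in for real cleanup
        return d

    dispatcher.connect(spider_closed, signal=signals.spider_closed)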

What does the response status code 999 mean?
--------------------------------------------

999 is a custom response status code used by Yahoo sites to throttle requests.
Try slowing down the crawling speed by using a download delay of ``2`` (or
higher) in your spider::

    class MySpider(CrawlSpider):

        name = 'myspider'

        DOWNLOAD_DELAY = 2

        # [ ... rest of the spider code ... ]

Or by setting a global download delay in your project with the
:setting:`DOWNLOAD_DELAY` setting.

Can I call ``pdb.set_trace()`` from my spiders to debug them?
-------------------------------------------------------------

Yes, but you can also use the Scrapy shell which allows you to quickly analyze
(and even modify) the response being processed by your spider, which is, quite
often, more useful than plain old ``pdb.set_trace()``.
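
For example, a sketch of dropping into the shell from a callback when a page
doesn't look as expected (the URL check is just an illustration)::

    def parse(self, response):
        if response.url == 'http://www.example.com/products.php':
            from scrapy.shell import inspect_response
            inspect_response(response)
        # ... your parsing code ...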

For more info see :ref:`topics-shell-inspect-response`.

Simplest way to dump all my scraped items into a JSON/CSV/XML file?
-------------------------------------------------------------------

To dump into a JSON file::

    scrapy crawl myspider -o items.json -t json

To dump into a CSV file::

    scrapy crawl myspider -o items.csv -t csv

To dump into an XML file::

    scrapy crawl myspider -o items.xml -t xml

For more information see :ref:`topics-feed-exports`.

What's this huge cryptic ``__VIEWSTATE`` parameter used in some forms?
----------------------------------------------------------------------

The ``__VIEWSTATE`` parameter is used in sites built with ASP.NET/VB.NET. For
more info on how it works see `this page`_. Also, here's an `example spider`_
which scrapes one of these sites.
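
When submitting forms on such sites, ``FormRequest.from_response`` usually
suffices, since it pre-populates hidden fields like ``__VIEWSTATE``
automatically; a sketch (the visible field name and callback are
hypothetical)::

    from scrapy.http import FormRequest

    def parse(self, response):
        # hidden fields such as __VIEWSTATE are carried over for us;
        # only the visible fields need to be filled in explicitly
        return FormRequest.from_response(response,
                   formdata={'txtSearch': 'foo'},
                   callback=self.parse_results)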

.. _this page: http://search.cpan.org/~ecarroll/HTML-TreeBuilderX-ASP_NET-0.09/lib/HTML/TreeBuilderX/ASP_NET.pm
.. _example spider: http://github.com/AmbientLighter/rpn-fas/blob/master/fas/spiders/rnp.py

What's the best way to parse big XML/CSV data feeds?
----------------------------------------------------

Parsing big feeds with XPath selectors can be problematic, since they need to
build the DOM of the entire feed in memory, which can be quite slow and
consume a lot of memory.

In order to avoid parsing the entire feed at once in memory, you can use the
functions ``xmliter`` and ``csviter`` from the ``scrapy.utils.iterators``
module. In fact, this is what the feed spiders (see :ref:`topics-spiders`) use
under the covers.
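
For example, a sketch of iterating over an XML feed one node at a time (the
node name and field are hypothetical)::

    from scrapy.utils.iterators import xmliter

    def parse(self, response):
        # yields one selector per <product> node, without building the
        # DOM of the whole feed in memory
        for node in xmliter(response, 'product'):
            name = node.select('name/text()').extract()
            self.log('product name: %s' % name)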

Does Scrapy manage cookies automatically?
-----------------------------------------

Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them
back on subsequent requests, like any regular web browser does.

For more info see :ref:`topics-request-response` and :ref:`cookies-mw`.

How can I see the cookies being sent and received from Scrapy?
--------------------------------------------------------------

Enable the :setting:`COOKIES_DEBUG` setting.

How can I instruct a spider to stop itself?
-------------------------------------------

Raise the :exc:`~scrapy.exceptions.CloseSpider` exception from a callback. For
more info see: :exc:`~scrapy.exceptions.CloseSpider`.
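
A quick sketch (the condition is just an illustration; the reason string is
arbitrary)::

    from scrapy.exceptions import CloseSpider

    def parse(self, response):
        # stop the whole crawl from a callback
        if 'Bandwidth exceeded' in response.body:
            raise CloseSpider('bandwidth_exceeded')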

How can I prevent my Scrapy bot from getting banned?
----------------------------------------------------

Some websites implement certain measures to prevent bots from crawling them,
with varying degrees of sophistication. Getting around those measures can be
difficult and tricky, and may sometimes require special infrastructure.

Here are some tips to keep in mind when dealing with these kinds of sites:

* rotate your user agent from a pool of well-known ones from browsers (google
  around to get a list of them)
* disable cookies (see :setting:`COOKIES_ENABLED`) as some sites may use
  cookies to spot bot behaviour
* use download delays (2 or higher), as in the sketch after this list. See the
  :setting:`DOWNLOAD_DELAY` setting.
* if possible, use `Google cache`_ to fetch pages, instead of hitting the sites
  directly
* use a pool of rotating IPs. For example, the free `Tor project`_.
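
The cookie and delay tips translate directly into project settings; a minimal
sketch for ``settings.py``::

    COOKIES_ENABLED = False  # keep sites from tracking the crawl via cookies
    DOWNLOAD_DELAY = 2       # wait (at least) 2 seconds between requests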

If you are still unable to prevent your bot from getting banned, consider
contacting `commercial support`_.

.. _Google cache: http://www.googleguide.com/cached_pages.html
.. _Tor project: https://www.torproject.org/
.. _commercial support: http://scrapy.org/support/

.. _LIFO: http://en.wikipedia.org/wiki/LIFO
.. _DFO order: http://en.wikipedia.org/wiki/Depth-first_search
.. _BFO order: http://en.wikipedia.org/wiki/Breadth-first_search