mirror of https://github.com/scrapy/scrapy.git synced 2025-02-24 19:03:54 +00:00

Merge remote-tracking branch 'origin/master' into 1550-shell_file-cont

This commit is contained in:
Paul Tremberth 2016-01-28 14:02:48 +01:00
commit c6f374f2eb
30 changed files with 443 additions and 145 deletions

CODE_OF_CONDUCT.md (new file)

@ -0,0 +1,50 @@
# Contributor Code of Conduct
As contributors and maintainers of this project, and in the interest of
fostering an open and welcoming community, we pledge to respect all people who
contribute through reporting issues, posting feature requests, updating
documentation, submitting pull requests or patches, and other activities.
We are committed to making participation in this project a harassment-free
experience for everyone, regardless of level of experience, gender, gender
identity and expression, sexual orientation, disability, personal appearance,
body size, race, ethnicity, age, religion, or nationality.
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery
* Personal attacks
* Trolling or insulting/derogatory comments
* Public or private harassment
* Publishing others' private information, such as physical or electronic
addresses, without explicit permission
* Other unethical or unprofessional conduct
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
By adopting this Code of Conduct, project maintainers commit themselves to
fairly and consistently applying these principles to every aspect of managing
this project. Project maintainers who do not follow or enforce the Code of
Conduct may be permanently removed from the project team.
This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community.
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting a project maintainer at opensource@scrapinghub.com. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. Maintainers are
obligated to maintain confidentiality with regard to the reporter of an
incident.
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 1.3.0, available at
[http://contributor-covenant.org/version/1/3/0/][version]
[homepage]: http://contributor-covenant.org
[version]: http://contributor-covenant.org/version/1/3/0/


@ -73,6 +73,12 @@ See http://scrapy.org/community/
Contributing
============
Please note that this project is released with a Contributor Code of Conduct
(see https://github.com/scrapy/scrapy/blob/master/CODE_OF_CONDUCT.md).
By participating in this project you agree to abide by its terms.
Please report unacceptable behavior to opensource@scrapinghub.com.
See http://doc.scrapy.org/en/master/contributing.html
Companies using Scrapy


@ -45,7 +45,7 @@ Did Scrapy "steal" X from Django?
Probably, but we don't like that word. We think Django_ is a great open source
project and an example to follow, so we've used it as an inspiration for
Scrapy.
We believe that, if something is already done well, there's no need to reinvent
it. This concept, besides being one of the foundations for open source and free
@ -85,6 +85,8 @@ How can I simulate a user login in my spider?
See :ref:`topics-request-response-ref-request-userlogin`.
.. _faq-bfo-dfo:
Does Scrapy crawl in breadth-first or depth-first order?
--------------------------------------------------------


@ -445,10 +445,10 @@ Response objects
.. attribute:: Response.body
A str containing the body of this Response. Keep in mind that Response.body
is always a str. If you want the unicode version use
:meth:`TextResponse.body_as_unicode` (only available in
:class:`TextResponse` and subclasses).
The body of this Response. Keep in mind that Response.body
is always a bytes object. If you want the unicode version use
:attr:`TextResponse.text` (only available in :class:`TextResponse`
and subclasses).
This attribute is read-only. To change the body of a Response use
:meth:`replace`.
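
For illustration, a minimal sketch of the bytes/unicode distinction described above, built by hand with a :class:`TextResponse` (the URL and body here are made up)::

    from scrapy.http import TextResponse

    response = TextResponse(url='http://www.example.com/',
                            body=b'\xc2\xa3 42',   # raw bytes (UTF-8 encoded pound sign)
                            encoding='utf-8')

    isinstance(response.body, bytes)   # True -- body is always bytes
    response.text                      # u'\xa3 42' -- decoded with response.encoding
    # Response.body is read-only; replace() returns a new response instead.
    new_response = response.replace(body=b'other payload')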
@ -542,6 +542,21 @@ TextResponse objects
:class:`TextResponse` objects support the following attributes in addition
to the standard :class:`Response` ones:
.. attribute:: TextResponse.text
Response body, as unicode.
The same as ``response.body.decode(response.encoding)``, but the
result is cached after the first call, so you can access
``response.text`` multiple times without extra overhead.
.. note::
``unicode(response.body)`` is not a correct way to convert response
body to unicode: you would be using the system default encoding
(typically `ascii`) instead of the response encoding.
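
A small sketch of that equivalence, and of why decoding with the wrong codec fails; the cp1252-encoded body below is hypothetical::

    from scrapy.http import TextResponse

    resp = TextResponse('http://www.example.com/', body=b'caf\xe9', encoding='cp1252')

    # response.text decodes with the response encoding and caches the result.
    assert resp.text == resp.body.decode(resp.encoding) == u'caf\xe9'
    # Decoding the same bytes with the system default codec (e.g. ascii)
    # would raise or mangle the non-ASCII byte instead.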
.. attribute:: TextResponse.encoding
A string with the encoding of this response. The encoding is resolved by
@ -568,20 +583,6 @@ TextResponse objects
:class:`TextResponse` objects support the following methods in addition to
the standard :class:`Response` ones:
.. method:: TextResponse.body_as_unicode()
Returns the body of the response as unicode. This is equivalent to::
response.body.decode(response.encoding)
But **not** equivalent to::
unicode(response.body)
Since, in the latter case, you would be using the system default encoding
(typically `ascii`) to convert the body to unicode, instead of the response
encoding.
.. method:: TextResponse.xpath(query)
A shortcut to ``TextResponse.selector.xpath(query)``::
@ -594,6 +595,11 @@ TextResponse objects
response.css('p')
.. method:: TextResponse.body_as_unicode()
The same as :attr:`text`, but available as a method. This method is
kept for backwards compatibility; please prefer ``response.text``.
HtmlResponse objects
--------------------


@ -276,6 +276,8 @@ DEPTH_LIMIT
Default: ``0``
Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``
The maximum depth that will be allowed to crawl for any site. If zero, no limit
will be imposed.
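
For example, to stop following links more than three hops away from the start URLs (the value is arbitrary), a project could set::

    # settings.py -- illustrative value only
    DEPTH_LIMIT = 3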
@ -286,9 +288,24 @@ DEPTH_PRIORITY
Default: ``0``
An integer that is used to adjust the request priority based on its depth.
Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``
If zero, no priority adjustment is made from depth.
An integer that is used to adjust the request priority based on its depth:
- if zero (default), no priority adjustment is made from depth
- **a positive value will decrease the priority, i.e., higher depth
requests will be processed later**; this is commonly used when doing
breadth-first crawls (BFO)
- a negative value will increase priority, i.e., higher depth requests
will be processed sooner (DFO)
See also: :ref:`faq-bfo-dfo` about tuning Scrapy for BFO or DFO.
.. note::
This setting adjusts priority **in the opposite way** compared to
other priority settings :setting:`REDIRECT_PRIORITY_ADJUST`
and :setting:`RETRY_PRIORITY_ADJUST`.
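
As a sketch of the breadth-first setup referred to above, a project might combine a positive ``DEPTH_PRIORITY`` with FIFO queues; treat the exact queue class paths as an assumption for this Scrapy version::

    # settings.py -- one possible breadth-first (BFO) configuration
    DEPTH_PRIORITY = 1                                            # deeper requests get lower priority
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'   # FIFO disk queue
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'     # FIFO memory queue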
.. setting:: DEPTH_STATS
@ -297,6 +314,8 @@ DEPTH_STATS
Default: ``True``
Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``
Whether to collect maximum depth stats.
.. setting:: DEPTH_STATS_VERBOSE
@ -306,6 +325,8 @@ DEPTH_STATS_VERBOSE
Default: ``False``
Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``
Whether to collect verbose depth stats. If this is enabled, the number of
requests for each depth is collected in the stats.
@ -750,8 +771,8 @@ Default: ``60.0``
Scope: ``scrapy.extensions.memusage``
The :ref:`Memory usage extension <topics-extensions-ref-memusage>`
checks the current memory usage, versus the limits set by
:setting:`MEMUSAGE_LIMIT_MB` and :setting:`MEMUSAGE_WARNING_MB`,
at fixed time intervals.
This sets the length of these intervals, in seconds.
@ -864,8 +885,26 @@ REDIRECT_PRIORITY_ADJUST
Default: ``+2``
Adjust redirect request priority relative to original request.
A negative priority adjust means more priority.
Scope: ``scrapy.downloadermiddlewares.redirect.RedirectMiddleware``
Adjust redirect request priority relative to original request:
- **a positive priority adjust (default) means higher priority.**
- a negative priority adjust means lower priority.
.. setting:: RETRY_PRIORITY_ADJUST
RETRY_PRIORITY_ADJUST
---------------------
Default: ``-1``
Scope: ``scrapy.downloadermiddlewares.retry.RetryMiddleware``
Adjust retry request priority relative to original request:
- a positive priority adjust means higher priority.
- **a negative priority adjust (default) means lower priority.**
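
Spelled out as settings, these are just the defaults made explicit, for illustration::

    # settings.py -- default adjustments written out
    REDIRECT_PRIORITY_ADJUST = 2    # redirected requests are scheduled with higher priority
    RETRY_PRIORITY_ADJUST = -1      # retried requests are scheduled with lower priority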
.. setting:: ROBOTSTXT_OBEY
@ -877,7 +916,13 @@ Default: ``False``
Scope: ``scrapy.downloadermiddlewares.robotstxt``
If enabled, Scrapy will respect robots.txt policies. For more information see
:ref:`topics-dlmw-robots`
:ref:`topics-dlmw-robots`.
.. note::
While the default value is ``False`` for historical reasons,
this option is enabled by default in the settings.py file generated
by the ``scrapy startproject`` command.
.. setting:: SCHEDULER
@ -1036,7 +1081,7 @@ TEMPLATES_DIR
Default: ``templates`` dir inside scrapy module
The directory where to look for templates when creating new projects with
:command:`startproject` command and new spiders with :command:`genspider`
command.
The project name must not conflict with the name of custom files or directories


@ -273,6 +273,9 @@ OffsiteMiddleware
This middleware filters out every request whose host names aren't in the
spider's :attr:`~scrapy.spiders.Spider.allowed_domains` attribute.
All subdomains of any domain in the list are also allowed.
E.g. the rule ``www.example.org`` will also allow ``bob.www.example.org``
but not ``www2.example.com`` nor ``example.com``.
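
A sketch of how that rule plays out in a spider; all domain names here are hypothetical::

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = 'example'
        allowed_domains = ['example.org']          # example.org and any of its subdomains
        start_urls = ['http://www.example.org/']

        def parse(self, response):
            # kept: the host is a subdomain of an allowed domain
            yield scrapy.Request('http://bob.www.example.org/page')
            # filtered by OffsiteMiddleware: not example.org or a subdomain of it
            yield scrapy.Request('http://notexample.org/page')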
When your spider returns a request for a domain not belonging to those
covered by the spider, this middleware will log a debug message similar to


@ -76,7 +76,7 @@ scrapy.Spider
An optional list of strings containing domains that this spider is
allowed to crawl. Requests for URLs not belonging to the domain names
specified in this list won't be followed if
specified in this list (or their subdomains) won't be followed if
:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware` is enabled.
.. attribute:: start_urls


@ -63,7 +63,7 @@ class AjaxCrawlMiddleware(object):
Return True if a page without hash fragment could be "AJAX crawlable"
according to https://developers.google.com/webmasters/ajax-crawling/docs/getting-started.
"""
body = response.body_as_unicode()[:self.lookup_bytes]
body = response.text[:self.lookup_bytes]
return _has_ajaxcrawlable_meta(body)


@ -83,8 +83,8 @@ class RobotsTxtMiddleware(object):
def _parse_robots(self, response, netloc):
rp = robotparser.RobotFileParser(response.url)
body = ''
if hasattr(response, 'body_as_unicode'):
body = response.body_as_unicode()
if hasattr(response, 'text'):
body = response.text
else: # last effort try
try:
body = response.body.decode('utf-8')


@ -3,6 +3,7 @@ Item Exporters are used to export/serialize items into different formats.
"""
import csv
import io
import sys
import pprint
import marshal
@ -11,7 +12,11 @@ from six.moves import cPickle as pickle
from xml.sax.saxutils import XMLGenerator
from scrapy.utils.serialize import ScrapyJSONEncoder
from scrapy.utils.python import to_bytes, to_unicode, to_native_str, is_listlike
from scrapy.item import BaseItem
from scrapy.exceptions import ScrapyDeprecationWarning
import warnings
__all__ = ['BaseItemExporter', 'PprintItemExporter', 'PickleItemExporter',
'CsvItemExporter', 'XmlItemExporter', 'JsonLinesItemExporter',
@ -38,7 +43,7 @@ class BaseItemExporter(object):
raise NotImplementedError
def serialize_field(self, field, name, value):
serializer = field.get('serializer', self._to_str_if_unicode)
serializer = field.get('serializer', lambda x: x)
return serializer(value)
def start_exporting(self):
@ -47,9 +52,6 @@ class BaseItemExporter(object):
def finish_exporting(self):
pass
def _to_str_if_unicode(self, value):
return value.encode(self.encoding) if isinstance(value, unicode) else value
def _get_serialized_fields(self, item, default_value=None, include_empty=None):
"""Return the fields to export as an iterable of tuples
(name, serialized_value)
@ -86,10 +88,10 @@ class JsonLinesItemExporter(BaseItemExporter):
def export_item(self, item):
itemdict = dict(self._get_serialized_fields(item))
self.file.write(self.encoder.encode(itemdict) + '\n')
self.file.write(to_bytes(self.encoder.encode(itemdict) + '\n'))
class JsonItemExporter(JsonLinesItemExporter):
class JsonItemExporter(BaseItemExporter):
def __init__(self, file, **kwargs):
self._configure(kwargs, dont_fail=True)
@ -98,18 +100,18 @@ class JsonItemExporter(JsonLinesItemExporter):
self.first_item = True
def start_exporting(self):
self.file.write("[")
self.file.write(b"[")
def finish_exporting(self):
self.file.write("]")
self.file.write(b"]")
def export_item(self, item):
if self.first_item:
self.first_item = False
else:
self.file.write(',\n')
self.file.write(b',\n')
itemdict = dict(self._get_serialized_fields(item))
self.file.write(self.encoder.encode(itemdict))
self.file.write(to_bytes(self.encoder.encode(itemdict)))
class XmlItemExporter(BaseItemExporter):
@ -139,7 +141,7 @@ class XmlItemExporter(BaseItemExporter):
if hasattr(serialized_value, 'items'):
for subname, value in serialized_value.items():
self._export_xml_field(subname, value)
elif hasattr(serialized_value, '__iter__'):
elif is_listlike(serialized_value):
for value in serialized_value:
self._export_xml_field('value', value)
else:
@ -153,10 +155,10 @@ class XmlItemExporter(BaseItemExporter):
# and Python 3.x will require unicode, so ">= 2.7.4" should be fine.
if sys.version_info[:3] >= (2, 7, 4):
def _xg_characters(self, serialized_value):
if not isinstance(serialized_value, unicode):
if not isinstance(serialized_value, six.text_type):
serialized_value = serialized_value.decode(self.encoding)
return self.xg.characters(serialized_value)
else:
else: # pragma: no cover
def _xg_characters(self, serialized_value):
return self.xg.characters(serialized_value)
@ -166,17 +168,22 @@ class CsvItemExporter(BaseItemExporter):
def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
self._configure(kwargs, dont_fail=True)
self.include_headers_line = include_headers_line
file = file if six.PY2 else io.TextIOWrapper(file, line_buffering=True)
self.csv_writer = csv.writer(file, **kwargs)
self._headers_not_written = True
self._join_multivalued = join_multivalued
def _to_str_if_unicode(self, value):
def serialize_field(self, field, name, value):
serializer = field.get('serializer', self._join_if_needed)
return serializer(value)
def _join_if_needed(self, value):
if isinstance(value, (list, tuple)):
try:
value = self._join_multivalued.join(value)
return self._join_multivalued.join(value)
except TypeError: # list in value may not contain strings
pass
return super(CsvItemExporter, self)._to_str_if_unicode(value)
return value
def export_item(self, item):
if self._headers_not_written:
@ -185,9 +192,16 @@ class CsvItemExporter(BaseItemExporter):
fields = self._get_serialized_fields(item, default_value='',
include_empty=True)
values = [x[1] for x in fields]
values = list(self._build_row(x for _, x in fields))
self.csv_writer.writerow(values)
def _build_row(self, values):
for s in values:
try:
yield to_native_str(s)
except TypeError:
yield s
def _write_headers_and_set_fields_to_export(self, item):
if self.include_headers_line:
if not self.fields_to_export:
@ -197,7 +211,8 @@ class CsvItemExporter(BaseItemExporter):
else:
# use fields declared in Item
self.fields_to_export = list(item.fields.keys())
self.csv_writer.writerow(self.fields_to_export)
row = list(self._build_row(self.fields_to_export))
self.csv_writer.writerow(row)
class PickleItemExporter(BaseItemExporter):
@ -230,7 +245,7 @@ class PprintItemExporter(BaseItemExporter):
def export_item(self, item):
itemdict = dict(self._get_serialized_fields(item))
self.file.write(pprint.pformat(itemdict) + '\n')
self.file.write(to_bytes(pprint.pformat(itemdict) + '\n'))
class PythonItemExporter(BaseItemExporter):
@ -239,6 +254,13 @@ class PythonItemExporter(BaseItemExporter):
json, msgpack, binc, etc) can be used on top of it. Its main goal is to
seamlessly support what BaseItemExporter does plus nested items.
"""
def _configure(self, options, dont_fail=False):
self.binary = options.pop('binary', True)
super(PythonItemExporter, self)._configure(options, dont_fail)
if self.binary:
warnings.warn(
"PythonItemExporter will drop support for binary export in the future",
ScrapyDeprecationWarning)
def serialize_field(self, field, name, value):
serializer = field.get('serializer', self._serialize_value)
@ -249,13 +271,20 @@ class PythonItemExporter(BaseItemExporter):
return self.export_item(value)
if isinstance(value, dict):
return dict(self._serialize_dict(value))
if hasattr(value, '__iter__'):
if is_listlike(value):
return [self._serialize_value(v) for v in value]
return self._to_str_if_unicode(value)
encode_func = to_bytes if self.binary else to_unicode
if isinstance(value, (six.text_type, bytes)):
return encode_func(value, encoding=self.encoding)
return value
def _serialize_dict(self, value):
for key, val in six.iteritems(value):
key = to_bytes(key) if self.binary else key
yield key, self._serialize_value(val)
def export_item(self, item):
return dict(self._get_serialized_fields(item))
result = dict(self._get_serialized_fields(item))
if self.binary:
result = dict(self._serialize_dict(result))
return result
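
Since the exporters above now write bytes, the output file has to be opened in binary mode; a minimal usage sketch (the output path and item are made up):

    from scrapy.exporters import JsonItemExporter

    with open('items.json', 'wb') as f:     # binary mode is required now
        exporter = JsonItemExporter(f)
        exporter.start_exporting()
        exporter.export_item({'name': u'John\xa3', 'age': u'22'})
        exporter.finish_exporting()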


@ -9,6 +9,7 @@ from collections import defaultdict
from twisted.internet import reactor
from scrapy import signals
from scrapy.exceptions import NotConfigured
class CloseSpider(object):
@ -23,6 +24,9 @@ class CloseSpider(object):
'errorcount': crawler.settings.getint('CLOSESPIDER_ERRORCOUNT'),
}
if not any(self.close_on.values()):
raise NotConfigured
self.counter = defaultdict(int)
if self.close_on.get('errorcount'):


@ -2,6 +2,7 @@ import os
from six.moves import cPickle as pickle
from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.utils.job import job_dir
class SpiderState(object):
@ -12,7 +13,11 @@ class SpiderState(object):
@classmethod
def from_crawler(cls, crawler):
obj = cls(job_dir(crawler.settings))
jobdir = job_dir(crawler.settings)
if not jobdir:
raise NotConfigured
obj = cls(jobdir)
crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)
crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
return obj


@ -64,8 +64,8 @@ def _urlencode(seq, enc):
def _get_form(response, formname, formid, formnumber, formxpath):
"""Find the form element """
text = response.body_as_unicode()
root = create_root_node(text, lxml.html.HTMLParser, base_url=get_base_url(response))
root = create_root_node(response.text, lxml.html.HTMLParser,
base_url=get_base_url(response))
forms = root.xpath('//form')
if not forms:
raise ValueError("No <form> element found in %s" % response)


@ -59,7 +59,12 @@ class TextResponse(Response):
def body_as_unicode(self):
"""Return body as unicode"""
# check for self.encoding before _cached_ubody just in
return self.text
@property
def text(self):
""" Body as unicode """
# access self.encoding before _cached_ubody to make sure
# _body_inferred_encoding is called
benc = self.encoding
if self._cached_ubody is None:


@ -28,6 +28,7 @@ class MiddlewareManager(object):
def from_settings(cls, settings, crawler=None):
mwlist = cls._get_mwlist_from_settings(settings)
middlewares = []
enabled = []
for clspath in mwlist:
try:
mwcls = load_object(clspath)
@ -38,15 +39,17 @@ class MiddlewareManager(object):
else:
mw = mwcls()
middlewares.append(mw)
enabled.append(clspath)
except NotConfigured as e:
if e.args:
clsname = clspath.split('.')[-1]
logger.warning("Disabled %(clsname)s: %(eargs)s",
{'clsname': clsname, 'eargs': e.args[0]},
extra={'crawler': crawler})
logger.info("Enabled %(componentname)ss:\n%(enabledlist)s",
{'componentname': cls.component_name,
'enabledlist': pprint.pformat(mwlist)},
'enabledlist': pprint.pformat(enabled)},
extra={'crawler': crawler})
return cls(*middlewares)


@ -60,7 +60,7 @@ class Selector(_ParselSelector, object_ref):
response = _response_from_text(text, st)
if response is not None:
text = response.body_as_unicode()
text = response.text
kwargs.setdefault('base_url', response.url)
self.response = response


@ -18,6 +18,9 @@ NEWSPIDER_MODULE = '$project_name.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = '$project_name (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32


@ -7,11 +7,22 @@ This module must not depend on any module outside the Standard Library.
import copy
import six
import warnings
from collections import OrderedDict
from scrapy.exceptions import ScrapyDeprecationWarning
class MultiValueDictKeyError(KeyError):
pass
def __init__(self, *args, **kwargs):
warnings.warn(
"scrapy.utils.datatypes.MultiValueDictKeyError is deprecated "
"and will be removed in future releases.",
category=ScrapyDeprecationWarning,
stacklevel=2
)
super(MultiValueDictKeyError, self).__init__(*args, **kwargs)
class MultiValueDict(dict):
"""
@ -31,6 +42,10 @@ class MultiValueDict(dict):
single name-value pairs.
"""
def __init__(self, key_to_list_mapping=()):
warnings.warn("scrapy.utils.datatypes.MultiValueDict is deprecated "
"and will be removed in future releases.",
category=ScrapyDeprecationWarning,
stacklevel=2)
dict.__init__(self, key_to_list_mapping)
def __repr__(self):
@ -137,10 +152,18 @@ class MultiValueDict(dict):
for key, value in six.iteritems(kwargs):
self.setlistdefault(key, []).append(value)
class SiteNode(object):
"""Class to represent a site node (page, image or any other file)"""
def __init__(self, url):
warnings.warn(
"scrapy.utils.datatypes.SiteNode is deprecated "
"and will be removed in future releases.",
category=ScrapyDeprecationWarning,
stacklevel=2
)
self.url = url
self.itemnames = []
self.children = []


@ -137,7 +137,7 @@ def _body_or_str(obj, unicode=True):
if not unicode:
return obj.body
elif isinstance(obj, TextResponse):
return obj.body_as_unicode()
return obj.text
else:
return obj.body.decode('utf-8')
elif isinstance(obj, six.text_type):


@ -25,7 +25,7 @@ _baseurl_cache = weakref.WeakKeyDictionary()
def get_base_url(response):
"""Return the base url of the given response, joined with the response url"""
if response not in _baseurl_cache:
text = response.body_as_unicode()[0:4096]
text = response.text[0:4096]
_baseurl_cache[response] = html.get_base_url(text, response.url,
response.encoding)
return _baseurl_cache[response]
@ -37,7 +37,7 @@ _metaref_cache = weakref.WeakKeyDictionary()
def get_meta_refresh(response):
"""Parse the http-equiv refrsh parameter from the given response"""
if response not in _metaref_cache:
text = response.body_as_unicode()[0:4096]
text = response.text[0:4096]
text = _noscript_re.sub(u'', text)
text = _script_re.sub(u'', text)
_metaref_cache[response] = html.get_meta_refresh(text, response.url,


@ -1,10 +1,5 @@
tests/test_exporters.py
tests/test_linkextractors_deprecated.py
tests/test_mail.py
tests/test_pipeline_files.py
tests/test_pipeline_images.py
tests/test_proxy_connect.py
tests/test_spidermiddleware_httperror.py
scrapy/xlib/tx/iweb.py
scrapy/xlib/tx/interfaces.py
@ -14,12 +9,9 @@ scrapy/xlib/tx/_newclient.py
scrapy/xlib/tx/__init__.py
scrapy/core/downloader/handlers/s3.py
scrapy/core/downloader/handlers/ftp.py
scrapy/pipelines/images.py
scrapy/pipelines/files.py
scrapy/linkextractors/sgml.py
scrapy/linkextractors/regex.py
scrapy/linkextractors/htmlparser.py
scrapy/downloadermiddlewares/cookies.py
scrapy/extensions/statsmailer.py
scrapy/extensions/memusage.py
scrapy/mail.py


@ -4,6 +4,7 @@ pytest-cov
testfixtures
jmespath
leveldb
boto
# optional for shell wrapper tests
bpython
ipython


@ -437,6 +437,8 @@ class S3AnonTestCase(unittest.TestCase):
import boto
except ImportError:
skip = 'missing boto library'
if six.PY3:
skip = 'S3 not supported on Py3'
def setUp(self):
self.s3reqh = S3DownloadHandler(Settings(),
@ -459,6 +461,8 @@ class S3TestCase(unittest.TestCase):
import boto
except ImportError:
skip = 'missing boto library'
if six.PY3:
skip = 'S3 not supported on Py3'
# test use same example keys than amazon developer guide
# http://s3.amazonaws.com/awsdocs/S3/20060301/s3-dg-20060301.pdf


@ -55,12 +55,11 @@ class TestSpider(Spider):
def parse_item(self, response):
item = self.item_cls()
body = response.body_as_unicode()
m = self.name_re.search(body)
m = self.name_re.search(response.text)
if m:
item['name'] = m.group(1)
item['url'] = response.url
m = self.price_re.search(body)
m = self.price_re.search(response.text)
if m:
item['price'] = m.group(1)
return item


@ -1,17 +1,21 @@
from __future__ import absolute_import
import re
import json
import marshal
import tempfile
import unittest
from io import BytesIO
from six.moves import cPickle as pickle
import lxml.etree
import six
from scrapy.item import Item, Field
from scrapy.utils.python import to_unicode
from scrapy.exporters import (
BaseItemExporter, PprintItemExporter, PickleItemExporter, CsvItemExporter,
XmlItemExporter, JsonLinesItemExporter, JsonItemExporter, PythonItemExporter
XmlItemExporter, JsonLinesItemExporter, JsonItemExporter,
PythonItemExporter, MarshalItemExporter
)
@ -23,7 +27,7 @@ class TestItem(Item):
class BaseItemExporterTest(unittest.TestCase):
def setUp(self):
self.i = TestItem(name=u'John\xa3', age='22')
self.i = TestItem(name=u'John\xa3', age=u'22')
self.output = BytesIO()
self.ie = self._get_exporter()
@ -56,19 +60,19 @@ class BaseItemExporterTest(unittest.TestCase):
def test_serialize_field(self):
res = self.ie.serialize_field(self.i.fields['name'], 'name', self.i['name'])
self.assertEqual(res, 'John\xc2\xa3')
self.assertEqual(res, u'John\xa3')
res = self.ie.serialize_field(self.i.fields['age'], 'age', self.i['age'])
self.assertEqual(res, '22')
self.assertEqual(res, u'22')
def test_fields_to_export(self):
ie = self._get_exporter(fields_to_export=['name'])
self.assertEqual(list(ie._get_serialized_fields(self.i)), [('name', 'John\xc2\xa3')])
self.assertEqual(list(ie._get_serialized_fields(self.i)), [('name', u'John\xa3')])
ie = self._get_exporter(fields_to_export=['name'], encoding='latin-1')
name = list(ie._get_serialized_fields(self.i))[0][1]
assert isinstance(name, str)
self.assertEqual(name, 'John\xa3')
_, name = list(ie._get_serialized_fields(self.i))[0]
assert isinstance(name, six.text_type)
self.assertEqual(name, u'John\xa3')
def test_field_custom_serializer(self):
def custom_serializer(value):
@ -78,16 +82,20 @@ class BaseItemExporterTest(unittest.TestCase):
name = Field()
age = Field(serializer=custom_serializer)
i = CustomFieldItem(name=u'John\xa3', age='22')
i = CustomFieldItem(name=u'John\xa3', age=u'22')
ie = self._get_exporter()
self.assertEqual(ie.serialize_field(i.fields['name'], 'name', i['name']), 'John\xc2\xa3')
self.assertEqual(ie.serialize_field(i.fields['name'], 'name', i['name']), u'John\xa3')
self.assertEqual(ie.serialize_field(i.fields['age'], 'age', i['age']), '24')
class PythonItemExporterTest(BaseItemExporterTest):
def _get_exporter(self, **kwargs):
return PythonItemExporter(**kwargs)
return PythonItemExporter(binary=False, **kwargs)
def test_invalid_option(self):
with self.assertRaisesRegexp(TypeError, "Unexpected options: invalid_option"):
PythonItemExporter(invalid_option='something')
def test_nested_item(self):
i1 = TestItem(name=u'Joseph', age='22')
@ -120,6 +128,25 @@ class PythonItemExporterTest(BaseItemExporterTest):
self.assertEqual(type(exported['age'][0]), dict)
self.assertEqual(type(exported['age'][0]['age'][0]), dict)
def test_export_binary(self):
exporter = PythonItemExporter(binary=True)
value = TestItem(name=u'John\xa3', age=u'22')
expected = {b'name': b'John\xc2\xa3', b'age': b'22'}
self.assertEqual(expected, exporter.export_item(value))
def test_other_python_types_item(self):
from datetime import datetime
now = datetime.now()
item = {
'boolean': False,
'number': 22,
'time': now,
'float': 3.14,
}
ie = self._get_exporter()
exported = ie.export_item(item)
self.assertEqual(exported, item)
class PprintItemExporterTest(BaseItemExporterTest):
@ -152,18 +179,30 @@ class PickleItemExporterTest(BaseItemExporterTest):
self.assertEqual(pickle.load(f), i2)
class CsvItemExporterTest(BaseItemExporterTest):
class MarshalItemExporterTest(BaseItemExporterTest):
def _get_exporter(self, **kwargs):
self.output = tempfile.TemporaryFile()
return MarshalItemExporter(self.output, **kwargs)
def _check_output(self):
self.output.seek(0)
self._assert_expected_item(marshal.load(self.output))
class CsvItemExporterTest(BaseItemExporterTest):
def _get_exporter(self, **kwargs):
return CsvItemExporter(self.output, **kwargs)
def assertCsvEqual(self, first, second, msg=None):
first = to_unicode(first)
second = to_unicode(second)
csvsplit = lambda csv: [sorted(re.split(r'(,|\s+)', line))
for line in csv.splitlines(True)]
return self.assertEqual(csvsplit(first), csvsplit(second), msg)
def _check_output(self):
self.assertCsvEqual(self.output.getvalue(), 'age,name\r\n22,John\xc2\xa3\r\n')
self.assertCsvEqual(to_unicode(self.output.getvalue()), u'age,name\r\n22,John\xa3\r\n')
def assertExportResult(self, item, expected, **kwargs):
fp = BytesIO()
@ -177,13 +216,13 @@ class CsvItemExporterTest(BaseItemExporterTest):
self.assertExportResult(
item=self.i,
fields_to_export=self.i.fields.keys(),
expected='age,name\r\n22,John\xc2\xa3\r\n',
expected=b'age,name\r\n22,John\xc2\xa3\r\n',
)
def test_header_export_all_dict(self):
self.assertExportResult(
item=dict(self.i),
expected='age,name\r\n22,John\xc2\xa3\r\n',
expected=b'age,name\r\n22,John\xc2\xa3\r\n',
)
def test_header_export_single_field(self):
@ -191,7 +230,7 @@ class CsvItemExporterTest(BaseItemExporterTest):
self.assertExportResult(
item=item,
fields_to_export=['age'],
expected='age\r\n22\r\n',
expected=b'age\r\n22\r\n',
)
def test_header_export_two_items(self):
@ -202,14 +241,15 @@ class CsvItemExporterTest(BaseItemExporterTest):
ie.export_item(item)
ie.export_item(item)
ie.finish_exporting()
self.assertCsvEqual(output.getvalue(), 'age,name\r\n22,John\xc2\xa3\r\n22,John\xc2\xa3\r\n')
self.assertCsvEqual(output.getvalue(),
b'age,name\r\n22,John\xc2\xa3\r\n22,John\xc2\xa3\r\n')
def test_header_no_header_line(self):
for item in [self.i, dict(self.i)]:
self.assertExportResult(
item=item,
include_headers_line=False,
expected='22,John\xc2\xa3\r\n',
expected=b'22,John\xc2\xa3\r\n',
)
def test_join_multivalue(self):
@ -224,6 +264,28 @@ class CsvItemExporterTest(BaseItemExporterTest):
expected='"Mary,Paul",John\r\n',
)
def test_join_multivalue_not_strings(self):
self.assertExportResult(
item=dict(name='John', friends=[4, 8]),
include_headers_line=False,
expected='"[4, 8]",John\r\n',
)
def test_other_python_types_item(self):
from datetime import datetime
now = datetime(2015, 1, 1, 1, 1, 1)
item = {
'boolean': False,
'number': 22,
'time': now,
'float': 3.14,
}
self.assertExportResult(
item=item,
include_headers_line=False,
expected='22,False,3.14,2015-01-01 01:01:01\r\n'
)
class XmlItemExporterTest(BaseItemExporterTest):
@ -252,13 +314,13 @@ class XmlItemExporterTest(BaseItemExporterTest):
self.assertXmlEquivalent(fp.getvalue(), expected_value)
def _check_output(self):
expected_value = '<?xml version="1.0" encoding="utf-8"?>\n<items><item><age>22</age><name>John\xc2\xa3</name></item></items>'
expected_value = b'<?xml version="1.0" encoding="utf-8"?>\n<items><item><age>22</age><name>John\xc2\xa3</name></item></items>'
self.assertXmlEquivalent(self.output.getvalue(), expected_value)
def test_multivalued_fields(self):
self.assertExportResult(
TestItem(name=[u'John\xa3', u'Doe']),
'<?xml version="1.0" encoding="utf-8"?>\n<items><item><name><value>John\xc2\xa3</value><value>Doe</value></name></item></items>'
b'<?xml version="1.0" encoding="utf-8"?>\n<items><item><name><value>John\xc2\xa3</value><value>Doe</value></name></item></items>'
)
def test_nested_item(self):
@ -267,19 +329,19 @@ class XmlItemExporterTest(BaseItemExporterTest):
i3 = TestItem(name=u'buz', age=i2)
self.assertExportResult(i3,
'<?xml version="1.0" encoding="utf-8"?>\n'
'<items>'
'<item>'
'<age>'
'<age>'
'<age>22</age>'
'<name>foo\xc2\xa3hoo</name>'
'</age>'
'<name>bar</name>'
'</age>'
'<name>buz</name>'
'</item>'
'</items>'
b'<?xml version="1.0" encoding="utf-8"?>\n'
b'<items>'
b'<item>'
b'<age>'
b'<age>'
b'<age>22</age>'
b'<name>foo\xc2\xa3hoo</name>'
b'</age>'
b'<name>bar</name>'
b'</age>'
b'<name>buz</name>'
b'</item>'
b'</items>'
)
def test_nested_list_item(self):
@ -288,16 +350,16 @@ class XmlItemExporterTest(BaseItemExporterTest):
i3 = TestItem(name=u'buz', age=[i1, i2])
self.assertExportResult(i3,
'<?xml version="1.0" encoding="utf-8"?>\n'
'<items>'
'<item>'
'<age>'
'<value><name>foo</name></value>'
'<value><name>bar</name><v2><egg><value>spam</value></egg></v2></value>'
'</age>'
'<name>buz</name>'
'</item>'
'</items>'
b'<?xml version="1.0" encoding="utf-8"?>\n'
b'<items>'
b'<item>'
b'<age>'
b'<value><name>foo</name></value>'
b'<value><name>bar</name><v2><egg><value>spam</value></egg></v2></value>'
b'</age>'
b'<name>buz</name>'
b'</item>'
b'</items>'
)
@ -309,7 +371,7 @@ class JsonLinesItemExporterTest(BaseItemExporterTest):
return JsonLinesItemExporter(self.output, **kwargs)
def _check_output(self):
exported = json.loads(self.output.getvalue().strip())
exported = json.loads(to_unicode(self.output.getvalue().strip()))
self.assertEqual(exported, dict(self.i))
def test_nested_item(self):
@ -319,7 +381,7 @@ class JsonLinesItemExporterTest(BaseItemExporterTest):
self.ie.start_exporting()
self.ie.export_item(i3)
self.ie.finish_exporting()
exported = json.loads(self.output.getvalue())
exported = json.loads(to_unicode(self.output.getvalue()))
self.assertEqual(exported, self._expected_nested)
def test_extra_keywords(self):
@ -337,7 +399,7 @@ class JsonItemExporterTest(JsonLinesItemExporterTest):
return JsonItemExporter(self.output, **kwargs)
def _check_output(self):
exported = json.loads(self.output.getvalue().strip())
exported = json.loads(to_unicode(self.output.getvalue().strip()))
self.assertEqual(exported, [dict(self.i)])
def assertTwoItemsExported(self, item):
@ -345,7 +407,7 @@ class JsonItemExporterTest(JsonLinesItemExporterTest):
self.ie.export_item(item)
self.ie.export_item(item)
self.ie.finish_exporting()
exported = json.loads(self.output.getvalue())
exported = json.loads(to_unicode(self.output.getvalue()))
self.assertEqual(exported, [dict(item), dict(item)])
def test_two_items(self):
@ -361,7 +423,7 @@ class JsonItemExporterTest(JsonLinesItemExporterTest):
self.ie.start_exporting()
self.ie.export_item(i3)
self.ie.finish_exporting()
exported = json.loads(self.output.getvalue())
exported = json.loads(to_unicode(self.output.getvalue()))
expected = {'name': u'Jesus', 'age': {'name': 'Maria', 'age': dict(i1)}}
self.assertEqual(exported, [expected])
@ -372,7 +434,7 @@ class JsonItemExporterTest(JsonLinesItemExporterTest):
self.ie.start_exporting()
self.ie.export_item(i3)
self.ie.finish_exporting()
exported = json.loads(self.output.getvalue())
exported = json.loads(to_unicode(self.output.getvalue()))
expected = {'name': u'Jesus', 'age': {'name': 'Maria', 'age': i1}}
self.assertEqual(exported, [expected])


@ -5,7 +5,6 @@ import json
from io import BytesIO
import tempfile
import shutil
import six
from six.moves.urllib.parse import urlparse
from zope.interface.verify import verifyObject
@ -22,6 +21,7 @@ from scrapy.extensions.feedexport import (
S3FeedStorage, StdoutFeedStorage
)
from scrapy.utils.test import assert_aws_environ
from scrapy.utils.python import to_native_str
class FileFeedStorageTest(unittest.TestCase):
@ -120,8 +120,6 @@ class StdoutFeedStorageTest(unittest.TestCase):
class FeedExportTest(unittest.TestCase):
skip = not six.PY2
class MyItem(scrapy.Item):
foo = scrapy.Field()
egg = scrapy.Field()
@ -170,7 +168,7 @@ class FeedExportTest(unittest.TestCase):
settings.update({'FEED_FORMAT': 'csv'})
data = yield self.exported_data(items, settings)
reader = csv.DictReader(data.splitlines())
reader = csv.DictReader(to_native_str(data).splitlines())
got_rows = list(reader)
if ordered:
self.assertEqual(reader.fieldnames, header)
@ -184,14 +182,57 @@ class FeedExportTest(unittest.TestCase):
settings = settings or {}
settings.update({'FEED_FORMAT': 'jl'})
data = yield self.exported_data(items, settings)
parsed = [json.loads(line) for line in data.splitlines()]
parsed = [json.loads(to_native_str(line)) for line in data.splitlines()]
rows = [{k: v for k, v in row.items() if v} for row in rows]
self.assertEqual(rows, parsed)
@defer.inlineCallbacks
def assertExportedXml(self, items, rows, settings=None):
settings = settings or {}
settings.update({'FEED_FORMAT': 'xml'})
data = yield self.exported_data(items, settings)
rows = [{k: v for k, v in row.items() if v} for row in rows]
import lxml.etree
root = lxml.etree.fromstring(data)
got_rows = [{e.tag: e.text for e in it} for it in root.findall('item')]
self.assertEqual(rows, got_rows)
def _load_until_eof(self, data, load_func):
bytes_output = BytesIO(data)
result = []
while True:
try:
result.append(load_func(bytes_output))
except EOFError:
break
return result
@defer.inlineCallbacks
def assertExportedPickle(self, items, rows, settings=None):
settings = settings or {}
settings.update({'FEED_FORMAT': 'pickle'})
data = yield self.exported_data(items, settings)
expected = [{k: v for k, v in row.items() if v} for row in rows]
import pickle
result = self._load_until_eof(data, load_func=pickle.load)
self.assertEqual(expected, result)
@defer.inlineCallbacks
def assertExportedMarshal(self, items, rows, settings=None):
settings = settings or {}
settings.update({'FEED_FORMAT': 'marshal'})
data = yield self.exported_data(items, settings)
expected = [{k: v for k, v in row.items() if v} for row in rows]
import marshal
result = self._load_until_eof(data, load_func=marshal.load)
self.assertEqual(expected, result)
@defer.inlineCallbacks
def assertExported(self, items, header, rows, settings=None, ordered=True):
yield self.assertExportedCsv(items, header, rows, settings, ordered)
yield self.assertExportedJsonLines(items, rows, settings)
yield self.assertExportedXml(items, rows, settings)
yield self.assertExportedPickle(items, rows, settings)
@defer.inlineCallbacks
def test_export_items(self):


@ -107,9 +107,11 @@ class BaseResponseTest(unittest.TestCase):
body_bytes = body
assert isinstance(response.body, bytes)
assert isinstance(response.text, six.text_type)
self._assert_response_encoding(response, encoding)
self.assertEqual(response.body, body_bytes)
self.assertEqual(response.body_as_unicode(), body_unicode)
self.assertEqual(response.text, body_unicode)
def _assert_response_encoding(self, response, encoding):
self.assertEqual(response.encoding, resolve_encoding(encoding))
@ -171,6 +173,10 @@ class TextResponseTest(BaseResponseTest):
self.assertTrue(isinstance(r1.body_as_unicode(), six.text_type))
self.assertEqual(r1.body_as_unicode(), unicode_string)
# check response.text
self.assertTrue(isinstance(r1.text, six.text_type))
self.assertEqual(r1.text, unicode_string)
def test_encoding(self):
r1 = self.response_class("http://www.example.com", headers={"Content-type": ["text/html; charset=utf-8"]}, body=b"\xc2\xa3")
r2 = self.response_class("http://www.example.com", encoding='utf-8', body=u"\xa3")
@ -219,12 +225,12 @@ class TextResponseTest(BaseResponseTest):
headers={"Content-type": ["text/html; charset=utf-8"]},
body=b"\xef\xbb\xbfWORD\xe3\xab")
self.assertEqual(r6.encoding, 'utf-8')
self.assertEqual(r6.body_as_unicode(), u'WORD\ufffd\ufffd')
self.assertEqual(r6.text, u'WORD\ufffd\ufffd')
def test_bom_is_removed_from_body(self):
# Inferring encoding from the body also caches the decoded body as a side effect;
# this test tries to ensure that calling response.encoding and
# response.body_as_unicode() in either order doesn't affect the final
# response.text in either order doesn't affect the final
# values for encoding and decoded body.
url = 'http://example.com'
body = b"\xef\xbb\xbfWORD"
@ -233,9 +239,9 @@ class TextResponseTest(BaseResponseTest):
# Test response without content-type and BOM encoding
response = self.response_class(url, body=body)
self.assertEqual(response.encoding, 'utf-8')
self.assertEqual(response.body_as_unicode(), u'WORD')
self.assertEqual(response.text, u'WORD')
response = self.response_class(url, body=body)
self.assertEqual(response.body_as_unicode(), u'WORD')
self.assertEqual(response.text, u'WORD')
self.assertEqual(response.encoding, 'utf-8')
# Body caching side effect isn't triggered when encoding is declared in
@ -243,9 +249,9 @@ class TextResponseTest(BaseResponseTest):
# body
response = self.response_class(url, headers=headers, body=body)
self.assertEqual(response.encoding, 'utf-8')
self.assertEqual(response.body_as_unicode(), u'WORD')
self.assertEqual(response.text, u'WORD')
response = self.response_class(url, headers=headers, body=body)
self.assertEqual(response.body_as_unicode(), u'WORD')
self.assertEqual(response.text, u'WORD')
self.assertEqual(response.encoding, 'utf-8')
def test_replace_wrong_encoding(self):
@ -253,18 +259,18 @@ class TextResponseTest(BaseResponseTest):
r = self.response_class("http://www.example.com", encoding='utf-8', body=b'PREFIX\xe3\xabSUFFIX')
# XXX: Policy for replacing invalid chars may suffer minor variations
# but it should always contain the unicode replacement char (u'\ufffd')
assert u'\ufffd' in r.body_as_unicode(), repr(r.body_as_unicode())
assert u'PREFIX' in r.body_as_unicode(), repr(r.body_as_unicode())
assert u'SUFFIX' in r.body_as_unicode(), repr(r.body_as_unicode())
assert u'\ufffd' in r.text, repr(r.text)
assert u'PREFIX' in r.text, repr(r.text)
assert u'SUFFIX' in r.text, repr(r.text)
# Do not destroy html tags due to encoding bugs
r = self.response_class("http://example.com", encoding='utf-8', \
body=b'\xf0<span>value</span>')
assert u'<span>value</span>' in r.body_as_unicode(), repr(r.body_as_unicode())
assert u'<span>value</span>' in r.text, repr(r.text)
# FIXME: This test should pass once we stop using BeautifulSoup's UnicodeDammit in TextResponse
#r = self.response_class("http://www.example.com", body='PREFIX\xe3\xabSUFFIX')
#assert u'\ufffd' in r.body_as_unicode(), repr(r.body_as_unicode())
#r = self.response_class("http://www.example.com", body=b'PREFIX\xe3\xabSUFFIX')
#assert u'\ufffd' in r.text, repr(r.text)
def test_selector(self):
body = b"<html><head><title>Some page</title><body></body></html>"


@ -53,8 +53,8 @@ class MailSenderTest(unittest.TestCase):
self.assertEqual(len(payload), 2)
text, attach = payload
self.assertEqual(text.get_payload(decode=True), 'body')
self.assertEqual(attach.get_payload(decode=True), 'content')
self.assertEqual(text.get_payload(decode=True), b'body')
self.assertEqual(attach.get_payload(decode=True), b'content')
def _catch_mail_sent(self, **kwargs):
self.catched_msg = dict(**kwargs)


@ -16,7 +16,7 @@ class TestOffsiteMiddleware(TestCase):
self.mw.spider_opened(self.spider)
def _get_spiderargs(self):
return dict(name='foo', allowed_domains=['scrapytest.org', 'scrapy.org'])
return dict(name='foo', allowed_domains=['scrapytest.org', 'scrapy.org', 'scrapy.test.org'])
def test_process_spider_output(self):
res = Response('http://scrapytest.org')
@ -24,13 +24,16 @@ class TestOffsiteMiddleware(TestCase):
onsite_reqs = [Request('http://scrapytest.org/1'),
Request('http://scrapy.org/1'),
Request('http://sub.scrapy.org/1'),
Request('http://offsite.tld/letmepass', dont_filter=True)]
Request('http://offsite.tld/letmepass', dont_filter=True),
Request('http://scrapy.test.org/')]
offsite_reqs = [Request('http://scrapy2.org'),
Request('http://offsite.tld/'),
Request('http://offsite.tld/scrapytest.org'),
Request('http://offsite.tld/rogue.scrapytest.org'),
Request('http://rogue.scrapytest.org.haha.com'),
Request('http://roguescrapytest.org')]
Request('http://roguescrapytest.org'),
Request('http://test.org/'),
Request('http://notscrapy.test.org/')]
reqs = onsite_reqs + offsite_reqs
out = list(self.mw.process_spider_output(res, reqs, self.spider))


@ -4,6 +4,8 @@ from twisted.trial import unittest
from scrapy.extensions.spiderstate import SpiderState
from scrapy.spiders import Spider
from scrapy.exceptions import NotConfigured
from scrapy.utils.test import get_crawler
class SpiderStateTest(unittest.TestCase):
@ -34,3 +36,7 @@ class SpiderStateTest(unittest.TestCase):
ss.spider_opened(spider)
self.assertEqual(spider.state, {})
ss.spider_closed(spider)
def test_not_configured(self):
crawler = get_crawler(Spider)
self.assertRaises(NotConfigured, SpiderState.from_crawler, crawler)