Merge remote-tracking branch 'origin/master' into 1550-shell_file-cont
Commit: c6f374f2eb

CODE_OF_CONDUCT.md (new file, 50 lines)
@@ -0,0 +1,50 @@
# Contributor Code of Conduct

As contributors and maintainers of this project, and in the interest of
fostering an open and welcoming community, we pledge to respect all people who
contribute through reporting issues, posting feature requests, updating
documentation, submitting pull requests or patches, and other activities.

We are committed to making participation in this project a harassment-free
experience for everyone, regardless of level of experience, gender, gender
identity and expression, sexual orientation, disability, personal appearance,
body size, race, ethnicity, age, religion, or nationality.

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery
* Personal attacks
* Trolling or insulting/derogatory comments
* Public or private harassment
* Publishing other's private information, such as physical or electronic
addresses, without explicit permission
* Other unethical or unprofessional conduct

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

By adopting this Code of Conduct, project maintainers commit themselves to
fairly and consistently applying these principles to every aspect of managing
this project. Project maintainers who do not follow or enforce the Code of
Conduct may be permanently removed from the project team.

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community.

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting a project maintainer at opensource@scrapinghub.com. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. Maintainers are
obligated to maintain confidentiality with regard to the reporter of an
incident.


This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 1.3.0, available at
[http://contributor-covenant.org/version/1/3/0/][version]

[homepage]: http://contributor-covenant.org
[version]: http://contributor-covenant.org/version/1/3/0/
@@ -73,6 +73,12 @@ See http://scrapy.org/community/
Contributing
============

Please note that this project is released with a Contributor Code of Conduct
(see https://github.com/scrapy/scrapy/blob/master/CODE_OF_CONDUCT.md).

By participating in this project you agree to abide by its terms.
Please report unacceptable behavior to opensource@scrapinghub.com.

See http://doc.scrapy.org/en/master/contributing.html

Companies using Scrapy

@@ -45,7 +45,7 @@ Did Scrapy "steal" X from Django?

Probably, but we don't like that word. We think Django_ is a great open source
project and an example to follow, so we've used it as an inspiration for
Scrapy.
Scrapy.

We believe that, if something is already done well, there's no need to reinvent
it. This concept, besides being one of the foundations for open source and free

@@ -85,6 +85,8 @@ How can I simulate a user login in my spider?

See :ref:`topics-request-response-ref-request-userlogin`.

.. _faq-bfo-dfo:

Does Scrapy crawl in breadth-first or depth-first order?
--------------------------------------------------------
@@ -445,10 +445,10 @@ Response objects

.. attribute:: Response.body

A str containing the body of this Response. Keep in mind that Response.body
is always a str. If you want the unicode version use
:meth:`TextResponse.body_as_unicode` (only available in
:class:`TextResponse` and subclasses).
The body of this Response. Keep in mind that Response.body
is always a bytes object. If you want the unicode version use
:attr:`TextResponse.text` (only available in :class:`TextResponse`
and subclasses).

This attribute is read-only. To change the body of a Response use
:meth:`replace`.
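For instance, a minimal sketch of the bytes/unicode distinction described above, assuming ``response`` is a :class:`TextResponse` for an HTML page (the callback name is illustrative)::

    def parse(self, response):
        raw = response.body    # bytes, exactly as received
        text = response.text   # unicode, decoded with response.encoding
        # text is cached after the first access; the two are related like this:
        assert text == response.body.decode(response.encoding)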
@@ -542,6 +542,21 @@ TextResponse objects

:class:`TextResponse` objects support the following attributes in addition
to the standard :class:`Response` ones:

.. attribute:: TextResponse.text

Response body, as unicode.

The same as ``response.body.decode(response.encoding)``, but the
result is cached after the first call, so you can access
``response.text`` multiple times without extra overhead.

.. note::

``unicode(response.body)`` is not a correct way to convert response
body to unicode: you would be using the system default encoding
(typically `ascii`) instead of the response encoding.


.. attribute:: TextResponse.encoding

A string with the encoding of this response. The encoding is resolved by

@@ -568,20 +583,6 @@ TextResponse objects

:class:`TextResponse` objects support the following methods in addition to
the standard :class:`Response` ones:

.. method:: TextResponse.body_as_unicode()

Returns the body of the response as unicode. This is equivalent to::

response.body.decode(response.encoding)

But **not** equivalent to::

unicode(response.body)

Since, in the latter case, you would be using the system default encoding
(typically `ascii`) to convert the body to unicode, instead of the response
encoding.

.. method:: TextResponse.xpath(query)

A shortcut to ``TextResponse.selector.xpath(query)``::

@@ -594,6 +595,11 @@ TextResponse objects

response.css('p')

.. method:: TextResponse.body_as_unicode()

The same as :attr:`text`, but available as a method. This method is
kept for backwards compatibility; please prefer ``response.text``.


HtmlResponse objects
--------------------
@@ -276,6 +276,8 @@ DEPTH_LIMIT

Default: ``0``

Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

The maximum depth that will be allowed to crawl for any site. If zero, no limit
will be imposed.

@@ -286,9 +288,24 @@ DEPTH_PRIORITY

Default: ``0``

An integer that is used to adjust the request priority based on its depth.
Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

If zero, no priority adjustment is made from depth.
An integer that is used to adjust the request priority based on its depth:

- if zero (default), no priority adjustment is made from depth
- **a positive value will decrease the priority, i.e. higher depth
requests will be processed later** ; this is commonly used when doing
breadth-first crawls (BFO)
- a negative value will increase priority, i.e., higher depth requests
will be processed sooner (DFO)

See also: :ref:`faq-bfo-dfo` about tuning Scrapy for BFO or DFO.

.. note::

This setting adjusts priority **in the opposite way** compared to
other priority settings :setting:`REDIRECT_PRIORITY_ADJUST`
and :setting:`RETRY_PRIORITY_ADJUST`.

.. setting:: DEPTH_STATS
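A breadth-first crawl is commonly configured by combining a positive DEPTH_PRIORITY with FIFO scheduler queues; a hedged sketch of project settings (the queue class paths assume the ``scrapy.squeues`` module of this version)::

    # settings.py sketch for a breadth-first (BFO) crawl
    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'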
@@ -297,6 +314,8 @@ DEPTH_STATS

Default: ``True``

Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

Whether to collect maximum depth stats.

.. setting:: DEPTH_STATS_VERBOSE

@@ -306,6 +325,8 @@ DEPTH_STATS_VERBOSE

Default: ``False``

Scope: ``scrapy.spidermiddlewares.depth.DepthMiddleware``

Whether to collect verbose depth stats. If this is enabled, the number of
requests for each depth is collected in the stats.

@@ -750,8 +771,8 @@ Default: ``60.0``

Scope: ``scrapy.extensions.memusage``

The :ref:`Memory usage extension <topics-extensions-ref-memusage>`
checks the current memory usage, versus the limits set by
:setting:`MEMUSAGE_LIMIT_MB` and :setting:`MEMUSAGE_WARNING_MB`,
checks the current memory usage, versus the limits set by
:setting:`MEMUSAGE_LIMIT_MB` and :setting:`MEMUSAGE_WARNING_MB`,
at fixed time intervals.

This sets the length of these intervals, in seconds.

@@ -864,8 +885,26 @@ REDIRECT_PRIORITY_ADJUST

Default: ``+2``

Adjust redirect request priority relative to original request.
A negative priority adjust means more priority.
Scope: ``scrapy.downloadermiddlewares.redirect.RedirectMiddleware``

Adjust redirect request priority relative to original request:

- **a positive priority adjust (default) means higher priority.**
- a negative priority adjust means lower priority.

.. setting:: RETRY_PRIORITY_ADJUST

RETRY_PRIORITY_ADJUST
---------------------

Default: ``-1``

Scope: ``scrapy.downloadermiddlewares.retry.RetryMiddleware``

Adjust retry request priority relative to original request:

- a positive priority adjust means higher priority.
- **a negative priority adjust (default) means lower priority.**

.. setting:: ROBOTSTXT_OBEY
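As a hedged illustration, a project could override both adjustments in its settings.py (the values below are arbitrary, not recommendations)::

    # Keep redirected requests at the original priority,
    # and push retries even further back in the queue.
    REDIRECT_PRIORITY_ADJUST = 0
    RETRY_PRIORITY_ADJUST = -2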
@@ -877,7 +916,13 @@ Default: ``False``

Scope: ``scrapy.downloadermiddlewares.robotstxt``

If enabled, Scrapy will respect robots.txt policies. For more information see
:ref:`topics-dlmw-robots`
:ref:`topics-dlmw-robots`.

.. note::

While the default value is ``False`` for historical reasons,
this option is enabled by default in settings.py file generated
by ``scrapy startproject`` command.

.. setting:: SCHEDULER

@@ -1036,7 +1081,7 @@ TEMPLATES_DIR

Default: ``templates`` dir inside scrapy module

The directory where to look for templates when creating new projects with
:command:`startproject` command and new spiders with :command:`genspider`
:command:`startproject` command and new spiders with :command:`genspider`
command.

The project name must not conflict with the name of custom files or directories

@@ -273,6 +273,9 @@ OffsiteMiddleware

This middleware filters out every request whose host names aren't in the
spider's :attr:`~scrapy.spiders.Spider.allowed_domains` attribute.
All subdomains of any domain in the list are also allowed.
E.g. the rule ``www.example.org`` will also allow ``bob.www.example.org``
but not ``www2.example.com`` nor ``example.com``.

When your spider returns a request for a domain not belonging to those
covered by the spider, this middleware will log a debug message similar to
@@ -76,7 +76,7 @@ scrapy.Spider

An optional list of strings containing domains that this spider is
allowed to crawl. Requests for URLs not belonging to the domain names
specified in this list won't be followed if
specified in this list (or their subdomains) won't be followed if
:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware` is enabled.

.. attribute:: start_urls

@@ -63,7 +63,7 @@ class AjaxCrawlMiddleware(object):

Return True if a page without hash fragment could be "AJAX crawlable"
according to https://developers.google.com/webmasters/ajax-crawling/docs/getting-started.
"""
body = response.body_as_unicode()[:self.lookup_bytes]
body = response.text[:self.lookup_bytes]
return _has_ajaxcrawlable_meta(body)

@@ -83,8 +83,8 @@ class RobotsTxtMiddleware(object):

def _parse_robots(self, response, netloc):
rp = robotparser.RobotFileParser(response.url)
body = ''
if hasattr(response, 'body_as_unicode'):
body = response.body_as_unicode()
if hasattr(response, 'text'):
body = response.text
else: # last effort try
try:
body = response.body.decode('utf-8')
@@ -3,6 +3,7 @@ Item Exporters are used to export/serialize items into different formats.
"""

import csv
import io
import sys
import pprint
import marshal

@@ -11,7 +12,11 @@ from six.moves import cPickle as pickle
from xml.sax.saxutils import XMLGenerator

from scrapy.utils.serialize import ScrapyJSONEncoder
from scrapy.utils.python import to_bytes, to_unicode, to_native_str, is_listlike
from scrapy.item import BaseItem
from scrapy.exceptions import ScrapyDeprecationWarning
import warnings


__all__ = ['BaseItemExporter', 'PprintItemExporter', 'PickleItemExporter',
'CsvItemExporter', 'XmlItemExporter', 'JsonLinesItemExporter',

@@ -38,7 +43,7 @@ class BaseItemExporter(object):
raise NotImplementedError

def serialize_field(self, field, name, value):
serializer = field.get('serializer', self._to_str_if_unicode)
serializer = field.get('serializer', lambda x: x)
return serializer(value)

def start_exporting(self):

@@ -47,9 +52,6 @@ class BaseItemExporter(object):
def finish_exporting(self):
pass

def _to_str_if_unicode(self, value):
return value.encode(self.encoding) if isinstance(value, unicode) else value

def _get_serialized_fields(self, item, default_value=None, include_empty=None):
"""Return the fields to export as an iterable of tuples
(name, serialized_value)
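With the default serializer now an identity function, per-field conversion is expressed through the ``serializer`` key of a Field declaration. A hedged sketch (the item and serializer names are made up)::

    import scrapy

    def serialize_price(value):
        # Hypothetical serializer: prefix the exported value with a currency tag.
        return 'USD %s' % str(value)

    class Product(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field(serializer=serialize_price)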
@@ -86,10 +88,10 @@ class JsonLinesItemExporter(BaseItemExporter):

def export_item(self, item):
itemdict = dict(self._get_serialized_fields(item))
self.file.write(self.encoder.encode(itemdict) + '\n')
self.file.write(to_bytes(self.encoder.encode(itemdict) + '\n'))


class JsonItemExporter(JsonLinesItemExporter):
class JsonItemExporter(BaseItemExporter):

def __init__(self, file, **kwargs):
self._configure(kwargs, dont_fail=True)

@@ -98,18 +100,18 @@ class JsonItemExporter(JsonLinesItemExporter):
self.first_item = True

def start_exporting(self):
self.file.write("[")
self.file.write(b"[")

def finish_exporting(self):
self.file.write("]")
self.file.write(b"]")

def export_item(self, item):
if self.first_item:
self.first_item = False
else:
self.file.write(',\n')
self.file.write(b',\n')
itemdict = dict(self._get_serialized_fields(item))
self.file.write(self.encoder.encode(itemdict))
self.file.write(to_bytes(self.encoder.encode(itemdict)))


class XmlItemExporter(BaseItemExporter):
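Because the exporter now writes bytes, the target file should be opened in binary mode. A hedged usage sketch (the file name is arbitrary)::

    from scrapy.exporters import JsonItemExporter

    with open('items.json', 'wb') as f:   # binary mode: the exporter writes bytes
        exporter = JsonItemExporter(f)
        exporter.start_exporting()
        exporter.export_item({'name': u'John\xa3', 'age': u'22'})
        exporter.finish_exporting()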
@@ -139,7 +141,7 @@ class XmlItemExporter(BaseItemExporter):
if hasattr(serialized_value, 'items'):
for subname, value in serialized_value.items():
self._export_xml_field(subname, value)
elif hasattr(serialized_value, '__iter__'):
elif is_listlike(serialized_value):
for value in serialized_value:
self._export_xml_field('value', value)
else:

@@ -153,10 +155,10 @@ class XmlItemExporter(BaseItemExporter):
# and Python 3.x will require unicode, so ">= 2.7.4" should be fine.
if sys.version_info[:3] >= (2, 7, 4):
def _xg_characters(self, serialized_value):
if not isinstance(serialized_value, unicode):
if not isinstance(serialized_value, six.text_type):
serialized_value = serialized_value.decode(self.encoding)
return self.xg.characters(serialized_value)
else:
else: # pragma: no cover
def _xg_characters(self, serialized_value):
return self.xg.characters(serialized_value)

@@ -166,17 +168,22 @@ class CsvItemExporter(BaseItemExporter):
def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
self._configure(kwargs, dont_fail=True)
self.include_headers_line = include_headers_line
file = file if six.PY2 else io.TextIOWrapper(file, line_buffering=True)
self.csv_writer = csv.writer(file, **kwargs)
self._headers_not_written = True
self._join_multivalued = join_multivalued

def _to_str_if_unicode(self, value):
def serialize_field(self, field, name, value):
serializer = field.get('serializer', self._join_if_needed)
return serializer(value)

def _join_if_needed(self, value):
if isinstance(value, (list, tuple)):
try:
value = self._join_multivalued.join(value)
return self._join_multivalued.join(value)
except TypeError: # list in value may not contain strings
pass
return super(CsvItemExporter, self)._to_str_if_unicode(value)
return value

def export_item(self, item):
if self._headers_not_written:
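A hedged usage sketch of the CSV exporter; the file is still opened in binary mode and, on Python 3, the exporter wraps it in a text layer itself (names and values are illustrative)::

    from scrapy.exporters import CsvItemExporter

    with open('items.csv', 'wb') as f:
        exporter = CsvItemExporter(f, join_multivalued='|')
        exporter.start_exporting()
        # Multi-valued string fields are joined with the configured separator.
        exporter.export_item({'name': 'John', 'friends': ['Mary', 'Paul']})
        exporter.finish_exporting()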
@@ -185,9 +192,16 @@ class CsvItemExporter(BaseItemExporter):

fields = self._get_serialized_fields(item, default_value='',
include_empty=True)
values = [x[1] for x in fields]
values = list(self._build_row(x for _, x in fields))
self.csv_writer.writerow(values)

def _build_row(self, values):
for s in values:
try:
yield to_native_str(s)
except TypeError:
yield s

def _write_headers_and_set_fields_to_export(self, item):
if self.include_headers_line:
if not self.fields_to_export:

@@ -197,7 +211,8 @@ class CsvItemExporter(BaseItemExporter):
else:
# use fields declared in Item
self.fields_to_export = list(item.fields.keys())
self.csv_writer.writerow(self.fields_to_export)
row = list(self._build_row(self.fields_to_export))
self.csv_writer.writerow(row)


class PickleItemExporter(BaseItemExporter):

@@ -230,7 +245,7 @@ class PprintItemExporter(BaseItemExporter):

def export_item(self, item):
itemdict = dict(self._get_serialized_fields(item))
self.file.write(pprint.pformat(itemdict) + '\n')
self.file.write(to_bytes(pprint.pformat(itemdict) + '\n'))


class PythonItemExporter(BaseItemExporter):

@@ -239,6 +254,13 @@ class PythonItemExporter(BaseItemExporter):
json, msgpack, binc, etc) can be used on top of it. Its main goal is to
seamless support what BaseItemExporter does plus nested items.
"""
def _configure(self, options, dont_fail=False):
self.binary = options.pop('binary', True)
super(PythonItemExporter, self)._configure(options, dont_fail)
if self.binary:
warnings.warn(
"PythonItemExporter will drop support for binary export in the future",
ScrapyDeprecationWarning)

def serialize_field(self, field, name, value):
serializer = field.get('serializer', self._serialize_value)

@@ -249,13 +271,20 @@ class PythonItemExporter(BaseItemExporter):
return self.export_item(value)
if isinstance(value, dict):
return dict(self._serialize_dict(value))
if hasattr(value, '__iter__'):
if is_listlike(value):
return [self._serialize_value(v) for v in value]
return self._to_str_if_unicode(value)
encode_func = to_bytes if self.binary else to_unicode
if isinstance(value, (six.text_type, bytes)):
return encode_func(value, encoding=self.encoding)
return value

def _serialize_dict(self, value):
for key, val in six.iteritems(value):
key = to_bytes(key) if self.binary else key
yield key, self._serialize_value(val)

def export_item(self, item):
return dict(self._get_serialized_fields(item))
result = dict(self._get_serialized_fields(item))
if self.binary:
result = dict(self._serialize_dict(result))
return result
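A hedged illustration of the new ``binary`` flag; the expected values mirror the test case added further below, but treat the exact output as version-dependent::

    from scrapy.exporters import PythonItemExporter

    item = {'name': u'John\xa3', 'age': u'22'}

    # binary=True (the current default, now deprecated) encodes keys and values to bytes:
    PythonItemExporter(binary=True).export_item(item)
    # -> {b'name': b'John\xc2\xa3', b'age': b'22'}

    # binary=False keeps everything as text:
    PythonItemExporter(binary=False).export_item(item)
    # -> {'name': u'John\xa3', 'age': u'22'}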
@@ -9,6 +9,7 @@ from collections import defaultdict
from twisted.internet import reactor

from scrapy import signals
from scrapy.exceptions import NotConfigured


class CloseSpider(object):

@@ -23,6 +24,9 @@ class CloseSpider(object):
'errorcount': crawler.settings.getint('CLOSESPIDER_ERRORCOUNT'),
}

if not any(self.close_on.values()):
raise NotConfigured

self.counter = defaultdict(int)

if self.close_on.get('errorcount'):

@@ -2,6 +2,7 @@ import os
from six.moves import cPickle as pickle

from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.utils.job import job_dir

class SpiderState(object):

@@ -12,7 +13,11 @@ class SpiderState(object):

@classmethod
def from_crawler(cls, crawler):
obj = cls(job_dir(crawler.settings))
jobdir = job_dir(crawler.settings)
if not jobdir:
raise NotConfigured

obj = cls(jobdir)
crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)
crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
return obj
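With this change the SpiderState extension stays disabled unless a job directory is configured. Spider state persistence is enabled by pointing JOBDIR at a directory, e.g. in settings.py (the path below is arbitrary; the same setting can also be passed on the command line with ``-s JOBDIR=...``)::

    # settings.py sketch: persist spider state between runs
    JOBDIR = 'crawls/somespider-1'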
@@ -64,8 +64,8 @@ def _urlencode(seq, enc):

def _get_form(response, formname, formid, formnumber, formxpath):
"""Find the form element """
text = response.body_as_unicode()
root = create_root_node(text, lxml.html.HTMLParser, base_url=get_base_url(response))
root = create_root_node(response.text, lxml.html.HTMLParser,
base_url=get_base_url(response))
forms = root.xpath('//form')
if not forms:
raise ValueError("No <form> element found in %s" % response)

@@ -59,7 +59,12 @@ class TextResponse(Response):

def body_as_unicode(self):
"""Return body as unicode"""
# check for self.encoding before _cached_ubody just in
return self.text

@property
def text(self):
""" Body as unicode """
# access self.encoding before _cached_ubody to make sure
# _body_inferred_encoding is called
benc = self.encoding
if self._cached_ubody is None:
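A simplified sketch of the caching pattern used by the ``text`` property above (attribute names mirror the snippet, but the real implementation does more, e.g. encoding detection)::

    class CachedTextMixin(object):
        # Illustration only: decode once, reuse on later accesses.
        _cached_ubody = None

        @property
        def text(self):
            benc = self.encoding  # may trigger encoding inference as a side effect
            if self._cached_ubody is None:
                self._cached_ubody = self.body.decode(benc)
            return self._cached_ubody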
@@ -28,6 +28,7 @@ class MiddlewareManager(object):
def from_settings(cls, settings, crawler=None):
mwlist = cls._get_mwlist_from_settings(settings)
middlewares = []
enabled = []
for clspath in mwlist:
try:
mwcls = load_object(clspath)

@@ -38,15 +39,17 @@ class MiddlewareManager(object):
else:
mw = mwcls()
middlewares.append(mw)
enabled.append(clspath)
except NotConfigured as e:
if e.args:
clsname = clspath.split('.')[-1]
logger.warning("Disabled %(clsname)s: %(eargs)s",
{'clsname': clsname, 'eargs': e.args[0]},
extra={'crawler': crawler})

logger.info("Enabled %(componentname)ss:\n%(enabledlist)s",
{'componentname': cls.component_name,
'enabledlist': pprint.pformat(mwlist)},
'enabledlist': pprint.pformat(enabled)},
extra={'crawler': crawler})
return cls(*middlewares)

@@ -60,7 +60,7 @@ class Selector(_ParselSelector, object_ref):
response = _response_from_text(text, st)

if response is not None:
text = response.body_as_unicode()
text = response.text
kwargs.setdefault('base_url', response.url)

self.response = response

@@ -18,6 +18,9 @@ NEWSPIDER_MODULE = '$project_name.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = '$project_name (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
@@ -7,11 +7,22 @@ This module must not depend on any module outside the Standard Library.

import copy
import six
import warnings
from collections import OrderedDict

from scrapy.exceptions import ScrapyDeprecationWarning


class MultiValueDictKeyError(KeyError):
pass
def __init__(self, *args, **kwargs):
warnings.warn(
"scrapy.utils.datatypes.MultiValueDictKeyError is deprecated "
"and will be removed in future releases.",
category=ScrapyDeprecationWarning,
stacklevel=2
)
super(MultiValueDictKeyError, self).__init__(*args, **kwargs)


class MultiValueDict(dict):
"""

@@ -31,6 +42,10 @@ class MultiValueDict(dict):
single name-value pairs.
"""
def __init__(self, key_to_list_mapping=()):
warnings.warn("scrapy.utils.datatypes.MultiValueDict is deprecated "
"and will be removed in future releases.",
category=ScrapyDeprecationWarning,
stacklevel=2)
dict.__init__(self, key_to_list_mapping)

def __repr__(self):

@@ -137,10 +152,18 @@ class MultiValueDict(dict):
for key, value in six.iteritems(kwargs):
self.setlistdefault(key, []).append(value)


class SiteNode(object):
"""Class to represent a site node (page, image or any other file)"""

def __init__(self, url):
warnings.warn(
"scrapy.utils.datatypes.SiteNode is deprecated "
"and will be removed in future releases.",
category=ScrapyDeprecationWarning,
stacklevel=2
)

self.url = url
self.itemnames = []
self.children = []
@@ -137,7 +137,7 @@ def _body_or_str(obj, unicode=True):
if not unicode:
return obj.body
elif isinstance(obj, TextResponse):
return obj.body_as_unicode()
return obj.text
else:
return obj.body.decode('utf-8')
elif isinstance(obj, six.text_type):

@@ -25,7 +25,7 @@ _baseurl_cache = weakref.WeakKeyDictionary()
def get_base_url(response):
"""Return the base url of the given response, joined with the response url"""
if response not in _baseurl_cache:
text = response.body_as_unicode()[0:4096]
text = response.text[0:4096]
_baseurl_cache[response] = html.get_base_url(text, response.url,
response.encoding)
return _baseurl_cache[response]

@@ -37,7 +37,7 @@ _metaref_cache = weakref.WeakKeyDictionary()
def get_meta_refresh(response):
"""Parse the http-equiv refrsh parameter from the given response"""
if response not in _metaref_cache:
text = response.body_as_unicode()[0:4096]
text = response.text[0:4096]
text = _noscript_re.sub(u'', text)
text = _script_re.sub(u'', text)
_metaref_cache[response] = html.get_meta_refresh(text, response.url,
@@ -1,10 +1,5 @@
tests/test_exporters.py
tests/test_linkextractors_deprecated.py
tests/test_mail.py
tests/test_pipeline_files.py
tests/test_pipeline_images.py
tests/test_proxy_connect.py
tests/test_spidermiddleware_httperror.py

scrapy/xlib/tx/iweb.py
scrapy/xlib/tx/interfaces.py

@@ -14,12 +9,9 @@ scrapy/xlib/tx/_newclient.py
scrapy/xlib/tx/__init__.py
scrapy/core/downloader/handlers/s3.py
scrapy/core/downloader/handlers/ftp.py
scrapy/pipelines/images.py
scrapy/pipelines/files.py
scrapy/linkextractors/sgml.py
scrapy/linkextractors/regex.py
scrapy/linkextractors/htmlparser.py
scrapy/downloadermiddlewares/cookies.py
scrapy/extensions/statsmailer.py
scrapy/extensions/memusage.py
scrapy/mail.py

@@ -4,6 +4,7 @@ pytest-cov
testfixtures
jmespath
leveldb
boto
# optional for shell wrapper tests
bpython
ipython
@@ -437,6 +437,8 @@ class S3AnonTestCase(unittest.TestCase):
import boto
except ImportError:
skip = 'missing boto library'
if six.PY3:
skip = 'S3 not supported on Py3'

def setUp(self):
self.s3reqh = S3DownloadHandler(Settings(),

@@ -459,6 +461,8 @@ class S3TestCase(unittest.TestCase):
import boto
except ImportError:
skip = 'missing boto library'
if six.PY3:
skip = 'S3 not supported on Py3'

# test use same example keys than amazon developer guide
# http://s3.amazonaws.com/awsdocs/S3/20060301/s3-dg-20060301.pdf

@@ -55,12 +55,11 @@ class TestSpider(Spider):

def parse_item(self, response):
item = self.item_cls()
body = response.body_as_unicode()
m = self.name_re.search(body)
m = self.name_re.search(response.text)
if m:
item['name'] = m.group(1)
item['url'] = response.url
m = self.price_re.search(body)
m = self.price_re.search(response.text)
if m:
item['price'] = m.group(1)
return item
@@ -1,17 +1,21 @@
from __future__ import absolute_import
import re
import json
import marshal
import tempfile
import unittest
from io import BytesIO
from six.moves import cPickle as pickle

import lxml.etree
import six

from scrapy.item import Item, Field
from scrapy.utils.python import to_unicode
from scrapy.exporters import (
BaseItemExporter, PprintItemExporter, PickleItemExporter, CsvItemExporter,
XmlItemExporter, JsonLinesItemExporter, JsonItemExporter, PythonItemExporter
XmlItemExporter, JsonLinesItemExporter, JsonItemExporter,
PythonItemExporter, MarshalItemExporter
)

@@ -23,7 +27,7 @@ class TestItem(Item):
class BaseItemExporterTest(unittest.TestCase):

def setUp(self):
self.i = TestItem(name=u'John\xa3', age='22')
self.i = TestItem(name=u'John\xa3', age=u'22')
self.output = BytesIO()
self.ie = self._get_exporter()

@@ -56,19 +60,19 @@ class BaseItemExporterTest(unittest.TestCase):

def test_serialize_field(self):
res = self.ie.serialize_field(self.i.fields['name'], 'name', self.i['name'])
self.assertEqual(res, 'John\xc2\xa3')
self.assertEqual(res, u'John\xa3')

res = self.ie.serialize_field(self.i.fields['age'], 'age', self.i['age'])
self.assertEqual(res, '22')
self.assertEqual(res, u'22')

def test_fields_to_export(self):
ie = self._get_exporter(fields_to_export=['name'])
self.assertEqual(list(ie._get_serialized_fields(self.i)), [('name', 'John\xc2\xa3')])
self.assertEqual(list(ie._get_serialized_fields(self.i)), [('name', u'John\xa3')])

ie = self._get_exporter(fields_to_export=['name'], encoding='latin-1')
name = list(ie._get_serialized_fields(self.i))[0][1]
assert isinstance(name, str)
self.assertEqual(name, 'John\xa3')
_, name = list(ie._get_serialized_fields(self.i))[0]
assert isinstance(name, six.text_type)
self.assertEqual(name, u'John\xa3')

def test_field_custom_serializer(self):
def custom_serializer(value):

@@ -78,16 +82,20 @@ class BaseItemExporterTest(unittest.TestCase):
name = Field()
age = Field(serializer=custom_serializer)

i = CustomFieldItem(name=u'John\xa3', age='22')
i = CustomFieldItem(name=u'John\xa3', age=u'22')

ie = self._get_exporter()
self.assertEqual(ie.serialize_field(i.fields['name'], 'name', i['name']), 'John\xc2\xa3')
self.assertEqual(ie.serialize_field(i.fields['name'], 'name', i['name']), u'John\xa3')
self.assertEqual(ie.serialize_field(i.fields['age'], 'age', i['age']), '24')


class PythonItemExporterTest(BaseItemExporterTest):
def _get_exporter(self, **kwargs):
return PythonItemExporter(**kwargs)
return PythonItemExporter(binary=False, **kwargs)

def test_invalid_option(self):
with self.assertRaisesRegexp(TypeError, "Unexpected options: invalid_option"):
PythonItemExporter(invalid_option='something')

def test_nested_item(self):
i1 = TestItem(name=u'Joseph', age='22')
@@ -120,6 +128,25 @@ class PythonItemExporterTest(BaseItemExporterTest):
self.assertEqual(type(exported['age'][0]), dict)
self.assertEqual(type(exported['age'][0]['age'][0]), dict)

def test_export_binary(self):
exporter = PythonItemExporter(binary=True)
value = TestItem(name=u'John\xa3', age=u'22')
expected = {b'name': b'John\xc2\xa3', b'age': b'22'}
self.assertEqual(expected, exporter.export_item(value))

def test_other_python_types_item(self):
from datetime import datetime
now = datetime.now()
item = {
'boolean': False,
'number': 22,
'time': now,
'float': 3.14,
}
ie = self._get_exporter()
exported = ie.export_item(item)
self.assertEqual(exported, item)


class PprintItemExporterTest(BaseItemExporterTest):

@@ -152,18 +179,30 @@ class PickleItemExporterTest(BaseItemExporterTest):
self.assertEqual(pickle.load(f), i2)


class CsvItemExporterTest(BaseItemExporterTest):
class MarshalItemExporterTest(BaseItemExporterTest):

def _get_exporter(self, **kwargs):
self.output = tempfile.TemporaryFile()
return MarshalItemExporter(self.output, **kwargs)

def _check_output(self):
self.output.seek(0)
self._assert_expected_item(marshal.load(self.output))


class CsvItemExporterTest(BaseItemExporterTest):
def _get_exporter(self, **kwargs):
return CsvItemExporter(self.output, **kwargs)

def assertCsvEqual(self, first, second, msg=None):
first = to_unicode(first)
second = to_unicode(second)
csvsplit = lambda csv: [sorted(re.split(r'(,|\s+)', line))
for line in csv.splitlines(True)]
return self.assertEqual(csvsplit(first), csvsplit(second), msg)

def _check_output(self):
self.assertCsvEqual(self.output.getvalue(), 'age,name\r\n22,John\xc2\xa3\r\n')
self.assertCsvEqual(to_unicode(self.output.getvalue()), u'age,name\r\n22,John\xa3\r\n')

def assertExportResult(self, item, expected, **kwargs):
fp = BytesIO()

@@ -177,13 +216,13 @@ class CsvItemExporterTest(BaseItemExporterTest):
self.assertExportResult(
item=self.i,
fields_to_export=self.i.fields.keys(),
expected='age,name\r\n22,John\xc2\xa3\r\n',
expected=b'age,name\r\n22,John\xc2\xa3\r\n',
)

def test_header_export_all_dict(self):
self.assertExportResult(
item=dict(self.i),
expected='age,name\r\n22,John\xc2\xa3\r\n',
expected=b'age,name\r\n22,John\xc2\xa3\r\n',
)

def test_header_export_single_field(self):

@@ -191,7 +230,7 @@ class CsvItemExporterTest(BaseItemExporterTest):
self.assertExportResult(
item=item,
fields_to_export=['age'],
expected='age\r\n22\r\n',
expected=b'age\r\n22\r\n',
)

def test_header_export_two_items(self):

@@ -202,14 +241,15 @@ class CsvItemExporterTest(BaseItemExporterTest):
ie.export_item(item)
ie.export_item(item)
ie.finish_exporting()
self.assertCsvEqual(output.getvalue(), 'age,name\r\n22,John\xc2\xa3\r\n22,John\xc2\xa3\r\n')
self.assertCsvEqual(output.getvalue(),
b'age,name\r\n22,John\xc2\xa3\r\n22,John\xc2\xa3\r\n')

def test_header_no_header_line(self):
for item in [self.i, dict(self.i)]:
self.assertExportResult(
item=item,
include_headers_line=False,
expected='22,John\xc2\xa3\r\n',
expected=b'22,John\xc2\xa3\r\n',
)

def test_join_multivalue(self):

@@ -224,6 +264,28 @@ class CsvItemExporterTest(BaseItemExporterTest):
expected='"Mary,Paul",John\r\n',
)

def test_join_multivalue_not_strings(self):
self.assertExportResult(
item=dict(name='John', friends=[4, 8]),
include_headers_line=False,
expected='"[4, 8]",John\r\n',
)

def test_other_python_types_item(self):
from datetime import datetime
now = datetime(2015, 1, 1, 1, 1, 1)
item = {
'boolean': False,
'number': 22,
'time': now,
'float': 3.14,
}
self.assertExportResult(
item=item,
include_headers_line=False,
expected='22,False,3.14,2015-01-01 01:01:01\r\n'
)


class XmlItemExporterTest(BaseItemExporterTest):
@@ -252,13 +314,13 @@ class XmlItemExporterTest(BaseItemExporterTest):
self.assertXmlEquivalent(fp.getvalue(), expected_value)

def _check_output(self):
expected_value = '<?xml version="1.0" encoding="utf-8"?>\n<items><item><age>22</age><name>John\xc2\xa3</name></item></items>'
expected_value = b'<?xml version="1.0" encoding="utf-8"?>\n<items><item><age>22</age><name>John\xc2\xa3</name></item></items>'
self.assertXmlEquivalent(self.output.getvalue(), expected_value)

def test_multivalued_fields(self):
self.assertExportResult(
TestItem(name=[u'John\xa3', u'Doe']),
'<?xml version="1.0" encoding="utf-8"?>\n<items><item><name><value>John\xc2\xa3</value><value>Doe</value></name></item></items>'
b'<?xml version="1.0" encoding="utf-8"?>\n<items><item><name><value>John\xc2\xa3</value><value>Doe</value></name></item></items>'
)

def test_nested_item(self):

@@ -267,19 +329,19 @@ class XmlItemExporterTest(BaseItemExporterTest):
i3 = TestItem(name=u'buz', age=i2)

self.assertExportResult(i3,
'<?xml version="1.0" encoding="utf-8"?>\n'
'<items>'
'<item>'
'<age>'
'<age>'
'<age>22</age>'
'<name>foo\xc2\xa3hoo</name>'
'</age>'
'<name>bar</name>'
'</age>'
'<name>buz</name>'
'</item>'
'</items>'
b'<?xml version="1.0" encoding="utf-8"?>\n'
b'<items>'
b'<item>'
b'<age>'
b'<age>'
b'<age>22</age>'
b'<name>foo\xc2\xa3hoo</name>'
b'</age>'
b'<name>bar</name>'
b'</age>'
b'<name>buz</name>'
b'</item>'
b'</items>'
)

def test_nested_list_item(self):

@@ -288,16 +350,16 @@ class XmlItemExporterTest(BaseItemExporterTest):
i3 = TestItem(name=u'buz', age=[i1, i2])

self.assertExportResult(i3,
'<?xml version="1.0" encoding="utf-8"?>\n'
'<items>'
'<item>'
'<age>'
'<value><name>foo</name></value>'
'<value><name>bar</name><v2><egg><value>spam</value></egg></v2></value>'
'</age>'
'<name>buz</name>'
'</item>'
'</items>'
b'<?xml version="1.0" encoding="utf-8"?>\n'
b'<items>'
b'<item>'
b'<age>'
b'<value><name>foo</name></value>'
b'<value><name>bar</name><v2><egg><value>spam</value></egg></v2></value>'
b'</age>'
b'<name>buz</name>'
b'</item>'
b'</items>'
)


@@ -309,7 +371,7 @@ class JsonLinesItemExporterTest(BaseItemExporterTest):
return JsonLinesItemExporter(self.output, **kwargs)

def _check_output(self):
exported = json.loads(self.output.getvalue().strip())
exported = json.loads(to_unicode(self.output.getvalue().strip()))
self.assertEqual(exported, dict(self.i))

def test_nested_item(self):

@@ -319,7 +381,7 @@ class JsonLinesItemExporterTest(BaseItemExporterTest):
self.ie.start_exporting()
self.ie.export_item(i3)
self.ie.finish_exporting()
exported = json.loads(self.output.getvalue())
exported = json.loads(to_unicode(self.output.getvalue()))
self.assertEqual(exported, self._expected_nested)

def test_extra_keywords(self):

@@ -337,7 +399,7 @@ class JsonItemExporterTest(JsonLinesItemExporterTest):
return JsonItemExporter(self.output, **kwargs)

def _check_output(self):
exported = json.loads(self.output.getvalue().strip())
exported = json.loads(to_unicode(self.output.getvalue().strip()))
self.assertEqual(exported, [dict(self.i)])

def assertTwoItemsExported(self, item):

@@ -345,7 +407,7 @@ class JsonItemExporterTest(JsonLinesItemExporterTest):
self.ie.export_item(item)
self.ie.export_item(item)
self.ie.finish_exporting()
exported = json.loads(self.output.getvalue())
exported = json.loads(to_unicode(self.output.getvalue()))
self.assertEqual(exported, [dict(item), dict(item)])

def test_two_items(self):

@@ -361,7 +423,7 @@ class JsonItemExporterTest(JsonLinesItemExporterTest):
self.ie.start_exporting()
self.ie.export_item(i3)
self.ie.finish_exporting()
exported = json.loads(self.output.getvalue())
exported = json.loads(to_unicode(self.output.getvalue()))
expected = {'name': u'Jesus', 'age': {'name': 'Maria', 'age': dict(i1)}}
self.assertEqual(exported, [expected])

@@ -372,7 +434,7 @@ class JsonItemExporterTest(JsonLinesItemExporterTest):
self.ie.start_exporting()
self.ie.export_item(i3)
self.ie.finish_exporting()
exported = json.loads(self.output.getvalue())
exported = json.loads(to_unicode(self.output.getvalue()))
expected = {'name': u'Jesus', 'age': {'name': 'Maria', 'age': i1}}
self.assertEqual(exported, [expected])
@@ -5,7 +5,6 @@ import json
from io import BytesIO
import tempfile
import shutil
import six
from six.moves.urllib.parse import urlparse

from zope.interface.verify import verifyObject

@@ -22,6 +21,7 @@ from scrapy.extensions.feedexport import (
S3FeedStorage, StdoutFeedStorage
)
from scrapy.utils.test import assert_aws_environ
from scrapy.utils.python import to_native_str


class FileFeedStorageTest(unittest.TestCase):

@@ -120,8 +120,6 @@ class StdoutFeedStorageTest(unittest.TestCase):

class FeedExportTest(unittest.TestCase):

skip = not six.PY2

class MyItem(scrapy.Item):
foo = scrapy.Field()
egg = scrapy.Field()

@@ -170,7 +168,7 @@ class FeedExportTest(unittest.TestCase):
settings.update({'FEED_FORMAT': 'csv'})
data = yield self.exported_data(items, settings)

reader = csv.DictReader(data.splitlines())
reader = csv.DictReader(to_native_str(data).splitlines())
got_rows = list(reader)
if ordered:
self.assertEqual(reader.fieldnames, header)

@@ -184,14 +182,57 @@ class FeedExportTest(unittest.TestCase):
settings = settings or {}
settings.update({'FEED_FORMAT': 'jl'})
data = yield self.exported_data(items, settings)
parsed = [json.loads(line) for line in data.splitlines()]
parsed = [json.loads(to_native_str(line)) for line in data.splitlines()]
rows = [{k: v for k, v in row.items() if v} for row in rows]
self.assertEqual(rows, parsed)

@defer.inlineCallbacks
def assertExportedXml(self, items, rows, settings=None):
settings = settings or {}
settings.update({'FEED_FORMAT': 'xml'})
data = yield self.exported_data(items, settings)
rows = [{k: v for k, v in row.items() if v} for row in rows]
import lxml.etree
root = lxml.etree.fromstring(data)
got_rows = [{e.tag: e.text for e in it} for it in root.findall('item')]
self.assertEqual(rows, got_rows)

def _load_until_eof(self, data, load_func):
bytes_output = BytesIO(data)
result = []
while True:
try:
result.append(load_func(bytes_output))
except EOFError:
break
return result

@defer.inlineCallbacks
def assertExportedPickle(self, items, rows, settings=None):
settings = settings or {}
settings.update({'FEED_FORMAT': 'pickle'})
data = yield self.exported_data(items, settings)
expected = [{k: v for k, v in row.items() if v} for row in rows]
import pickle
result = self._load_until_eof(data, load_func=pickle.load)
self.assertEqual(expected, result)

@defer.inlineCallbacks
def assertExportedMarshal(self, items, rows, settings=None):
settings = settings or {}
settings.update({'FEED_FORMAT': 'marshal'})
data = yield self.exported_data(items, settings)
expected = [{k: v for k, v in row.items() if v} for row in rows]
import marshal
result = self._load_until_eof(data, load_func=marshal.load)
self.assertEqual(expected, result)

@defer.inlineCallbacks
def assertExported(self, items, header, rows, settings=None, ordered=True):
yield self.assertExportedCsv(items, header, rows, settings, ordered)
yield self.assertExportedJsonLines(items, rows, settings)
yield self.assertExportedXml(items, rows, settings)
yield self.assertExportedPickle(items, rows, settings)

@defer.inlineCallbacks
def test_export_items(self):
@@ -107,9 +107,11 @@ class BaseResponseTest(unittest.TestCase):
body_bytes = body

assert isinstance(response.body, bytes)
assert isinstance(response.text, six.text_type)
self._assert_response_encoding(response, encoding)
self.assertEqual(response.body, body_bytes)
self.assertEqual(response.body_as_unicode(), body_unicode)
self.assertEqual(response.text, body_unicode)

def _assert_response_encoding(self, response, encoding):
self.assertEqual(response.encoding, resolve_encoding(encoding))

@@ -171,6 +173,10 @@ class TextResponseTest(BaseResponseTest):
self.assertTrue(isinstance(r1.body_as_unicode(), six.text_type))
self.assertEqual(r1.body_as_unicode(), unicode_string)

# check response.text
self.assertTrue(isinstance(r1.text, six.text_type))
self.assertEqual(r1.text, unicode_string)

def test_encoding(self):
r1 = self.response_class("http://www.example.com", headers={"Content-type": ["text/html; charset=utf-8"]}, body=b"\xc2\xa3")
r2 = self.response_class("http://www.example.com", encoding='utf-8', body=u"\xa3")

@@ -219,12 +225,12 @@ class TextResponseTest(BaseResponseTest):
headers={"Content-type": ["text/html; charset=utf-8"]},
body=b"\xef\xbb\xbfWORD\xe3\xab")
self.assertEqual(r6.encoding, 'utf-8')
self.assertEqual(r6.body_as_unicode(), u'WORD\ufffd\ufffd')
self.assertEqual(r6.text, u'WORD\ufffd\ufffd')

def test_bom_is_removed_from_body(self):
# Inferring encoding from body also cache decoded body as sideeffect,
# this test tries to ensure that calling response.encoding and
# response.body_as_unicode() in indistint order doesn't affect final
# response.text in indistint order doesn't affect final
# values for encoding and decoded body.
url = 'http://example.com'
body = b"\xef\xbb\xbfWORD"

@@ -233,9 +239,9 @@ class TextResponseTest(BaseResponseTest):
# Test response without content-type and BOM encoding
response = self.response_class(url, body=body)
self.assertEqual(response.encoding, 'utf-8')
self.assertEqual(response.body_as_unicode(), u'WORD')
self.assertEqual(response.text, u'WORD')
response = self.response_class(url, body=body)
self.assertEqual(response.body_as_unicode(), u'WORD')
self.assertEqual(response.text, u'WORD')
self.assertEqual(response.encoding, 'utf-8')

# Body caching sideeffect isn't triggered when encoding is declared in

@@ -243,9 +249,9 @@ class TextResponseTest(BaseResponseTest):
# body
response = self.response_class(url, headers=headers, body=body)
self.assertEqual(response.encoding, 'utf-8')
self.assertEqual(response.body_as_unicode(), u'WORD')
self.assertEqual(response.text, u'WORD')
response = self.response_class(url, headers=headers, body=body)
self.assertEqual(response.body_as_unicode(), u'WORD')
self.assertEqual(response.text, u'WORD')
self.assertEqual(response.encoding, 'utf-8')

def test_replace_wrong_encoding(self):

@@ -253,18 +259,18 @@ class TextResponseTest(BaseResponseTest):
r = self.response_class("http://www.example.com", encoding='utf-8', body=b'PREFIX\xe3\xabSUFFIX')
# XXX: Policy for replacing invalid chars may suffer minor variations
# but it should always contain the unicode replacement char (u'\ufffd')
assert u'\ufffd' in r.body_as_unicode(), repr(r.body_as_unicode())
assert u'PREFIX' in r.body_as_unicode(), repr(r.body_as_unicode())
assert u'SUFFIX' in r.body_as_unicode(), repr(r.body_as_unicode())
assert u'\ufffd' in r.text, repr(r.text)
assert u'PREFIX' in r.text, repr(r.text)
assert u'SUFFIX' in r.text, repr(r.text)

# Do not destroy html tags due to encoding bugs
r = self.response_class("http://example.com", encoding='utf-8', \
body=b'\xf0<span>value</span>')
assert u'<span>value</span>' in r.body_as_unicode(), repr(r.body_as_unicode())
assert u'<span>value</span>' in r.text, repr(r.text)

# FIXME: This test should pass once we stop using BeautifulSoup's UnicodeDammit in TextResponse
#r = self.response_class("http://www.example.com", body='PREFIX\xe3\xabSUFFIX')
#assert u'\ufffd' in r.body_as_unicode(), repr(r.body_as_unicode())
#r = self.response_class("http://www.example.com", body=b'PREFIX\xe3\xabSUFFIX')
#assert u'\ufffd' in r.text, repr(r.text)

def test_selector(self):
body = b"<html><head><title>Some page</title><body></body></html>"
@@ -53,8 +53,8 @@ class MailSenderTest(unittest.TestCase):
self.assertEqual(len(payload), 2)

text, attach = payload
self.assertEqual(text.get_payload(decode=True), 'body')
self.assertEqual(attach.get_payload(decode=True), 'content')
self.assertEqual(text.get_payload(decode=True), b'body')
self.assertEqual(attach.get_payload(decode=True), b'content')

def _catch_mail_sent(self, **kwargs):
self.catched_msg = dict(**kwargs)

@@ -16,7 +16,7 @@ class TestOffsiteMiddleware(TestCase):
self.mw.spider_opened(self.spider)

def _get_spiderargs(self):
return dict(name='foo', allowed_domains=['scrapytest.org', 'scrapy.org'])
return dict(name='foo', allowed_domains=['scrapytest.org', 'scrapy.org', 'scrapy.test.org'])

def test_process_spider_output(self):
res = Response('http://scrapytest.org')

@@ -24,13 +24,16 @@ class TestOffsiteMiddleware(TestCase):
onsite_reqs = [Request('http://scrapytest.org/1'),
Request('http://scrapy.org/1'),
Request('http://sub.scrapy.org/1'),
Request('http://offsite.tld/letmepass', dont_filter=True)]
Request('http://offsite.tld/letmepass', dont_filter=True),
Request('http://scrapy.test.org/')]
offsite_reqs = [Request('http://scrapy2.org'),
Request('http://offsite.tld/'),
Request('http://offsite.tld/scrapytest.org'),
Request('http://offsite.tld/rogue.scrapytest.org'),
Request('http://rogue.scrapytest.org.haha.com'),
Request('http://roguescrapytest.org')]
Request('http://roguescrapytest.org'),
Request('http://test.org/'),
Request('http://notscrapy.test.org/')]
reqs = onsite_reqs + offsite_reqs

out = list(self.mw.process_spider_output(res, reqs, self.spider))

@@ -4,6 +4,8 @@ from twisted.trial import unittest

from scrapy.extensions.spiderstate import SpiderState
from scrapy.spiders import Spider
from scrapy.exceptions import NotConfigured
from scrapy.utils.test import get_crawler


class SpiderStateTest(unittest.TestCase):

@@ -34,3 +36,7 @@ class SpiderStateTest(unittest.TestCase):
ss.spider_opened(spider)
self.assertEqual(spider.state, {})
ss.spider_closed(spider)

def test_not_configured(self):
crawler = get_crawler(Spider)
self.assertRaises(NotConfigured, SpiderState.from_crawler, crawler)