Mirror of https://github.com/scrapy/scrapy.git, synced 2025-02-06 11:00:46 +00:00
reapplying black, fixing conflicts and ignoring bandit checks on test directory
This commit is contained in:
parent
23e8b553b4
commit
8ee4817471
@@ -17,3 +17,4 @@ skips:
  - B503
  - B603
  - B605
exclude_dirs: ['tests']
.github/workflows/checks.yml | 4
@@ -14,9 +14,7 @@ jobs:
- python-version: "3.11"
  env:
    TOXENV: flake8
# Pylint requires installing reppy, which does not support Python 3.9
# https://github.com/seomoz/reppy/issues/122
- python-version: 3.8
- python-version: "3.11"
  env:
    TOXENV: pylint
- python-version: 3.7
.github/workflows/publish.yml | 4
@@ -24,8 +24,8 @@ jobs:
- name: Publish to PyPI
  if: steps.check-release-tag.outputs.release_tag == 'true'
  run: |
    pip install --upgrade setuptools wheel twine
    python setup.py sdist bdist_wheel
    pip install --upgrade build twine
    python -m build
    export TWINE_USERNAME=__token__
    export TWINE_PASSWORD=${{ secrets.PYPI_TOKEN }}
    twine upload dist/*
.github/workflows/tests-ubuntu.yml | 5
@@ -38,10 +38,7 @@ jobs:
  env:
    TOXENV: pypy3-pinned

# extras
# extra-deps includes reppy, which does not support Python 3.9
# https://github.com/seomoz/reppy/issues/122
- python-version: 3.8
- python-version: "3.11"
  env:
    TOXENV: extra-deps
@@ -4700,7 +4700,7 @@ Scrapy 0.22.1 (released 2014-02-08)
- BaseSgmlLinkExtractor: Added unit test of a link with an inner tag (:commit:`c1cb418`)
- BaseSgmlLinkExtractor: Fixed unknown_endtag() so that it only set current_link=None when the end tag match the opening tag (:commit:`7e4d627`)
- Fix tests for Travis-CI build (:commit:`76c7e20`)
- replace unencodable codepoints with html entities. fixes #562 and #285 (:commit:`5f87b17`)
- replace unencodeable codepoints with html entities. fixes #562 and #285 (:commit:`5f87b17`)
- RegexLinkExtractor: encode URL unicode value when creating Links (:commit:`d0ee545`)
- Updated the tutorial crawl output with latest output. (:commit:`8da65de`)
- Updated shell docs with the crawler reference and fixed the actual shell output. (:commit:`875b9ab`)
@@ -4725,7 +4725,7 @@ Enhancements
- [**Backward incompatible**] Switched HTTPCacheMiddleware backend to filesystem (:issue:`541`)
  To restore old backend set ``HTTPCACHE_STORAGE`` to ``scrapy.contrib.httpcache.DbmCacheStorage``
- Proxy \https:// urls using CONNECT method (:issue:`392`, :issue:`397`)
- Add a middleware to crawl ajax crawleable pages as defined by google (:issue:`343`)
- Add a middleware to crawl ajax crawlable pages as defined by google (:issue:`343`)
- Rename scrapy.spider.BaseSpider to scrapy.spider.Spider (:issue:`510`, :issue:`519`)
- Selectors register EXSLT namespaces by default (:issue:`472`)
- Unify item loaders similar to selectors renaming (:issue:`461`)
@@ -4905,7 +4905,7 @@ Scrapy 0.18.0 (released 2013-08-09)
-----------------------------------

- Lot of improvements to testsuite run using Tox, including a way to test on pypi
- Handle GET parameters for AJAX crawleable urls (:commit:`3fe2a32`)
- Handle GET parameters for AJAX crawlable urls (:commit:`3fe2a32`)
- Use lxml recover option to parse sitemaps (:issue:`347`)
- Bugfix cookie merging by hostname and not by netloc (:issue:`352`)
- Support disabling ``HttpCompressionMiddleware`` using a flag setting (:issue:`359`)
@@ -4939,8 +4939,8 @@ Scrapy 0.18.0 (released 2013-08-09)
- Added ``--pdb`` option to ``scrapy`` command line tool
- Added :meth:`XPathSelector.remove_namespaces <scrapy.selector.Selector.remove_namespaces>` which allows to remove all namespaces from XML documents for convenience (to work with namespace-less XPaths). Documented in :ref:`topics-selectors`.
- Several improvements to spider contracts
- New default middleware named MetaRefreshMiddldeware that handles meta-refresh html tag redirections,
- MetaRefreshMiddldeware and RedirectMiddleware have different priorities to address #62
- New default middleware named MetaRefreshMiddleware that handles meta-refresh html tag redirections,
- MetaRefreshMiddleware and RedirectMiddleware have different priorities to address #62
- added from_crawler method to spiders
- added system tests with mock server
- more improvements to macOS compatibility (thanks Alex Cepoi)
@@ -5082,7 +5082,7 @@ Scrapy changes:
- promoted :ref:`topics-djangoitem` to main contrib
- LogFormatter method now return dicts(instead of strings) to support lazy formatting (:issue:`164`, :commit:`dcef7b0`)
- downloader handlers (:setting:`DOWNLOAD_HANDLERS` setting) now receive settings as the first argument of the ``__init__`` method
- replaced memory usage acounting with (more portable) `resource`_ module, removed ``scrapy.utils.memory`` module
- replaced memory usage accounting with (more portable) `resource`_ module, removed ``scrapy.utils.memory`` module
- removed signal: ``scrapy.mail.mail_sent``
- removed ``TRACK_REFS`` setting, now :ref:`trackrefs <topics-leaks-trackrefs>` is always enabled
- DBM is now the default storage backend for HTTP cache middleware
@@ -5148,7 +5148,7 @@ Scrapy 0.14
New features and settings
~~~~~~~~~~~~~~~~~~~~~~~~~

- Support for `AJAX crawleable urls`_
- Support for `AJAX crawlable urls`_
- New persistent scheduler that stores requests on disk, allowing to suspend and resume crawls (:rev:`2737`)
- added ``-o`` option to ``scrapy crawl``, a shortcut for dumping scraped items into a file (or standard output using ``-``)
- Added support for passing custom settings to Scrapyd ``schedule.json`` api (:rev:`2779`, :rev:`2783`)
@@ -5408,7 +5408,7 @@ Backward-incompatible changes
- Renamed setting: ``REQUESTS_PER_DOMAIN`` to ``CONCURRENT_REQUESTS_PER_SPIDER`` (:rev:`1830`, :rev:`1844`)
- Renamed setting: ``CONCURRENT_DOMAINS`` to ``CONCURRENT_SPIDERS`` (:rev:`1830`)
- Refactored HTTP Cache middleware
- HTTP Cache middleware has been heavilty refactored, retaining the same functionality except for the domain sectorization which was removed. (:rev:`1843` )
- HTTP Cache middleware has been heavily refactored, retaining the same functionality except for the domain sectorization which was removed. (:rev:`1843` )
- Renamed exception: ``DontCloseDomain`` to ``DontCloseSpider`` (:rev:`1859` | #120)
- Renamed extension: ``DelayedCloseDomain`` to ``SpiderCloseDelay`` (:rev:`1861` | #121)
- Removed obsolete ``scrapy.utils.markup.remove_escape_chars`` function - use ``scrapy.utils.markup.replace_escape_chars`` instead (:rev:`1865`)
@@ -5419,7 +5419,7 @@ Scrapy 0.7
First release of Scrapy.

.. _AJAX crawleable urls: https://developers.google.com/search/docs/ajax-crawling/docs/getting-started?csw=1
.. _AJAX crawlable urls: https://developers.google.com/search/docs/ajax-crawling/docs/getting-started?csw=1
.. _botocore: https://github.com/boto/botocore
.. _chunked transfer encoding: https://en.wikipedia.org/wiki/Chunked_transfer_encoding
.. _ClientForm: http://wwwsearch.sourceforge.net/old/ClientForm/
@@ -636,19 +636,30 @@ DOWNLOAD_DELAY

Default: ``0``

The amount of time (in secs) that the downloader should wait before downloading
consecutive pages from the same website. This can be used to throttle the
crawling speed to avoid hitting servers too hard. Decimal numbers are
supported. Example::
Minimum seconds to wait between 2 consecutive requests to the same domain.

    DOWNLOAD_DELAY = 0.25 # 250 ms of delay
Use :setting:`DOWNLOAD_DELAY` to throttle your crawling speed, to avoid hitting
servers too hard.

Decimal numbers are supported. For example, to send a maximum of 4 requests
every 10 seconds::

    DOWNLOAD_DELAY = 2.5

This setting is also affected by the :setting:`RANDOMIZE_DOWNLOAD_DELAY`
setting (which is enabled by default). By default, Scrapy doesn't wait a fixed
amount of time between requests, but uses a random interval between 0.5 * :setting:`DOWNLOAD_DELAY` and 1.5 * :setting:`DOWNLOAD_DELAY`.
setting, which is enabled by default.

When :setting:`CONCURRENT_REQUESTS_PER_IP` is non-zero, delays are enforced
per ip address instead of per domain.
per IP address instead of per domain.

Note that :setting:`DOWNLOAD_DELAY` can lower the effective per-domain
concurrency below :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`. If the response
time of a domain is lower than :setting:`DOWNLOAD_DELAY`, the effective
concurrency for that domain is 1. When testing throttling configurations, it
usually makes sense to lower :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` first,
and only increase :setting:`DOWNLOAD_DELAY` once
:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` is 1 but a higher throttling is
desired.

.. _spider-download_delay-attribute:

@@ -656,6 +667,11 @@ per ip address instead of per domain.

This delay can be set per spider using :attr:`download_delay` spider attribute.

It is also possible to change this setting per domain, although it requires
non-trivial code. See the implementation of the :ref:`AutoThrottle
<topics-autothrottle>` extension for an example.


.. setting:: DOWNLOAD_HANDLERS

DOWNLOAD_HANDLERS
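An illustrative settings.py sketch of how these settings combine (values are arbitrary examples, not taken from the diff):

    # Wait roughly 2.5 s between requests to the same domain; with
    # RANDOMIZE_DOWNLOAD_DELAY enabled the actual wait is a random value
    # between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
    DOWNLOAD_DELAY = 2.5
    RANDOMIZE_DOWNLOAD_DELAY = True
    CONCURRENT_REQUESTS_PER_DOMAIN = 1
    CONCURRENT_REQUESTS_PER_IP = 0  # non-zero would enforce the delay per IP instead

    # The same delay can also be set per spider via the download_delay attribute:
    # class ExampleSpider(scrapy.Spider):  # hypothetical spider name
    #     name = "example"
    #     download_delay = 5.0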
@@ -99,7 +99,7 @@ scrapy.Spider
.. attribute:: crawler

    This attribute is set by the :meth:`from_crawler` class method after
    initializating the class, and links to the
    initializing the class, and links to the
    :class:`~scrapy.crawler.Crawler` object to which this spider instance is
    bound.
@@ -1,9 +1,10 @@
"""
A spider that generate light requests to meassure QPS throughput
A spider that generate light requests to measure QPS throughput

usage:

    scrapy runspider qpsclient.py --loglevel=INFO --set RANDOMIZE_DOWNLOAD_DELAY=0 --set CONCURRENT_REQUESTS=50 -a qps=10 -a latency=0.3
    scrapy runspider qpsclient.py --loglevel=INFO --set RANDOMIZE_DOWNLOAD_DELAY=0
    --set CONCURRENT_REQUESTS=50 -a qps=10 -a latency=0.3

"""
@@ -24,7 +24,7 @@ class ScrapyArgumentParser(argparse.ArgumentParser):


def _iter_command_classes(module_name):
    # TODO: add `name` attribute to commands and and merge this function with
    # TODO: add `name` attribute to commands and merge this function with
    # scrapy.utils.spider.iter_spider_classes
    for module in walk_modules(module_name):
        for obj in vars(module).values():
@@ -83,7 +83,9 @@ class ScrapyClientContextFactory(BrowserLikePolicyForHTTPS):
    # kept for old-style HTTP/1.0 downloader context twisted calls,
    # e.g. connectSSL()
    def getContext(self, hostname=None, port=None):
        return self.getCertificateOptions().getContext()
        ctx = self.getCertificateOptions().getContext()
        ctx.set_options(0x4)  # OP_LEGACY_SERVER_CONNECT
        return ctx

    def creatorForNetloc(self, hostname, port):
        return ScrapyClientTLSOptions(
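The hard-coded 0x4 is the raw value of OpenSSL's SSL_OP_LEGACY_SERVER_CONNECT flag, which re-enables connections to servers that do not support RFC 5746 secure renegotiation. A minimal pyOpenSSL sketch of the same idea, not taken from the commit (the constant lookup is an assumption about which pyOpenSSL version is installed):

    from OpenSSL import SSL

    ctx = SSL.Context(SSL.SSLv23_METHOD)
    # Prefer the named constant when pyOpenSSL exposes it, fall back to the raw flag.
    ctx.set_options(getattr(SSL, "OP_LEGACY_SERVER_CONNECT", 0x4))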
@ -23,8 +23,8 @@ METHOD_TLSv12 = "TLSv1.2"
|
||||
openssl_methods = {
|
||||
METHOD_TLS: SSL.SSLv23_METHOD, # protocol negotiation (recommended)
|
||||
METHOD_TLSv10: SSL.TLSv1_METHOD, # TLS 1.0 only
|
||||
METHOD_TLSv11: getattr(SSL, "TLSv1_1_METHOD", 5), # TLS 1.1 only
|
||||
METHOD_TLSv12: getattr(SSL, "TLSv1_2_METHOD", 6), # TLS 1.2 only
|
||||
METHOD_TLSv11: SSL.TLSv1_1_METHOD, # TLS 1.1 only
|
||||
METHOD_TLSv12: SSL.TLSv1_2_METHOD, # TLS 1.2 only
|
||||
}
|
||||
|
||||
|
||||
|
@@ -101,7 +101,7 @@ class ScrapyHTTPPageGetter(HTTPClient):
# This class used to inherit from Twisted’s
# twisted.web.client.HTTPClientFactory. When that class was deprecated in
# Twisted (https://github.com/twisted/twisted/pull/643), we merged its
# non-overriden code into this class.
# non-overridden code into this class.
class ScrapyHTTPClientFactory(ClientFactory):

    protocol = ScrapyHTTPPageGetter
@@ -348,7 +348,7 @@ class Stream:

    def receive_headers(self, headers: List[HeaderTuple]) -> None:
        for name, value in headers:
            self._response["headers"][name] = value
            self._response["headers"].appendlist(name, value)

        # Check if we exceed the allowed max data size which can be received
        expected_size = int(self._response["headers"].get(b"Content-Length", -1))
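Switching from item assignment to appendlist is what lets the HTTP/2 handler keep duplicate headers such as Set-Cookie. A standalone sketch of the difference, not taken from the commit:

    from scrapy.http.headers import Headers

    h = Headers()
    h[b"Set-Cookie"] = b"a=b"        # plain assignment keeps only the last value
    h[b"Set-Cookie"] = b"c=d"
    print(h.getlist(b"Set-Cookie"))  # [b'c=d']

    h = Headers()
    h.appendlist(b"Set-Cookie", b"a=b")
    h.appendlist(b"Set-Cookie", b"c=d")
    print(h.getlist(b"Set-Cookie"))  # [b'a=b', b'c=d']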
@@ -384,11 +384,11 @@ class FeedExporter:
        return defer.DeferredList(deferred_list) if deferred_list else None

    def _close_slot(self, slot, spider):
        slot.finish_exporting()
        if not slot.itemcount and not slot.store_empty:
            # We need to call slot.storage.store nonetheless to get the file
            # properly closed.
            return defer.maybeDeferred(slot.storage.store, slot.file)
        slot.finish_exporting()
        logmsg = f"{slot.format} feed ({slot.itemcount} items) in: {slot.uri}"
        d = defer.maybeDeferred(slot.storage.store, slot.file)
@@ -196,7 +196,7 @@ class RFC2616Policy:
        if response.status in (300, 301, 308):
            return self.MAXAGE

        # Insufficient information to compute fresshness lifetime
        # Insufficient information to compute freshness lifetime
        return 0

    def _compute_current_age(self, response, request, now):
@@ -200,7 +200,7 @@ def _select_value(ele: SelectElement, n: str, v: str):
        o = ele.value_options
        return (n, o[0]) if o else (None, None)
    if v is not None and multiple:
        # This is a workround to bug in lxml fixed 2.3.1
        # This is a workaround to bug in lxml fixed 2.3.1
        # fix https://github.com/lxml/lxml/commit/57f49eed82068a20da3db8f1b18ae00c1bab8b12#L1L1139
        selected_options = ele.xpath(".//option[@selected]")
        values = [(o.get("value") or o.text or "").strip() for o in selected_options]
@@ -226,7 +226,8 @@ class LxmlLinkExtractor:
        Only links that match the settings passed to the ``__init__`` method of
        the link extractor are returned.

        Duplicate links are omitted.
        Duplicate links are omitted if the ``unique`` attribute is set to ``True``,
        otherwise they are returned.
        """
        base_url = get_base_url(response)
        if self.restrict_xpaths:
@@ -239,4 +240,6 @@ class LxmlLinkExtractor:
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            all_links.extend(self._process_links(links))
        return unique_list(all_links)
        if self.link_extractor.unique:
            return unique_list(all_links)
        return all_links
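A usage sketch of the behaviour change, not taken from the commit: with unique=False the extractor now returns repeated links instead of silently deduplicating them.

    from scrapy.http import HtmlResponse
    from scrapy.linkextractors import LinkExtractor

    body = b'<a href="sample3.html">one</a><a href="sample3.html">two</a>'
    response = HtmlResponse("http://example.com/", body=body, encoding="utf-8")

    print(len(LinkExtractor(unique=True).extract_links(response)))   # 1
    print(len(LinkExtractor(unique=False).extract_links(response)))  # 2 after this change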
@@ -151,8 +151,8 @@ class ImagesPipeline(FilesPipeline):
        )
        if self._deprecated_convert_image:
            warnings.warn(
                f"{self.__class__.__name__}.convert_image() method overriden in a deprecated way, "
                "overriden method does not accept response_body argument.",
                f"{self.__class__.__name__}.convert_image() method overridden in a deprecated way, "
                "overridden method does not accept response_body argument.",
                category=ScrapyDeprecationWarning,
            )
|
@ -177,7 +177,11 @@ class Shell:
|
||||
|
||||
def inspect_response(response, spider):
|
||||
"""Open a shell to inspect the given response"""
|
||||
# Shell.start removes the SIGINT handler, so save it and re-add it after
|
||||
# the shell has closed
|
||||
sigint_handler = signal.getsignal(signal.SIGINT)
|
||||
Shell(spider.crawler).start(response=response, spider=spider)
|
||||
signal.signal(signal.SIGINT, sigint_handler)
|
||||
|
||||
|
||||
def _request_deferred(request):
|
||||
|
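As a usage reminder (not part of the diff), inspect_response is normally called from a spider callback; with the change above, Ctrl-C keeps working in the crawler after the interactive shell exits:

    import scrapy
    from scrapy.shell import inspect_response

    class ExampleSpider(scrapy.Spider):  # hypothetical spider, for illustration only
        name = "example"

        def parse(self, response):
            if response.status == 403:  # arbitrary condition
                inspect_response(response, self)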
@@ -1,5 +1,4 @@
from functools import wraps
from collections import OrderedDict


def _embed_ipython_shell(namespace={}, banner=""):
@@ -70,14 +69,12 @@ def _embed_standard_shell(namespace={}, banner=""):
    return wrapper


DEFAULT_PYTHON_SHELLS = OrderedDict(
    [
        ("ptpython", _embed_ptpython_shell),
        ("ipython", _embed_ipython_shell),
        ("bpython", _embed_bpython_shell),
        ("python", _embed_standard_shell),
    ]
)
DEFAULT_PYTHON_SHELLS = {
    "ptpython": _embed_ptpython_shell,
    "ipython": _embed_ipython_shell,
    "bpython": _embed_bpython_shell,
    "python": _embed_standard_shell,
}


def get_shell_embed_func(shells=None, known_shells=None):
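The OrderedDict is no longer needed because plain dicts preserve insertion order on every Python version Scrapy still supports (3.7+), so "ptpython" is still tried before "python". A minimal illustration, not from the commit:

    shells = {"ptpython": 1, "ipython": 2, "bpython": 3, "python": 4}
    assert list(shells) == ["ptpython", "ipython", "bpython", "python"]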
@@ -26,10 +26,7 @@ from twisted.python import failure
from twisted.python.failure import Failure

from scrapy.exceptions import IgnoreRequest
from scrapy.utils.reactor import (
    is_asyncio_reactor_installed,
    get_asyncio_event_loop_policy,
)
from scrapy.utils.reactor import is_asyncio_reactor_installed, _get_asyncio_event_loop


def defer_fail(_failure: Failure) -> Deferred:
@@ -290,7 +287,7 @@ def deferred_from_coro(o) -> Any:
            # that use asyncio, e.g. "await asyncio.sleep(1)"
            return ensureDeferred(o)
        # wrapping the coroutine into a Future and then into a Deferred, this requires AsyncioSelectorReactor
        event_loop = get_asyncio_event_loop_policy().get_event_loop()
        event_loop = _get_asyncio_event_loop()
        return Deferred.fromFuture(asyncio.ensure_future(o, loop=event_loop))
    return o

@@ -343,8 +340,7 @@ def deferred_to_future(d: Deferred) -> Future:
        d = treq.get('https://example.com/additional')
        additional_response = await deferred_to_future(d)
    """
    policy = get_asyncio_event_loop_policy()
    return d.asFuture(policy.get_event_loop())
    return d.asFuture(_get_asyncio_event_loop())


def maybe_deferred_to_future(d: Deferred) -> Union[Deferred, Future]:
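A usage sketch for these helpers, not taken from the commit; "some_deferred" stands in for any Twisted Deferred, e.g. one returned by treq or by Scrapy's engine:

    from twisted.internet.defer import Deferred
    from scrapy.utils.defer import maybe_deferred_to_future

    async def wait_for(some_deferred: Deferred):
        # deferred_to_future requires the asyncio reactor; maybe_deferred_to_future
        # also works with other reactors by returning the Deferred unchanged.
        return await maybe_deferred_to_future(some_deferred)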
@@ -1,6 +1,7 @@
import asyncio
import sys
from contextlib import suppress
from warnings import catch_warnings, filterwarnings

from twisted.internet import asyncioreactor, error
@@ -83,6 +84,10 @@ def install_reactor(reactor_path, event_loop_path=None):
        installer()


def _get_asyncio_event_loop():
    return set_asyncio_event_loop(None)


def set_asyncio_event_loop(event_loop_path):
    """Sets and returns the event loop with specified import path."""
    policy = get_asyncio_event_loop_policy()
@@ -92,11 +97,26 @@ def set_asyncio_event_loop(event_loop_path):
        asyncio.set_event_loop(event_loop)
    else:
        try:
            event_loop = policy.get_event_loop()
            with catch_warnings():
                # In Python 3.10.9, 3.11.1, 3.12 and 3.13, a DeprecationWarning
                # is emitted about the lack of a current event loop, because in
                # Python 3.14 and later `get_event_loop` will raise a
                # RuntimeError in that event. Because our code is already
                # prepared for that future behavior, we ignore the deprecation
                # warning.
                filterwarnings(
                    "ignore",
                    message="There is no current event loop",
                    category=DeprecationWarning,
                )
                event_loop = policy.get_event_loop()
        except RuntimeError:
            # `get_event_loop` is expected to fail when called from a new thread
            # with no asyncio event loop yet installed. Such is the case when
            # called from `scrapy shell`
            # `get_event_loop` raises RuntimeError when called with no asyncio
            # event loop yet installed in the following scenarios:
            # - From a thread other than the main thread. For example, when
            #   using ``scrapy shell``.
            # - Previsibly on Python 3.14 and later.
            #   https://github.com/python/cpython/issues/100160#issuecomment-1345581902
            event_loop = policy.new_event_loop()
            asyncio.set_event_loop(event_loop)
    return event_loop
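The same suppress-then-fallback pattern in standalone form, not from the commit, for readers who want to reuse it outside Scrapy:

    import asyncio
    from warnings import catch_warnings, filterwarnings

    def get_or_create_event_loop():
        policy = asyncio.get_event_loop_policy()
        try:
            with catch_warnings():
                # Ignore only the "no current event loop" DeprecationWarning.
                filterwarnings(
                    "ignore",
                    message="There is no current event loop",
                    category=DeprecationWarning,
                )
                return policy.get_event_loop()
        except RuntimeError:
            # No loop installed in this thread (or on future Python versions): create one.
            loop = policy.new_event_loop()
            asyncio.set_event_loop(loop)
            return loop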
@@ -40,7 +40,7 @@ def get_meta_refresh(
    response: "scrapy.http.response.text.TextResponse",
    ignore_tags: Optional[Iterable[str]] = ("script", "noscript"),
) -> Union[Tuple[None, None], Tuple[float, str]]:
    """Parse the http-equiv refrsh parameter from the given response"""
    """Parse the http-equiv refresh parameter from the given response"""
    if response not in _metaref_cache:
        text = response.text[0:4096]
        _metaref_cache[response] = html.get_meta_refresh(
@@ -1,14 +1,9 @@
import OpenSSL
import OpenSSL.SSL
import OpenSSL._util as pyOpenSSLutil

from scrapy.utils.python import to_unicode


# The OpenSSL symbol is present since 1.1.1 but it's not currently supported in any version of pyOpenSSL.
# Using the binding directly, as this code does, requires cryptography 2.4.
SSL_OP_NO_TLSv1_3 = getattr(pyOpenSSLutil.lib, "SSL_OP_NO_TLSv1_3", 0)


def ffi_buf_to_string(buf):
    return to_unicode(pyOpenSSLutil.ffi.string(buf))

@@ -24,11 +19,6 @@ def x509name_to_string(x509name):


def get_temp_key_info(ssl_object):
    if not hasattr(
        pyOpenSSLutil.lib, "SSL_get_server_tmp_key"
    ):  # requires OpenSSL 1.0.2
        return None

    # adapted from OpenSSL apps/s_cb.c::ssl_print_tmp_key()
    temp_key_p = pyOpenSSLutil.ffi.new("EVP_PKEY **")
    if not pyOpenSSLutil.lib.SSL_get_server_tmp_key(ssl_object, temp_key_p):
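The custom SSL_OP_NO_TLSv1_3 shim can go away because current pyOpenSSL exposes the flag itself; the mockserver change below uses SSL.OP_NO_TLSv1_3 directly. A quick check sketch, not from the commit:

    from OpenSSL import SSL

    # True on the pyOpenSSL versions Scrapy now targets; the raw cryptography
    # binding fallback above is no longer required.
    print(hasattr(SSL, "OP_NO_TLSv1_3"))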
@@ -48,7 +48,7 @@ def parse_url(url, encoding=None):

def escape_ajax(url):
    """
    Return the crawleable url according to:
    Return the crawlable url according to:
    https://developers.google.com/webmasters/ajax-crawling/docs/getting-started

    >>> escape_ajax("www.example.com/ajax.html#!key=value")
@@ -148,7 +148,7 @@ Another example could be for building URL canonicalizers:

::

    #!python
    class CanonializeUrl(LegSpider):
    class CanonicalizeUrl(LegSpider):

        def process_request(self, request):
            curl = canonicalize_url(request.url, rules=self.spider.canonicalization_rules)

@@ -321,7 +321,7 @@ Another example could be for building URL canonicalizers:

::

    #!python
    class CanonializeUrl(object):
    class CanonicalizeUrl(object):

        def process_request(self, request, response, spider):
            curl = canonicalize_url(request.url,
@@ -594,18 +594,18 @@ A middleware to Scrape data using Parsley as described in UsingParsley

    class ParsleyExtractor(object):

        def __init__(self, parslet_json_code):
            parslet = json.loads(parselet_json_code)
        def __init__(self, parsley_json_code):
            parsley = json.loads(parselet_json_code)
            class ParsleyItem(Item):
                def __init__(self, *a, **kw):
                    for name in parslet.keys():
                    for name in parsley.keys():
                        self.fields[name] = Field()
                    super(ParsleyItem, self).__init__(*a, **kw)
            self.item_class = ParsleyItem
            self.parsley = PyParsley(parslet, output='python')
            self.parsley = PyParsley(parsley, output='python')

        def process_response(self, response, request, spider):
            return self.item_class(self.parsly.parse(string=response.body))
            return self.item_class(self.parsley.parse(string=response.body))
@@ -79,7 +79,7 @@ If it raises an exception, Scrapy will print it and exit.
Examples::

    def addon_configure(settings):
        settings.overrides['DOWNLADER_MIDDLEWARES'].update({
        settings.overrides['DOWNLOADER_MIDDLEWARES'].update({
            'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
        })
setup.py | 2
@@ -19,7 +19,7 @@ def has_environment_marker_platform_impl_support():

install_requires = [
    "Twisted>=18.9.0",
    "cryptography>=3.3",
    "cryptography>=3.4.6",
    "cssselect>=0.9.1",
    "itemloaders>=1.0.1",
    "parsel>=1.5.0",
@@ -19,7 +19,6 @@ from twisted.web.static import File
from twisted.web.util import redirectTo

from scrapy.utils.python import to_bytes, to_unicode
from scrapy.utils.ssl import SSL_OP_NO_TLSv1_3
from scrapy.utils.test import get_testenv


@@ -358,7 +357,7 @@ def ssl_context_factory(
    if cipher_string:
        ctx = factory.getContext()
        # disabling TLS1.3 because it unconditionally enables some strong ciphers
        ctx.set_options(SSL.OP_CIPHER_SERVER_PREFERENCE | SSL_OP_NO_TLSv1_3)
        ctx.set_options(SSL.OP_CIPHER_SERVER_PREFERENCE | SSL.OP_NO_TLSv1_3)
        ctx.set_cipher_list(to_bytes(cipher_string))
    return factory
@@ -11,6 +11,6 @@ class ZeroDivisionErrorPipeline:
        return item


class ProcessWithZeroDivisionErrorPipiline:
class ProcessWithZeroDivisionErrorPipeline:
    def process_item(self, item, spider):
        1 / 0
@@ -13,6 +13,7 @@
</div>
<a href='http://example.com/sample3.html' title='sample 3'>sample 3 text</a>
<a href='sample3.html'>sample 3 repetition</a>
<a href='sample3.html'>sample 3 repetition</a>
<a href='sample3.html#foo'>sample 3 repetition with fragment</a>
<a href='http://www.google.com/something'></a>
<a href='http://example.com/innertag.html'><strong>inner</strong> tag</a>
@@ -336,7 +336,7 @@ class StartprojectTemplatesTest(ProjectTest):
        self.assertEqual(actual_permissions, expected_permissions)

    def test_startproject_permissions_unchanged_in_destination(self):
        """Check that pre-existing folders and files in the destination folder
        """Check that preexisting folders and files in the destination folder
        do not see their permissions modified."""
        scrapy_path = scrapy.__path__[0]
        project_template = Path(scrapy_path, "templates", "project")
@@ -154,7 +154,7 @@ class CrawlTestCase(TestCase):
            raise unittest.SkipTest("Non-existing hosts are resolvable")
        crawler = get_crawler(SimpleSpider)
        with LogCapture() as log:
            # try to fetch the homepage of a non-existent domain
            # try to fetch the homepage of a nonexistent domain
            yield crawler.crawl(
                "http://dns.resolution.invalid./", mockserver=self.mockserver
            )
@@ -183,7 +183,7 @@ class CrawlTestCase(TestCase):
        self.assertIs(record.exc_info[0], ZeroDivisionError)

    @defer.inlineCallbacks
    def test_start_requests_lazyness(self):
    def test_start_requests_laziness(self):
        settings = {"CONCURRENT_REQUESTS": 1}
        crawler = get_crawler(BrokenStartRequestsSpider, settings)
        yield crawler.crawl(mockserver=self.mockserver)
@@ -209,6 +209,12 @@ class LargeChunkedFileResource(resource.Resource):
        return server.NOT_DONE_YET


class DuplicateHeaderResource(resource.Resource):
    def render(self, request):
        request.responseHeaders.setRawHeaders(b"Set-Cookie", [b"a=b", b"c=d"])
        return b""


class HttpTestCase(unittest.TestCase):
    scheme = "http"
    download_handler_cls: Type = HTTPDownloadHandler
@@ -234,6 +240,7 @@ class HttpTestCase(unittest.TestCase):
        r.putChild(b"contentlength", ContentLengthHeaderResource())
        r.putChild(b"nocontenttype", EmptyContentTypeHeaderResource())
        r.putChild(b"largechunkedfile", LargeChunkedFileResource())
        r.putChild(b"duplicate-header", DuplicateHeaderResource())
        r.putChild(b"echo", Echo())
        self.site = server.Site(r, timeout=None)
        self.wrapper = WrappingFactory(self.site)
@@ -407,6 +414,16 @@ class HttpTestCase(unittest.TestCase):
            HtmlResponse,
        )

    def test_get_duplicate_header(self):
        def _test(response):
            self.assertEqual(
                response.headers.getlist(b"Set-Cookie"),
                [b"a=b", b"c=d"],
            )

        request = Request(self.getURL("duplicate-header"))
        return self.download_request(request, Spider("foo")).addCallback(_test)


class Http10TestCase(HttpTestCase):
    """HTTP 1.0 test case"""
@@ -1095,9 +1112,9 @@ class BaseFTPTestCase(unittest.TestCase):

        return self._add_test_callbacks(d, _test)

    def test_ftp_download_notexist(self):
    def test_ftp_download_nonexistent(self):
        request = Request(
            url=f"ftp://127.0.0.1:{self.portNum}/notexist.txt", meta=self.req_meta
            url=f"ftp://127.0.0.1:{self.portNum}/nonexistent.txt", meta=self.req_meta
        )
        d = self.download_handler.download_request(request, None)
@@ -19,7 +19,7 @@ class UserAgentMiddlewareTest(TestCase):
        self.assertEqual(req.headers["User-Agent"], b"default_useragent")

    def test_remove_agent(self):
        # settings UESR_AGENT to None should remove the user agent
        # settings USER_AGENT to None should remove the user agent
        spider, mw = self.get_spider_and_mw("default_useragent")
        spider.user_agent = None
        mw.spider_opened(spider)
@@ -109,7 +109,7 @@ class DataClassItemsSpider(TestSpider):
class ItemZeroDivisionErrorSpider(TestSpider):
    custom_settings = {
        "ITEM_PIPELINES": {
            "tests.pipelines.ProcessWithZeroDivisionErrorPipiline": 300,
            "tests.pipelines.ProcessWithZeroDivisionErrorPipeline": 300,
        }
    }
@@ -33,8 +33,9 @@ from zope.interface.verify import verifyObject

import scrapy
from scrapy.exceptions import NotConfigured, ScrapyDeprecationWarning
from scrapy.exporters import CsvItemExporter
from scrapy.exporters import CsvItemExporter, JsonItemExporter
from scrapy.extensions.feedexport import (
    _FeedSlot,
    BlockingFeedStorage,
    FeedExporter,
    FileFeedStorage,
@@ -664,6 +665,50 @@ class FeedExportTestBase(ABC, unittest.TestCase):
        return result


class InstrumentedFeedSlot(_FeedSlot):
    """Instrumented _FeedSlot subclass for keeping track of calls to
    start_exporting and finish_exporting."""

    def start_exporting(self):
        self.update_listener("start")
        super().start_exporting()

    def finish_exporting(self):
        self.update_listener("finish")
        super().finish_exporting()

    @classmethod
    def subscribe__listener(cls, listener):
        cls.update_listener = listener.update


class IsExportingListener:
    """When subscribed to InstrumentedFeedSlot, keeps track of when
    a call to start_exporting has been made without a closing call to
    finish_exporting and when a call to finish_exporting has been made
    before a call to start_exporting."""

    def __init__(self):
        self.start_without_finish = False
        self.finish_without_start = False

    def update(self, method):
        if method == "start":
            self.start_without_finish = True
        elif method == "finish":
            if self.start_without_finish:
                self.start_without_finish = False
            else:
                self.finish_before_start = True


class ExceptionJsonItemExporter(JsonItemExporter):
    """JsonItemExporter that throws an exception every time export_item is called."""

    def export_item(self, _):
        raise Exception("foo")


class FeedExportTest(FeedExportTestBase):
    __test__ = True
@@ -909,6 +954,84 @@ class FeedExportTest(FeedExportTestBase):
            data = yield self.exported_no_data(settings)
            self.assertEqual(b"", data[fmt])

    @defer.inlineCallbacks
    def test_start_finish_exporting_items(self):
        items = [
            self.MyItem({"foo": "bar1", "egg": "spam1"}),
        ]
        settings = {
            "FEEDS": {
                self._random_temp_filename(): {"format": "json"},
            },
            "FEED_EXPORT_INDENT": None,
        }

        listener = IsExportingListener()
        InstrumentedFeedSlot.subscribe__listener(listener)

        with mock.patch("scrapy.extensions.feedexport._FeedSlot", InstrumentedFeedSlot):
            _ = yield self.exported_data(items, settings)
            self.assertFalse(listener.start_without_finish)
            self.assertFalse(listener.finish_without_start)

    @defer.inlineCallbacks
    def test_start_finish_exporting_no_items(self):
        items = []
        settings = {
            "FEEDS": {
                self._random_temp_filename(): {"format": "json"},
            },
            "FEED_EXPORT_INDENT": None,
        }

        listener = IsExportingListener()
        InstrumentedFeedSlot.subscribe__listener(listener)

        with mock.patch("scrapy.extensions.feedexport._FeedSlot", InstrumentedFeedSlot):
            _ = yield self.exported_data(items, settings)
            self.assertFalse(listener.start_without_finish)
            self.assertFalse(listener.finish_without_start)

    @defer.inlineCallbacks
    def test_start_finish_exporting_items_exception(self):
        items = [
            self.MyItem({"foo": "bar1", "egg": "spam1"}),
        ]
        settings = {
            "FEEDS": {
                self._random_temp_filename(): {"format": "json"},
            },
            "FEED_EXPORTERS": {"json": ExceptionJsonItemExporter},
            "FEED_EXPORT_INDENT": None,
        }

        listener = IsExportingListener()
        InstrumentedFeedSlot.subscribe__listener(listener)

        with mock.patch("scrapy.extensions.feedexport._FeedSlot", InstrumentedFeedSlot):
            _ = yield self.exported_data(items, settings)
            self.assertFalse(listener.start_without_finish)
            self.assertFalse(listener.finish_without_start)

    @defer.inlineCallbacks
    def test_start_finish_exporting_no_items_exception(self):
        items = []
        settings = {
            "FEEDS": {
                self._random_temp_filename(): {"format": "json"},
            },
            "FEED_EXPORTERS": {"json": ExceptionJsonItemExporter},
            "FEED_EXPORT_INDENT": None,
        }

        listener = IsExportingListener()
        InstrumentedFeedSlot.subscribe__listener(listener)

        with mock.patch("scrapy.extensions.feedexport._FeedSlot", InstrumentedFeedSlot):
            _ = yield self.exported_data(items, settings)
            self.assertFalse(listener.start_without_finish)
            self.assertFalse(listener.finish_without_start)

    @defer.inlineCallbacks
    def test_export_no_items_store_empty(self):
        formats = (
@@ -399,7 +399,7 @@ class RequestTest(unittest.TestCase):
        )
        self.assertEqual(r.method, "DELETE")

        # If `ignore_unknon_options` is set to `False` it raises an error with
        # If `ignore_unknown_options` is set to `False` it raises an error with
        # the unknown options: --foo and -z
        self.assertRaises(
            ValueError,
@@ -997,7 +997,7 @@ class FormRequestTest(RequestTest):
        fs = _qs(r1)
        self.assertEqual(fs, {b"four": [b"4"], b"three": [b"3"]})

    def test_from_response_formname_notexist(self):
    def test_from_response_formname_nonexistent(self):
        response = _buildresponse(
            """<form name="form1" action="post.php" method="POST">
            <input type="hidden" name="one" value="1">
@@ -1044,7 +1044,7 @@ class FormRequestTest(RequestTest):
        fs = _qs(r1)
        self.assertEqual(fs, {b"four": [b"4"], b"three": [b"3"]})

    def test_from_response_formname_notexists_fallback_formid(self):
    def test_from_response_formname_nonexistent_fallback_formid(self):
        response = _buildresponse(
            """<form action="post.php" method="POST">
            <input type="hidden" name="one" value="1">
@@ -1062,7 +1062,7 @@ class FormRequestTest(RequestTest):
        fs = _qs(r1)
        self.assertEqual(fs, {b"four": [b"4"], b"three": [b"3"]})

    def test_from_response_formid_notexist(self):
    def test_from_response_formid_nonexistent(self):
        response = _buildresponse(
            """<form id="form1" action="post.php" method="POST">
            <input type="hidden" name="one" value="1">
@@ -518,7 +518,7 @@ class TextResponseTest(BaseResponseTest):
    def test_bom_is_removed_from_body(self):
        # Inferring encoding from body also cache decoded body as sideeffect,
        # this test tries to ensure that calling response.encoding and
        # response.text in indistint order doesn't affect final
        # response.text in indistinct order doesn't affect final
        # values for encoding and decoded body.
        url = "http://example.com"
        body = b"\xef\xbb\xbfWORD"
@@ -645,6 +645,7 @@ class TextResponseTest(BaseResponseTest):
            "http://example.com/sample2.html",
            "http://example.com/sample3.html",
            "http://example.com/sample3.html",
            "http://example.com/sample3.html",
            "http://example.com/sample3.html#foo",
            "http://www.google.com/something",
            "http://example.com/innertag.html",
@@ -74,6 +74,10 @@ class Base:
                url="http://example.com/sample3.html",
                text="sample 3 repetition",
            ),
            Link(
                url="http://example.com/sample3.html",
                text="sample 3 repetition",
            ),
            Link(
                url="http://example.com/sample3.html#foo",
                text="sample 3 repetition with fragment",
@@ -93,6 +97,10 @@ class Base:
                url="http://example.com/sample3.html",
                text="sample 3 repetition",
            ),
            Link(
                url="http://example.com/sample3.html",
                text="sample 3 repetition",
            ),
            Link(
                url="http://example.com/sample3.html",
                text="sample 3 repetition with fragment",
@@ -225,8 +225,8 @@ class ImagesPipelineTestCase(unittest.TestCase):
        self.assertEqual(buf.getvalue(), thumb_buf.getvalue())

        expected_warning_msg = (
            ".convert_image() method overriden in a deprecated way, "
            "overriden method does not accept response_body argument."
            ".convert_image() method overridden in a deprecated way, "
            "overridden method does not accept response_body argument."
        )
        self.assertEqual(
            len(
@@ -244,7 +244,7 @@ class ImagesPipelineTestCase(unittest.TestCase):
        with warnings.catch_warnings(record=True) as w:
            warnings.simplefilter("always")
            SIZE = (100, 100)
            # straigh forward case: RGB and JPEG
            # straight forward case: RGB and JPEG
            COLOUR = (0, 127, 255)
            im, _ = _create_image("JPEG", "RGB", SIZE, COLOUR)
            converted, _ = self.pipeline.convert_image(im)
@@ -271,7 +271,7 @@ class ImagesPipelineTestCase(unittest.TestCase):
            self.assertEqual(converted.mode, "RGB")
            self.assertEqual(converted.getcolors(), [(10000, (205, 230, 255))])

        # ensure that we recieved deprecation warnings
        # ensure that we received deprecation warnings
        expected_warning_msg = ".convert_image() method called in a deprecated way"
        self.assertTrue(
            len(
@@ -287,7 +287,7 @@ class ImagesPipelineTestCase(unittest.TestCase):
    def test_convert_image_new(self):
        # tests for new API
        SIZE = (100, 100)
        # straigh forward case: RGB and JPEG
        # straight forward case: RGB and JPEG
        COLOUR = (0, 127, 255)
        im, buf = _create_image("JPEG", "RGB", SIZE, COLOUR)
        converted, converted_buf = self.pipeline.convert_image(im, response_body=buf)
@@ -11,12 +11,12 @@ from tests.mockserver import MockServer
from tests.spiders import SingleRequestSpider


OVERRIDEN_URL = "https://example.org"
OVERRIDDEN_URL = "https://example.org"


class ProcessResponseMiddleware:
    def process_response(self, request, response, spider):
        return response.replace(request=Request(OVERRIDEN_URL))
        return response.replace(request=Request(OVERRIDDEN_URL))


class RaiseExceptionRequestMiddleware:
@@ -30,7 +30,7 @@ class CatchExceptionOverrideRequestMiddleware:
        return Response(
            url="http://localhost/",
            body=b"Caught " + exception.__class__.__name__.encode("utf-8"),
            request=Request(OVERRIDEN_URL),
            request=Request(OVERRIDDEN_URL),
        )


@@ -52,7 +52,7 @@ class AlternativeCallbacksSpider(SingleRequestSpider):
class AlternativeCallbacksMiddleware:
    def process_response(self, request, response, spider):
        new_request = request.replace(
            url=OVERRIDEN_URL,
            url=OVERRIDDEN_URL,
            callback=spider.alt_callback,
            cb_kwargs={"foo": "bar"},
        )
@@ -132,16 +132,16 @@ class CrawlTestCase(TestCase):
            yield crawler.crawl(seed=url, mockserver=self.mockserver)

        response = crawler.spider.meta["responses"][0]
        self.assertEqual(response.request.url, OVERRIDEN_URL)
        self.assertEqual(response.request.url, OVERRIDDEN_URL)

        self.assertEqual(signal_params["response"].url, url)
        self.assertEqual(signal_params["request"].url, OVERRIDEN_URL)
        self.assertEqual(signal_params["request"].url, OVERRIDDEN_URL)

        log.check_present(
            (
                "scrapy.core.engine",
                "DEBUG",
                f"Crawled (200) <GET {OVERRIDEN_URL}> (referer: None)",
                f"Crawled (200) <GET {OVERRIDDEN_URL}> (referer: None)",
            ),
        )

@@ -166,7 +166,7 @@ class CrawlTestCase(TestCase):
            yield crawler.crawl(seed=url, mockserver=self.mockserver)
        response = crawler.spider.meta["responses"][0]
        self.assertEqual(response.body, b"Caught ZeroDivisionError")
        self.assertEqual(response.request.url, OVERRIDEN_URL)
        self.assertEqual(response.request.url, OVERRIDDEN_URL)

    @defer.inlineCallbacks
    def test_downloader_middleware_do_not_override_in_process_exception(self):
@@ -227,7 +227,7 @@ class MixinSameOrigin:
        ),
        ("http://example.com:81/page.html", "http://example.com/not-page.html", None),
        ("http://example.com/page.html", "http://example.com:81/not-page.html", None),
        # Different protocols: do NOT send refferer
        # Different protocols: do NOT send referrer
        ("https://example.com/page.html", "http://example.com/not-page.html", None),
        ("https://example.com/page.html", "http://not.example.com/", None),
        ("ftps://example.com/urls.zip", "https://example.com/not-page.html", None),
@@ -750,19 +750,19 @@ class TestRequestMetaUnsafeUrl(MixinUnsafeUrl, TestRefererMiddleware):
    req_meta = {"referrer_policy": POLICY_UNSAFE_URL}


class TestRequestMetaPredecence001(MixinUnsafeUrl, TestRefererMiddleware):
class TestRequestMetaPrecedence001(MixinUnsafeUrl, TestRefererMiddleware):
    settings = {"REFERRER_POLICY": "scrapy.spidermiddlewares.referer.SameOriginPolicy"}
    req_meta = {"referrer_policy": POLICY_UNSAFE_URL}


class TestRequestMetaPredecence002(MixinNoReferrer, TestRefererMiddleware):
class TestRequestMetaPrecedence002(MixinNoReferrer, TestRefererMiddleware):
    settings = {
        "REFERRER_POLICY": "scrapy.spidermiddlewares.referer.NoReferrerWhenDowngradePolicy"
    }
    req_meta = {"referrer_policy": POLICY_NO_REFERRER}


class TestRequestMetaPredecence003(MixinUnsafeUrl, TestRefererMiddleware):
class TestRequestMetaPrecedence003(MixinUnsafeUrl, TestRefererMiddleware):
    settings = {
        "REFERRER_POLICY": "scrapy.spidermiddlewares.referer.OriginWhenCrossOriginPolicy"
    }
@@ -888,19 +888,19 @@ class TestSettingsPolicyByName(TestCase):
            RefererMiddleware(settings)


class TestPolicyHeaderPredecence001(MixinUnsafeUrl, TestRefererMiddleware):
class TestPolicyHeaderPrecedence001(MixinUnsafeUrl, TestRefererMiddleware):
    settings = {"REFERRER_POLICY": "scrapy.spidermiddlewares.referer.SameOriginPolicy"}
    resp_headers = {"Referrer-Policy": POLICY_UNSAFE_URL.upper()}


class TestPolicyHeaderPredecence002(MixinNoReferrer, TestRefererMiddleware):
class TestPolicyHeaderPrecedence002(MixinNoReferrer, TestRefererMiddleware):
    settings = {
        "REFERRER_POLICY": "scrapy.spidermiddlewares.referer.NoReferrerWhenDowngradePolicy"
    }
    resp_headers = {"Referrer-Policy": POLICY_NO_REFERRER.swapcase()}


class TestPolicyHeaderPredecence003(
class TestPolicyHeaderPrecedence003(
    MixinNoReferrerWhenDowngrade, TestRefererMiddleware
):
    settings = {
@@ -909,7 +909,7 @@ class TestPolicyHeaderPredecence003(
    resp_headers = {"Referrer-Policy": POLICY_NO_REFERRER_WHEN_DOWNGRADE.title()}


class TestPolicyHeaderPredecence004(
class TestPolicyHeaderPrecedence004(
    MixinNoReferrerWhenDowngrade, TestRefererMiddleware
):
    """
@@ -15,6 +15,11 @@ class AsyncioTest(TestCase):
        )

    def test_install_asyncio_reactor(self):
        from twisted.internet import reactor as original_reactor

        with warnings.catch_warnings(record=True) as w:
            install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
            self.assertEqual(len(w), 0)
        from twisted.internet import reactor

        assert original_reactor == reactor
@@ -74,7 +74,7 @@ class WarnWhenSubclassedTest(unittest.TestCase):
        self.assertIn("foo.NewClass", str(w[1].message))
        self.assertIn("bar.OldClass", str(w[1].message))

    def test_subclassing_warns_only_on_direct_childs(self):
    def test_subclassing_warns_only_on_direct_children(self):
        Deprecated = create_deprecated_class(
            "Deprecated", NewName, warn_once=False, warn_category=MyWarning
        )
@@ -7,17 +7,27 @@ from scrapy.utils.display import pformat, pprint

class TestDisplay(TestCase):
    object = {"a": 1}
    colorized_string = (
        "{\x1b[33m'\x1b[39;49;00m\x1b[33ma\x1b[39;49;00m\x1b[33m'"
        "\x1b[39;49;00m: \x1b[34m1\x1b[39;49;00m}\n"
    )
    colorized_strings = {
        (
            (
                "{\x1b[33m'\x1b[39;49;00m\x1b[33ma\x1b[39;49;00m\x1b[33m'"
                "\x1b[39;49;00m: \x1b[34m1\x1b[39;49;00m}"
            )
            + suffix
        )
        for suffix in (
            # https://github.com/pygments/pygments/issues/2313
            "\n",  # pygments ≤ 2.13
            "\x1b[37m\x1b[39;49;00m\n",  # pygments ≥ 2.14
        )
    }
    plain_string = "{'a': 1}"

    @mock.patch("sys.platform", "linux")
    @mock.patch("sys.stdout.isatty")
    def test_pformat(self, isatty):
        isatty.return_value = True
        self.assertEqual(pformat(self.object), self.colorized_string)
        self.assertIn(pformat(self.object), self.colorized_strings)

    @mock.patch("sys.stdout.isatty")
    def test_pformat_dont_colorize(self, isatty):
@@ -33,7 +43,7 @@ class TestDisplay(TestCase):
    def test_pformat_old_windows(self, isatty, version):
        isatty.return_value = True
        version.return_value = "10.0.14392"
        self.assertEqual(pformat(self.object), self.colorized_string)
        self.assertIn(pformat(self.object), self.colorized_strings)

    @mock.patch("sys.platform", "win32")
    @mock.patch("scrapy.utils.display._enable_windows_terminal_processing")
@@ -55,7 +65,7 @@ class TestDisplay(TestCase):
        isatty.return_value = True
        version.return_value = "10.0.14393"
        terminal_processing.return_value = True
        self.assertEqual(pformat(self.object), self.colorized_string)
        self.assertIn(pformat(self.object), self.colorized_strings)

    @mock.patch("sys.platform", "linux")
    @mock.patch("sys.stdout.isatty")
@@ -159,7 +159,7 @@ class UtilsPythonTestCase(unittest.TestCase):
        b = Obj()
        # no attributes given return False
        self.assertFalse(equal_attributes(a, b, []))
        # not existent attributes
        # nonexistent attributes
        self.assertFalse(equal_attributes(a, b, ["x", "y"]))

        a.x = 1
tox.ini | 16
@@ -32,7 +32,7 @@ download = true
commands =
    pytest --cov=scrapy --cov-report=xml --cov-report= {posargs:--durations=10 docs scrapy tests}
install_command =
    pip install -U -ctests/upper-constraints.txt {opts} {packages}
    python -I -m pip install -ctests/upper-constraints.txt {opts} {packages}

[testenv:typing]
basepython = python3
@@ -63,8 +63,7 @@ commands =
    flake8 {posargs:docs scrapy tests}

[testenv:pylint]
# reppy does not support Python 3.9+
basepython = python3.8
basepython = python3
deps =
    {[testenv:extra-deps]deps}
    pylint==2.15.6
@@ -75,13 +74,14 @@ commands =
basepython = python3
deps =
    twine==4.0.1
    build==0.9.0
commands =
    python setup.py sdist
    python -m build --sdist
    twine check dist/*

[pinned]
deps =
    cryptography==3.3
    cryptography==3.4.6
    cssselect==0.9.1
    h2==3.0
    itemadapter==0.1.0
@@ -106,7 +106,7 @@ deps =
setenv =
    _SCRAPY_PINNED=true
install_command =
    pip install -U {opts} {packages}
    python -I -m pip install {opts} {packages}

[testenv:pinned]
deps =
@@ -126,8 +126,7 @@ setenv =
    {[pinned]setenv}

[testenv:extra-deps]
# reppy does not support Python 3.9+
basepython = python3.8
basepython = python3
deps =
    {[testenv]deps}
    boto
@@ -135,7 +134,6 @@ deps =
    # Twisted[http2] currently forces old mitmproxy because of h2 version
    # restrictions in their deps, so we need to pin old markupsafe here too.
    markupsafe < 2.1.0
    reppy
    robotexclusionrulesparser
    Pillow>=4.0.0
    Twisted[http2]>=17.9.0