
reapplying black, fixing conflicts and ignoring bandit checks on test directory

This commit is contained in:
Emmanuel Rondan 2023-01-20 10:54:46 -03:00
parent 23e8b553b4
commit 8ee4817471
49 changed files with 326 additions and 139 deletions

View File

@ -17,3 +17,4 @@ skips:
- B503
- B603
- B605
exclude_dirs: ['tests']

View File

@ -14,9 +14,7 @@ jobs:
- python-version: "3.11"
env:
TOXENV: flake8
# Pylint requires installing reppy, which does not support Python 3.9
# https://github.com/seomoz/reppy/issues/122
- python-version: 3.8
- python-version: "3.11"
env:
TOXENV: pylint
- python-version: 3.7

View File

@ -24,8 +24,8 @@ jobs:
- name: Publish to PyPI
if: steps.check-release-tag.outputs.release_tag == 'true'
run: |
pip install --upgrade setuptools wheel twine
python setup.py sdist bdist_wheel
pip install --upgrade build twine
python -m build
export TWINE_USERNAME=__token__
export TWINE_PASSWORD=${{ secrets.PYPI_TOKEN }}
twine upload dist/*

View File

@ -38,10 +38,7 @@ jobs:
env:
TOXENV: pypy3-pinned
# extras
# extra-deps includes reppy, which does not support Python 3.9
# https://github.com/seomoz/reppy/issues/122
- python-version: 3.8
- python-version: "3.11"
env:
TOXENV: extra-deps

View File

@ -4700,7 +4700,7 @@ Scrapy 0.22.1 (released 2014-02-08)
- BaseSgmlLinkExtractor: Added unit test of a link with an inner tag (:commit:`c1cb418`)
- BaseSgmlLinkExtractor: Fixed unknown_endtag() so that it only set current_link=None when the end tag match the opening tag (:commit:`7e4d627`)
- Fix tests for Travis-CI build (:commit:`76c7e20`)
- replace unencodable codepoints with html entities. fixes #562 and #285 (:commit:`5f87b17`)
- replace unencodeable codepoints with html entities. fixes #562 and #285 (:commit:`5f87b17`)
- RegexLinkExtractor: encode URL unicode value when creating Links (:commit:`d0ee545`)
- Updated the tutorial crawl output with latest output. (:commit:`8da65de`)
- Updated shell docs with the crawler reference and fixed the actual shell output. (:commit:`875b9ab`)
@ -4725,7 +4725,7 @@ Enhancements
- [**Backward incompatible**] Switched HTTPCacheMiddleware backend to filesystem (:issue:`541`)
To restore old backend set ``HTTPCACHE_STORAGE`` to ``scrapy.contrib.httpcache.DbmCacheStorage``
- Proxy \https:// urls using CONNECT method (:issue:`392`, :issue:`397`)
- Add a middleware to crawl ajax crawleable pages as defined by google (:issue:`343`)
- Add a middleware to crawl ajax crawlable pages as defined by google (:issue:`343`)
- Rename scrapy.spider.BaseSpider to scrapy.spider.Spider (:issue:`510`, :issue:`519`)
- Selectors register EXSLT namespaces by default (:issue:`472`)
- Unify item loaders similar to selectors renaming (:issue:`461`)
@ -4905,7 +4905,7 @@ Scrapy 0.18.0 (released 2013-08-09)
-----------------------------------
- Lot of improvements to testsuite run using Tox, including a way to test on pypi
- Handle GET parameters for AJAX crawleable urls (:commit:`3fe2a32`)
- Handle GET parameters for AJAX crawlable urls (:commit:`3fe2a32`)
- Use lxml recover option to parse sitemaps (:issue:`347`)
- Bugfix cookie merging by hostname and not by netloc (:issue:`352`)
- Support disabling ``HttpCompressionMiddleware`` using a flag setting (:issue:`359`)
@ -4939,8 +4939,8 @@ Scrapy 0.18.0 (released 2013-08-09)
- Added ``--pdb`` option to ``scrapy`` command line tool
- Added :meth:`XPathSelector.remove_namespaces <scrapy.selector.Selector.remove_namespaces>` which allows to remove all namespaces from XML documents for convenience (to work with namespace-less XPaths). Documented in :ref:`topics-selectors`.
- Several improvements to spider contracts
- New default middleware named MetaRefreshMiddldeware that handles meta-refresh html tag redirections,
- MetaRefreshMiddldeware and RedirectMiddleware have different priorities to address #62
- New default middleware named MetaRefreshMiddleware that handles meta-refresh html tag redirections,
- MetaRefreshMiddleware and RedirectMiddleware have different priorities to address #62
- added from_crawler method to spiders
- added system tests with mock server
- more improvements to macOS compatibility (thanks Alex Cepoi)
@ -5082,7 +5082,7 @@ Scrapy changes:
- promoted :ref:`topics-djangoitem` to main contrib
- LogFormatter method now return dicts(instead of strings) to support lazy formatting (:issue:`164`, :commit:`dcef7b0`)
- downloader handlers (:setting:`DOWNLOAD_HANDLERS` setting) now receive settings as the first argument of the ``__init__`` method
- replaced memory usage acounting with (more portable) `resource`_ module, removed ``scrapy.utils.memory`` module
- replaced memory usage accounting with (more portable) `resource`_ module, removed ``scrapy.utils.memory`` module
- removed signal: ``scrapy.mail.mail_sent``
- removed ``TRACK_REFS`` setting, now :ref:`trackrefs <topics-leaks-trackrefs>` is always enabled
- DBM is now the default storage backend for HTTP cache middleware
@ -5148,7 +5148,7 @@ Scrapy 0.14
New features and settings
~~~~~~~~~~~~~~~~~~~~~~~~~
- Support for `AJAX crawleable urls`_
- Support for `AJAX crawlable urls`_
- New persistent scheduler that stores requests on disk, allowing to suspend and resume crawls (:rev:`2737`)
- added ``-o`` option to ``scrapy crawl``, a shortcut for dumping scraped items into a file (or standard output using ``-``)
- Added support for passing custom settings to Scrapyd ``schedule.json`` api (:rev:`2779`, :rev:`2783`)
@ -5408,7 +5408,7 @@ Backward-incompatible changes
- Renamed setting: ``REQUESTS_PER_DOMAIN`` to ``CONCURRENT_REQUESTS_PER_SPIDER`` (:rev:`1830`, :rev:`1844`)
- Renamed setting: ``CONCURRENT_DOMAINS`` to ``CONCURRENT_SPIDERS`` (:rev:`1830`)
- Refactored HTTP Cache middleware
- HTTP Cache middleware has been heavilty refactored, retaining the same functionality except for the domain sectorization which was removed. (:rev:`1843` )
- HTTP Cache middleware has been heavily refactored, retaining the same functionality except for the domain sectorization which was removed. (:rev:`1843` )
- Renamed exception: ``DontCloseDomain`` to ``DontCloseSpider`` (:rev:`1859` | #120)
- Renamed extension: ``DelayedCloseDomain`` to ``SpiderCloseDelay`` (:rev:`1861` | #121)
- Removed obsolete ``scrapy.utils.markup.remove_escape_chars`` function - use ``scrapy.utils.markup.replace_escape_chars`` instead (:rev:`1865`)
@ -5419,7 +5419,7 @@ Scrapy 0.7
First release of Scrapy.
.. _AJAX crawleable urls: https://developers.google.com/search/docs/ajax-crawling/docs/getting-started?csw=1
.. _AJAX crawlable urls: https://developers.google.com/search/docs/ajax-crawling/docs/getting-started?csw=1
.. _botocore: https://github.com/boto/botocore
.. _chunked transfer encoding: https://en.wikipedia.org/wiki/Chunked_transfer_encoding
.. _ClientForm: http://wwwsearch.sourceforge.net/old/ClientForm/

View File

@ -636,19 +636,30 @@ DOWNLOAD_DELAY
Default: ``0``
The amount of time (in secs) that the downloader should wait before downloading
consecutive pages from the same website. This can be used to throttle the
crawling speed to avoid hitting servers too hard. Decimal numbers are
supported. Example::
Minimum seconds to wait between 2 consecutive requests to the same domain.
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
Use :setting:`DOWNLOAD_DELAY` to throttle your crawling speed, to avoid hitting
servers too hard.
Decimal numbers are supported. For example, to send a maximum of 4 requests
every 10 seconds::
DOWNLOAD_DELAY = 2.5
This setting is also affected by the :setting:`RANDOMIZE_DOWNLOAD_DELAY`
setting (which is enabled by default). By default, Scrapy doesn't wait a fixed
amount of time between requests, but uses a random interval between 0.5 * :setting:`DOWNLOAD_DELAY` and 1.5 * :setting:`DOWNLOAD_DELAY`.
setting, which is enabled by default.
When :setting:`CONCURRENT_REQUESTS_PER_IP` is non-zero, delays are enforced
per ip address instead of per domain.
per IP address instead of per domain.
Note that :setting:`DOWNLOAD_DELAY` can lower the effective per-domain
concurrency below :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`. If the response
time of a domain is lower than :setting:`DOWNLOAD_DELAY`, the effective
concurrency for that domain is 1. When testing throttling configurations, it
usually makes sense to lower :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` first,
and only increase :setting:`DOWNLOAD_DELAY` once
:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` is 1 but a higher throttling is
desired.
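
As a quick illustration of the throttling interplay described above, a minimal settings sketch (the values are illustrative, not taken from this diff):

    # settings.py -- illustrative throttling configuration
    DOWNLOAD_DELAY = 2.5                 # at most ~4 requests every 10 seconds per domain
    RANDOMIZE_DOWNLOAD_DELAY = True      # default; actual wait is 0.5x to 1.5x DOWNLOAD_DELAY
    CONCURRENT_REQUESTS_PER_DOMAIN = 1   # lower this first when tuning throttling
    # CONCURRENT_REQUESTS_PER_IP = 1     # if non-zero, the delay is enforced per IP, not per domain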
.. _spider-download_delay-attribute:
@ -656,6 +667,11 @@ per ip address instead of per domain.
This delay can be set per spider using :attr:`download_delay` spider attribute.
It is also possible to change this setting per domain, although it requires
non-trivial code. See the implementation of the :ref:`AutoThrottle
<topics-autothrottle>` extension for an example.
.. setting:: DOWNLOAD_HANDLERS
DOWNLOAD_HANDLERS

View File

@ -99,7 +99,7 @@ scrapy.Spider
.. attribute:: crawler
This attribute is set by the :meth:`from_crawler` class method after
initializating the class, and links to the
initializing the class, and links to the
:class:`~scrapy.crawler.Crawler` object to which this spider instance is
bound.

View File

@ -1,9 +1,10 @@
"""
A spider that generate light requests to meassure QPS throughput
A spider that generate light requests to measure QPS throughput
usage:
scrapy runspider qpsclient.py --loglevel=INFO --set RANDOMIZE_DOWNLOAD_DELAY=0 --set CONCURRENT_REQUESTS=50 -a qps=10 -a latency=0.3
scrapy runspider qpsclient.py --loglevel=INFO --set RANDOMIZE_DOWNLOAD_DELAY=0
--set CONCURRENT_REQUESTS=50 -a qps=10 -a latency=0.3
"""

View File

@ -24,7 +24,7 @@ class ScrapyArgumentParser(argparse.ArgumentParser):
def _iter_command_classes(module_name):
# TODO: add `name` attribute to commands and and merge this function with
# TODO: add `name` attribute to commands and merge this function with
# scrapy.utils.spider.iter_spider_classes
for module in walk_modules(module_name):
for obj in vars(module).values():

View File

@ -83,7 +83,9 @@ class ScrapyClientContextFactory(BrowserLikePolicyForHTTPS):
# kept for old-style HTTP/1.0 downloader context twisted calls,
# e.g. connectSSL()
def getContext(self, hostname=None, port=None):
return self.getCertificateOptions().getContext()
ctx = self.getCertificateOptions().getContext()
ctx.set_options(0x4) # OP_LEGACY_SERVER_CONNECT
return ctx
def creatorForNetloc(self, hostname, port):
return ScrapyClientTLSOptions(
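
A side note on the getContext() change above: 0x4 is the raw value of OpenSSL's SSL_OP_LEGACY_SERVER_CONNECT option. A hedged sketch of resolving it by name when the installed pyOpenSSL exposes the constant (the getattr fallback is an assumption, not part of the commit):

    from OpenSSL import SSL

    # Fall back to the raw OpenSSL value (0x4) if this pyOpenSSL build
    # does not expose the named constant.
    OP_LEGACY_SERVER_CONNECT = getattr(SSL, "OP_LEGACY_SERVER_CONNECT", 0x4)

    ctx = SSL.Context(SSL.SSLv23_METHOD)
    ctx.set_options(OP_LEGACY_SERVER_CONNECT)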

View File

@ -23,8 +23,8 @@ METHOD_TLSv12 = "TLSv1.2"
openssl_methods = {
METHOD_TLS: SSL.SSLv23_METHOD, # protocol negotiation (recommended)
METHOD_TLSv10: SSL.TLSv1_METHOD, # TLS 1.0 only
METHOD_TLSv11: getattr(SSL, "TLSv1_1_METHOD", 5), # TLS 1.1 only
METHOD_TLSv12: getattr(SSL, "TLSv1_2_METHOD", 6), # TLS 1.2 only
METHOD_TLSv11: SSL.TLSv1_1_METHOD, # TLS 1.1 only
METHOD_TLSv12: SSL.TLSv1_2_METHOD, # TLS 1.2 only
}

View File

@ -101,7 +101,7 @@ class ScrapyHTTPPageGetter(HTTPClient):
# This class used to inherit from Twisteds
# twisted.web.client.HTTPClientFactory. When that class was deprecated in
# Twisted (https://github.com/twisted/twisted/pull/643), we merged its
# non-overriden code into this class.
# non-overridden code into this class.
class ScrapyHTTPClientFactory(ClientFactory):
protocol = ScrapyHTTPPageGetter

View File

@ -348,7 +348,7 @@ class Stream:
def receive_headers(self, headers: List[HeaderTuple]) -> None:
for name, value in headers:
self._response["headers"][name] = value
self._response["headers"].appendlist(name, value)
# Check if we exceed the allowed max data size which can be received
expected_size = int(self._response["headers"].get(b"Content-Length", -1))
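
For context, switching from item assignment to appendlist() above keeps repeated headers such as multiple Set-Cookie values instead of overwriting them. A minimal sketch with Scrapy's Headers class:

    from scrapy.http import Headers

    headers = Headers()
    headers.appendlist(b"Set-Cookie", b"a=b")
    headers.appendlist(b"Set-Cookie", b"c=d")
    # Both values are preserved rather than the second replacing the first.
    assert headers.getlist(b"Set-Cookie") == [b"a=b", b"c=d"]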

View File

@ -384,11 +384,11 @@ class FeedExporter:
return defer.DeferredList(deferred_list) if deferred_list else None
def _close_slot(self, slot, spider):
slot.finish_exporting()
if not slot.itemcount and not slot.store_empty:
# We need to call slot.storage.store nonetheless to get the file
# properly closed.
return defer.maybeDeferred(slot.storage.store, slot.file)
slot.finish_exporting()
logmsg = f"{slot.format} feed ({slot.itemcount} items) in: {slot.uri}"
d = defer.maybeDeferred(slot.storage.store, slot.file)

View File

@ -196,7 +196,7 @@ class RFC2616Policy:
if response.status in (300, 301, 308):
return self.MAXAGE
# Insufficient information to compute fresshness lifetime
# Insufficient information to compute freshness lifetime
return 0
def _compute_current_age(self, response, request, now):

View File

@ -200,7 +200,7 @@ def _select_value(ele: SelectElement, n: str, v: str):
o = ele.value_options
return (n, o[0]) if o else (None, None)
if v is not None and multiple:
# This is a workround to bug in lxml fixed 2.3.1
# This is a workaround to bug in lxml fixed 2.3.1
# fix https://github.com/lxml/lxml/commit/57f49eed82068a20da3db8f1b18ae00c1bab8b12#L1L1139
selected_options = ele.xpath(".//option[@selected]")
values = [(o.get("value") or o.text or "").strip() for o in selected_options]

View File

@ -226,7 +226,8 @@ class LxmlLinkExtractor:
Only links that match the settings passed to the ``__init__`` method of
the link extractor are returned.
Duplicate links are omitted.
Duplicate links are omitted if the ``unique`` attribute is set to ``True``,
otherwise they are returned.
"""
base_url = get_base_url(response)
if self.restrict_xpaths:
@ -239,4 +240,6 @@ class LxmlLinkExtractor:
for doc in docs:
links = self._extract_links(doc, response.url, response.encoding, base_url)
all_links.extend(self._process_links(links))
return unique_list(all_links)
if self.link_extractor.unique:
return unique_list(all_links)
return all_links
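
A hedged usage sketch of the behavior documented above; LinkExtractor and its unique parameter are standard Scrapy API, while the response variable is assumed to already exist in scope:

    from scrapy.linkextractors import LinkExtractor

    # unique defaults to True (duplicates omitted); with unique=False the
    # repeated "sample 3" links in the test fixture are all returned.
    extractor = LinkExtractor(unique=False)
    links = extractor.extract_links(response)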

View File

@ -151,8 +151,8 @@ class ImagesPipeline(FilesPipeline):
)
if self._deprecated_convert_image:
warnings.warn(
f"{self.__class__.__name__}.convert_image() method overriden in a deprecated way, "
"overriden method does not accept response_body argument.",
f"{self.__class__.__name__}.convert_image() method overridden in a deprecated way, "
"overridden method does not accept response_body argument.",
category=ScrapyDeprecationWarning,
)
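
A hedged sketch of the non-deprecated override signature implied by the warning above (the subclass name and body are illustrative):

    from scrapy.pipelines.images import ImagesPipeline

    class MyImagesPipeline(ImagesPipeline):
        # Accepting response_body avoids the deprecation warning emitted for
        # old-style overrides that lack this argument.
        def convert_image(self, image, size=None, response_body=None):
            return super().convert_image(image, size=size, response_body=response_body)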

View File

@ -177,7 +177,11 @@ class Shell:
def inspect_response(response, spider):
"""Open a shell to inspect the given response"""
# Shell.start removes the SIGINT handler, so save it and re-add it after
# the shell has closed
sigint_handler = signal.getsignal(signal.SIGINT)
Shell(spider.crawler).start(response=response, spider=spider)
signal.signal(signal.SIGINT, sigint_handler)
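
The change above saves the SIGINT handler before starting the shell and restores it afterwards. The generic pattern, as a minimal sketch (run_interactive_shell is a hypothetical placeholder):

    import signal

    previous = signal.getsignal(signal.SIGINT)   # remember the current Ctrl-C handler
    try:
        run_interactive_shell()                  # hypothetical stand-in for Shell(...).start(...)
    finally:
        signal.signal(signal.SIGINT, previous)   # restore it even if the shell raises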
def _request_deferred(request):

View File

@ -1,5 +1,4 @@
from functools import wraps
from collections import OrderedDict
def _embed_ipython_shell(namespace={}, banner=""):
@ -70,14 +69,12 @@ def _embed_standard_shell(namespace={}, banner=""):
return wrapper
DEFAULT_PYTHON_SHELLS = OrderedDict(
[
("ptpython", _embed_ptpython_shell),
("ipython", _embed_ipython_shell),
("bpython", _embed_bpython_shell),
("python", _embed_standard_shell),
]
)
DEFAULT_PYTHON_SHELLS = {
"ptpython": _embed_ptpython_shell,
"ipython": _embed_ipython_shell,
"bpython": _embed_bpython_shell,
"python": _embed_standard_shell,
}
def get_shell_embed_func(shells=None, known_shells=None):

View File

@ -26,10 +26,7 @@ from twisted.python import failure
from twisted.python.failure import Failure
from scrapy.exceptions import IgnoreRequest
from scrapy.utils.reactor import (
is_asyncio_reactor_installed,
get_asyncio_event_loop_policy,
)
from scrapy.utils.reactor import is_asyncio_reactor_installed, _get_asyncio_event_loop
def defer_fail(_failure: Failure) -> Deferred:
@ -290,7 +287,7 @@ def deferred_from_coro(o) -> Any:
# that use asyncio, e.g. "await asyncio.sleep(1)"
return ensureDeferred(o)
# wrapping the coroutine into a Future and then into a Deferred, this requires AsyncioSelectorReactor
event_loop = get_asyncio_event_loop_policy().get_event_loop()
event_loop = _get_asyncio_event_loop()
return Deferred.fromFuture(asyncio.ensure_future(o, loop=event_loop))
return o
@ -343,8 +340,7 @@ def deferred_to_future(d: Deferred) -> Future:
d = treq.get('https://example.com/additional')
additional_response = await deferred_to_future(d)
"""
policy = get_asyncio_event_loop_policy()
return d.asFuture(policy.get_event_loop())
return d.asFuture(_get_asyncio_event_loop())
def maybe_deferred_to_future(d: Deferred) -> Union[Deferred, Future]:
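
A usage sketch based on the docstring excerpt above (treq appears there only as an example; an installed asyncio reactor is assumed):

    import treq

    from scrapy.utils.defer import deferred_to_future

    async def parse_additional():
        # Wrap a Twisted Deferred so it can be awaited from a coroutine.
        d = treq.get("https://example.com/additional")
        additional_response = await deferred_to_future(d)
        return additional_response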

View File

@ -1,6 +1,7 @@
import asyncio
import sys
from contextlib import suppress
from warnings import catch_warnings, filterwarnings
from twisted.internet import asyncioreactor, error
@ -83,6 +84,10 @@ def install_reactor(reactor_path, event_loop_path=None):
installer()
def _get_asyncio_event_loop():
return set_asyncio_event_loop(None)
def set_asyncio_event_loop(event_loop_path):
"""Sets and returns the event loop with specified import path."""
policy = get_asyncio_event_loop_policy()
@ -92,11 +97,26 @@ def set_asyncio_event_loop(event_loop_path):
asyncio.set_event_loop(event_loop)
else:
try:
event_loop = policy.get_event_loop()
with catch_warnings():
# In Python 3.10.9, 3.11.1, 3.12 and 3.13, a DeprecationWarning
# is emitted about the lack of a current event loop, because in
# Python 3.14 and later `get_event_loop` will raise a
# RuntimeError in that event. Because our code is already
# prepared for that future behavior, we ignore the deprecation
# warning.
filterwarnings(
"ignore",
message="There is no current event loop",
category=DeprecationWarning,
)
event_loop = policy.get_event_loop()
except RuntimeError:
# `get_event_loop` is expected to fail when called from a new thread
# with no asyncio event loop yet installed. Such is the case when
# called from `scrapy shell`
# `get_event_loop` raises RuntimeError when called with no asyncio
# event loop yet installed in the following scenarios:
# - From a thread other than the main thread. For example, when
# using ``scrapy shell``.
# - Previsibly on Python 3.14 and later.
# https://github.com/python/cpython/issues/100160#issuecomment-1345581902
event_loop = policy.new_event_loop()
asyncio.set_event_loop(event_loop)
return event_loop
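
For reference, a minimal usage sketch of the public helper whose internals are touched above (the reactor path is the same one used by the test suite later in this diff):

    from scrapy.utils.reactor import install_reactor

    # Installs AsyncioSelectorReactor and prepares an asyncio event loop for it;
    # an optional event-loop class import path may be passed as a second argument.
    install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")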

View File

@ -40,7 +40,7 @@ def get_meta_refresh(
response: "scrapy.http.response.text.TextResponse",
ignore_tags: Optional[Iterable[str]] = ("script", "noscript"),
) -> Union[Tuple[None, None], Tuple[float, str]]:
"""Parse the http-equiv refrsh parameter from the given response"""
"""Parse the http-equiv refresh parameter from the given response"""
if response not in _metaref_cache:
text = response.text[0:4096]
_metaref_cache[response] = html.get_meta_refresh(

View File

@ -1,14 +1,9 @@
import OpenSSL
import OpenSSL.SSL
import OpenSSL._util as pyOpenSSLutil
from scrapy.utils.python import to_unicode
# The OpenSSL symbol is present since 1.1.1 but it's not currently supported in any version of pyOpenSSL.
# Using the binding directly, as this code does, requires cryptography 2.4.
SSL_OP_NO_TLSv1_3 = getattr(pyOpenSSLutil.lib, "SSL_OP_NO_TLSv1_3", 0)
def ffi_buf_to_string(buf):
return to_unicode(pyOpenSSLutil.ffi.string(buf))
@ -24,11 +19,6 @@ def x509name_to_string(x509name):
def get_temp_key_info(ssl_object):
if not hasattr(
pyOpenSSLutil.lib, "SSL_get_server_tmp_key"
): # requires OpenSSL 1.0.2
return None
# adapted from OpenSSL apps/s_cb.c::ssl_print_tmp_key()
temp_key_p = pyOpenSSLutil.ffi.new("EVP_PKEY **")
if not pyOpenSSLutil.lib.SSL_get_server_tmp_key(ssl_object, temp_key_p):

View File

@ -48,7 +48,7 @@ def parse_url(url, encoding=None):
def escape_ajax(url):
"""
Return the crawleable url according to:
Return the crawlable url according to:
https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
>>> escape_ajax("www.example.com/ajax.html#!key=value")

View File

@ -148,7 +148,7 @@ Another example could be for building URL canonicalizers:
::
#!python
class CanonializeUrl(LegSpider):
class CanonicalizeUrl(LegSpider):
def process_request(self, request):
curl = canonicalize_url(request.url, rules=self.spider.canonicalization_rules)

View File

@ -321,7 +321,7 @@ Another example could be for building URL canonicalizers:
::
#!python
class CanonializeUrl(object):
class CanonicalizeUrl(object):
def process_request(self, request, response, spider):
curl = canonicalize_url(request.url,
@ -594,18 +594,18 @@ A middleware to Scrape data using Parsley as described in UsingParsley
class ParsleyExtractor(object):
def __init__(self, parslet_json_code):
parslet = json.loads(parselet_json_code)
def __init__(self, parsley_json_code):
parsley = json.loads(parselet_json_code)
class ParsleyItem(Item):
def __init__(self, *a, **kw):
for name in parslet.keys():
for name in parsley.keys():
self.fields[name] = Field()
super(ParsleyItem, self).__init__(*a, **kw)
self.item_class = ParsleyItem
self.parsley = PyParsley(parslet, output='python')
self.parsley = PyParsley(parsley, output='python')
def process_response(self, response, request, spider):
return self.item_class(self.parsly.parse(string=response.body))
return self.item_class(self.parsley.parse(string=response.body))

View File

@ -79,7 +79,7 @@ If it raises an exception, Scrapy will print it and exit.
Examples::
def addon_configure(settings):
settings.overrides['DOWNLADER_MIDDLEWARES'].update({
settings.overrides['DOWNLOADER_MIDDLEWARES'].update({
'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
})

View File

@ -19,7 +19,7 @@ def has_environment_marker_platform_impl_support():
install_requires = [
"Twisted>=18.9.0",
"cryptography>=3.3",
"cryptography>=3.4.6",
"cssselect>=0.9.1",
"itemloaders>=1.0.1",
"parsel>=1.5.0",

View File

@ -19,7 +19,6 @@ from twisted.web.static import File
from twisted.web.util import redirectTo
from scrapy.utils.python import to_bytes, to_unicode
from scrapy.utils.ssl import SSL_OP_NO_TLSv1_3
from scrapy.utils.test import get_testenv
@ -358,7 +357,7 @@ def ssl_context_factory(
if cipher_string:
ctx = factory.getContext()
# disabling TLS1.3 because it unconditionally enables some strong ciphers
ctx.set_options(SSL.OP_CIPHER_SERVER_PREFERENCE | SSL_OP_NO_TLSv1_3)
ctx.set_options(SSL.OP_CIPHER_SERVER_PREFERENCE | SSL.OP_NO_TLSv1_3)
ctx.set_cipher_list(to_bytes(cipher_string))
return factory

View File

@ -11,6 +11,6 @@ class ZeroDivisionErrorPipeline:
return item
class ProcessWithZeroDivisionErrorPipiline:
class ProcessWithZeroDivisionErrorPipeline:
def process_item(self, item, spider):
1 / 0

View File

@ -13,6 +13,7 @@
</div>
<a href='http://example.com/sample3.html' title='sample 3'>sample 3 text</a>
<a href='sample3.html'>sample 3 repetition</a>
<a href='sample3.html'>sample 3 repetition</a>
<a href='sample3.html#foo'>sample 3 repetition with fragment</a>
<a href='http://www.google.com/something'></a>
<a href='http://example.com/innertag.html'><strong>inner</strong> tag</a>

View File

@ -336,7 +336,7 @@ class StartprojectTemplatesTest(ProjectTest):
self.assertEqual(actual_permissions, expected_permissions)
def test_startproject_permissions_unchanged_in_destination(self):
"""Check that pre-existing folders and files in the destination folder
"""Check that preexisting folders and files in the destination folder
do not see their permissions modified."""
scrapy_path = scrapy.__path__[0]
project_template = Path(scrapy_path, "templates", "project")

View File

@ -154,7 +154,7 @@ class CrawlTestCase(TestCase):
raise unittest.SkipTest("Non-existing hosts are resolvable")
crawler = get_crawler(SimpleSpider)
with LogCapture() as log:
# try to fetch the homepage of a non-existent domain
# try to fetch the homepage of a nonexistent domain
yield crawler.crawl(
"http://dns.resolution.invalid./", mockserver=self.mockserver
)
@ -183,7 +183,7 @@ class CrawlTestCase(TestCase):
self.assertIs(record.exc_info[0], ZeroDivisionError)
@defer.inlineCallbacks
def test_start_requests_lazyness(self):
def test_start_requests_laziness(self):
settings = {"CONCURRENT_REQUESTS": 1}
crawler = get_crawler(BrokenStartRequestsSpider, settings)
yield crawler.crawl(mockserver=self.mockserver)

View File

@ -209,6 +209,12 @@ class LargeChunkedFileResource(resource.Resource):
return server.NOT_DONE_YET
class DuplicateHeaderResource(resource.Resource):
def render(self, request):
request.responseHeaders.setRawHeaders(b"Set-Cookie", [b"a=b", b"c=d"])
return b""
class HttpTestCase(unittest.TestCase):
scheme = "http"
download_handler_cls: Type = HTTPDownloadHandler
@ -234,6 +240,7 @@ class HttpTestCase(unittest.TestCase):
r.putChild(b"contentlength", ContentLengthHeaderResource())
r.putChild(b"nocontenttype", EmptyContentTypeHeaderResource())
r.putChild(b"largechunkedfile", LargeChunkedFileResource())
r.putChild(b"duplicate-header", DuplicateHeaderResource())
r.putChild(b"echo", Echo())
self.site = server.Site(r, timeout=None)
self.wrapper = WrappingFactory(self.site)
@ -407,6 +414,16 @@ class HttpTestCase(unittest.TestCase):
HtmlResponse,
)
def test_get_duplicate_header(self):
def _test(response):
self.assertEqual(
response.headers.getlist(b"Set-Cookie"),
[b"a=b", b"c=d"],
)
request = Request(self.getURL("duplicate-header"))
return self.download_request(request, Spider("foo")).addCallback(_test)
class Http10TestCase(HttpTestCase):
"""HTTP 1.0 test case"""
@ -1095,9 +1112,9 @@ class BaseFTPTestCase(unittest.TestCase):
return self._add_test_callbacks(d, _test)
def test_ftp_download_notexist(self):
def test_ftp_download_nonexistent(self):
request = Request(
url=f"ftp://127.0.0.1:{self.portNum}/notexist.txt", meta=self.req_meta
url=f"ftp://127.0.0.1:{self.portNum}/nonexistent.txt", meta=self.req_meta
)
d = self.download_handler.download_request(request, None)

View File

@ -19,7 +19,7 @@ class UserAgentMiddlewareTest(TestCase):
self.assertEqual(req.headers["User-Agent"], b"default_useragent")
def test_remove_agent(self):
# settings UESR_AGENT to None should remove the user agent
# settings USER_AGENT to None should remove the user agent
spider, mw = self.get_spider_and_mw("default_useragent")
spider.user_agent = None
mw.spider_opened(spider)

View File

@ -109,7 +109,7 @@ class DataClassItemsSpider(TestSpider):
class ItemZeroDivisionErrorSpider(TestSpider):
custom_settings = {
"ITEM_PIPELINES": {
"tests.pipelines.ProcessWithZeroDivisionErrorPipiline": 300,
"tests.pipelines.ProcessWithZeroDivisionErrorPipeline": 300,
}
}

View File

@ -33,8 +33,9 @@ from zope.interface.verify import verifyObject
import scrapy
from scrapy.exceptions import NotConfigured, ScrapyDeprecationWarning
from scrapy.exporters import CsvItemExporter
from scrapy.exporters import CsvItemExporter, JsonItemExporter
from scrapy.extensions.feedexport import (
_FeedSlot,
BlockingFeedStorage,
FeedExporter,
FileFeedStorage,
@ -664,6 +665,50 @@ class FeedExportTestBase(ABC, unittest.TestCase):
return result
class InstrumentedFeedSlot(_FeedSlot):
"""Instrumented _FeedSlot subclass for keeping track of calls to
start_exporting and finish_exporting."""
def start_exporting(self):
self.update_listener("start")
super().start_exporting()
def finish_exporting(self):
self.update_listener("finish")
super().finish_exporting()
@classmethod
def subscribe__listener(cls, listener):
cls.update_listener = listener.update
class IsExportingListener:
"""When subscribed to InstrumentedFeedSlot, keeps track of when
a call to start_exporting has been made without a closing call to
finish_exporting and when a call to finish_exporting has been made
before a call to start_exporting."""
def __init__(self):
self.start_without_finish = False
self.finish_without_start = False
def update(self, method):
if method == "start":
self.start_without_finish = True
elif method == "finish":
if self.start_without_finish:
self.start_without_finish = False
else:
self.finish_before_start = True
class ExceptionJsonItemExporter(JsonItemExporter):
"""JsonItemExporter that throws an exception every time export_item is called."""
def export_item(self, _):
raise Exception("foo")
class FeedExportTest(FeedExportTestBase):
__test__ = True
@ -909,6 +954,84 @@ class FeedExportTest(FeedExportTestBase):
data = yield self.exported_no_data(settings)
self.assertEqual(b"", data[fmt])
@defer.inlineCallbacks
def test_start_finish_exporting_items(self):
items = [
self.MyItem({"foo": "bar1", "egg": "spam1"}),
]
settings = {
"FEEDS": {
self._random_temp_filename(): {"format": "json"},
},
"FEED_EXPORT_INDENT": None,
}
listener = IsExportingListener()
InstrumentedFeedSlot.subscribe__listener(listener)
with mock.patch("scrapy.extensions.feedexport._FeedSlot", InstrumentedFeedSlot):
_ = yield self.exported_data(items, settings)
self.assertFalse(listener.start_without_finish)
self.assertFalse(listener.finish_without_start)
@defer.inlineCallbacks
def test_start_finish_exporting_no_items(self):
items = []
settings = {
"FEEDS": {
self._random_temp_filename(): {"format": "json"},
},
"FEED_EXPORT_INDENT": None,
}
listener = IsExportingListener()
InstrumentedFeedSlot.subscribe__listener(listener)
with mock.patch("scrapy.extensions.feedexport._FeedSlot", InstrumentedFeedSlot):
_ = yield self.exported_data(items, settings)
self.assertFalse(listener.start_without_finish)
self.assertFalse(listener.finish_without_start)
@defer.inlineCallbacks
def test_start_finish_exporting_items_exception(self):
items = [
self.MyItem({"foo": "bar1", "egg": "spam1"}),
]
settings = {
"FEEDS": {
self._random_temp_filename(): {"format": "json"},
},
"FEED_EXPORTERS": {"json": ExceptionJsonItemExporter},
"FEED_EXPORT_INDENT": None,
}
listener = IsExportingListener()
InstrumentedFeedSlot.subscribe__listener(listener)
with mock.patch("scrapy.extensions.feedexport._FeedSlot", InstrumentedFeedSlot):
_ = yield self.exported_data(items, settings)
self.assertFalse(listener.start_without_finish)
self.assertFalse(listener.finish_without_start)
@defer.inlineCallbacks
def test_start_finish_exporting_no_items_exception(self):
items = []
settings = {
"FEEDS": {
self._random_temp_filename(): {"format": "json"},
},
"FEED_EXPORTERS": {"json": ExceptionJsonItemExporter},
"FEED_EXPORT_INDENT": None,
}
listener = IsExportingListener()
InstrumentedFeedSlot.subscribe__listener(listener)
with mock.patch("scrapy.extensions.feedexport._FeedSlot", InstrumentedFeedSlot):
_ = yield self.exported_data(items, settings)
self.assertFalse(listener.start_without_finish)
self.assertFalse(listener.finish_without_start)
@defer.inlineCallbacks
def test_export_no_items_store_empty(self):
formats = (

View File

@ -399,7 +399,7 @@ class RequestTest(unittest.TestCase):
)
self.assertEqual(r.method, "DELETE")
# If `ignore_unknon_options` is set to `False` it raises an error with
# If `ignore_unknown_options` is set to `False` it raises an error with
# the unknown options: --foo and -z
self.assertRaises(
ValueError,
@ -997,7 +997,7 @@ class FormRequestTest(RequestTest):
fs = _qs(r1)
self.assertEqual(fs, {b"four": [b"4"], b"three": [b"3"]})
def test_from_response_formname_notexist(self):
def test_from_response_formname_nonexistent(self):
response = _buildresponse(
"""<form name="form1" action="post.php" method="POST">
<input type="hidden" name="one" value="1">
@ -1044,7 +1044,7 @@ class FormRequestTest(RequestTest):
fs = _qs(r1)
self.assertEqual(fs, {b"four": [b"4"], b"three": [b"3"]})
def test_from_response_formname_notexists_fallback_formid(self):
def test_from_response_formname_nonexistent_fallback_formid(self):
response = _buildresponse(
"""<form action="post.php" method="POST">
<input type="hidden" name="one" value="1">
@ -1062,7 +1062,7 @@ class FormRequestTest(RequestTest):
fs = _qs(r1)
self.assertEqual(fs, {b"four": [b"4"], b"three": [b"3"]})
def test_from_response_formid_notexist(self):
def test_from_response_formid_nonexistent(self):
response = _buildresponse(
"""<form id="form1" action="post.php" method="POST">
<input type="hidden" name="one" value="1">

View File

@ -518,7 +518,7 @@ class TextResponseTest(BaseResponseTest):
def test_bom_is_removed_from_body(self):
# Inferring encoding from body also cache decoded body as sideeffect,
# this test tries to ensure that calling response.encoding and
# response.text in indistint order doesn't affect final
# response.text in indistinct order doesn't affect final
# values for encoding and decoded body.
url = "http://example.com"
body = b"\xef\xbb\xbfWORD"
@ -645,6 +645,7 @@ class TextResponseTest(BaseResponseTest):
"http://example.com/sample2.html",
"http://example.com/sample3.html",
"http://example.com/sample3.html",
"http://example.com/sample3.html",
"http://example.com/sample3.html#foo",
"http://www.google.com/something",
"http://example.com/innertag.html",

View File

@ -74,6 +74,10 @@ class Base:
url="http://example.com/sample3.html",
text="sample 3 repetition",
),
Link(
url="http://example.com/sample3.html",
text="sample 3 repetition",
),
Link(
url="http://example.com/sample3.html#foo",
text="sample 3 repetition with fragment",
@ -93,6 +97,10 @@ class Base:
url="http://example.com/sample3.html",
text="sample 3 repetition",
),
Link(
url="http://example.com/sample3.html",
text="sample 3 repetition",
),
Link(
url="http://example.com/sample3.html",
text="sample 3 repetition with fragment",

View File

@ -225,8 +225,8 @@ class ImagesPipelineTestCase(unittest.TestCase):
self.assertEqual(buf.getvalue(), thumb_buf.getvalue())
expected_warning_msg = (
".convert_image() method overriden in a deprecated way, "
"overriden method does not accept response_body argument."
".convert_image() method overridden in a deprecated way, "
"overridden method does not accept response_body argument."
)
self.assertEqual(
len(
@ -244,7 +244,7 @@ class ImagesPipelineTestCase(unittest.TestCase):
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter("always")
SIZE = (100, 100)
# straigh forward case: RGB and JPEG
# straight forward case: RGB and JPEG
COLOUR = (0, 127, 255)
im, _ = _create_image("JPEG", "RGB", SIZE, COLOUR)
converted, _ = self.pipeline.convert_image(im)
@ -271,7 +271,7 @@ class ImagesPipelineTestCase(unittest.TestCase):
self.assertEqual(converted.mode, "RGB")
self.assertEqual(converted.getcolors(), [(10000, (205, 230, 255))])
# ensure that we recieved deprecation warnings
# ensure that we received deprecation warnings
expected_warning_msg = ".convert_image() method called in a deprecated way"
self.assertTrue(
len(
@ -287,7 +287,7 @@ class ImagesPipelineTestCase(unittest.TestCase):
def test_convert_image_new(self):
# tests for new API
SIZE = (100, 100)
# straigh forward case: RGB and JPEG
# straight forward case: RGB and JPEG
COLOUR = (0, 127, 255)
im, buf = _create_image("JPEG", "RGB", SIZE, COLOUR)
converted, converted_buf = self.pipeline.convert_image(im, response_body=buf)

View File

@ -11,12 +11,12 @@ from tests.mockserver import MockServer
from tests.spiders import SingleRequestSpider
OVERRIDEN_URL = "https://example.org"
OVERRIDDEN_URL = "https://example.org"
class ProcessResponseMiddleware:
def process_response(self, request, response, spider):
return response.replace(request=Request(OVERRIDEN_URL))
return response.replace(request=Request(OVERRIDDEN_URL))
class RaiseExceptionRequestMiddleware:
@ -30,7 +30,7 @@ class CatchExceptionOverrideRequestMiddleware:
return Response(
url="http://localhost/",
body=b"Caught " + exception.__class__.__name__.encode("utf-8"),
request=Request(OVERRIDEN_URL),
request=Request(OVERRIDDEN_URL),
)
@ -52,7 +52,7 @@ class AlternativeCallbacksSpider(SingleRequestSpider):
class AlternativeCallbacksMiddleware:
def process_response(self, request, response, spider):
new_request = request.replace(
url=OVERRIDEN_URL,
url=OVERRIDDEN_URL,
callback=spider.alt_callback,
cb_kwargs={"foo": "bar"},
)
@ -132,16 +132,16 @@ class CrawlTestCase(TestCase):
yield crawler.crawl(seed=url, mockserver=self.mockserver)
response = crawler.spider.meta["responses"][0]
self.assertEqual(response.request.url, OVERRIDEN_URL)
self.assertEqual(response.request.url, OVERRIDDEN_URL)
self.assertEqual(signal_params["response"].url, url)
self.assertEqual(signal_params["request"].url, OVERRIDEN_URL)
self.assertEqual(signal_params["request"].url, OVERRIDDEN_URL)
log.check_present(
(
"scrapy.core.engine",
"DEBUG",
f"Crawled (200) <GET {OVERRIDEN_URL}> (referer: None)",
f"Crawled (200) <GET {OVERRIDDEN_URL}> (referer: None)",
),
)
@ -166,7 +166,7 @@ class CrawlTestCase(TestCase):
yield crawler.crawl(seed=url, mockserver=self.mockserver)
response = crawler.spider.meta["responses"][0]
self.assertEqual(response.body, b"Caught ZeroDivisionError")
self.assertEqual(response.request.url, OVERRIDEN_URL)
self.assertEqual(response.request.url, OVERRIDDEN_URL)
@defer.inlineCallbacks
def test_downloader_middleware_do_not_override_in_process_exception(self):

View File

@ -227,7 +227,7 @@ class MixinSameOrigin:
),
("http://example.com:81/page.html", "http://example.com/not-page.html", None),
("http://example.com/page.html", "http://example.com:81/not-page.html", None),
# Different protocols: do NOT send refferer
# Different protocols: do NOT send referrer
("https://example.com/page.html", "http://example.com/not-page.html", None),
("https://example.com/page.html", "http://not.example.com/", None),
("ftps://example.com/urls.zip", "https://example.com/not-page.html", None),
@ -750,19 +750,19 @@ class TestRequestMetaUnsafeUrl(MixinUnsafeUrl, TestRefererMiddleware):
req_meta = {"referrer_policy": POLICY_UNSAFE_URL}
class TestRequestMetaPredecence001(MixinUnsafeUrl, TestRefererMiddleware):
class TestRequestMetaPrecedence001(MixinUnsafeUrl, TestRefererMiddleware):
settings = {"REFERRER_POLICY": "scrapy.spidermiddlewares.referer.SameOriginPolicy"}
req_meta = {"referrer_policy": POLICY_UNSAFE_URL}
class TestRequestMetaPredecence002(MixinNoReferrer, TestRefererMiddleware):
class TestRequestMetaPrecedence002(MixinNoReferrer, TestRefererMiddleware):
settings = {
"REFERRER_POLICY": "scrapy.spidermiddlewares.referer.NoReferrerWhenDowngradePolicy"
}
req_meta = {"referrer_policy": POLICY_NO_REFERRER}
class TestRequestMetaPredecence003(MixinUnsafeUrl, TestRefererMiddleware):
class TestRequestMetaPrecedence003(MixinUnsafeUrl, TestRefererMiddleware):
settings = {
"REFERRER_POLICY": "scrapy.spidermiddlewares.referer.OriginWhenCrossOriginPolicy"
}
@ -888,19 +888,19 @@ class TestSettingsPolicyByName(TestCase):
RefererMiddleware(settings)
class TestPolicyHeaderPredecence001(MixinUnsafeUrl, TestRefererMiddleware):
class TestPolicyHeaderPrecedence001(MixinUnsafeUrl, TestRefererMiddleware):
settings = {"REFERRER_POLICY": "scrapy.spidermiddlewares.referer.SameOriginPolicy"}
resp_headers = {"Referrer-Policy": POLICY_UNSAFE_URL.upper()}
class TestPolicyHeaderPredecence002(MixinNoReferrer, TestRefererMiddleware):
class TestPolicyHeaderPrecedence002(MixinNoReferrer, TestRefererMiddleware):
settings = {
"REFERRER_POLICY": "scrapy.spidermiddlewares.referer.NoReferrerWhenDowngradePolicy"
}
resp_headers = {"Referrer-Policy": POLICY_NO_REFERRER.swapcase()}
class TestPolicyHeaderPredecence003(
class TestPolicyHeaderPrecedence003(
MixinNoReferrerWhenDowngrade, TestRefererMiddleware
):
settings = {
@ -909,7 +909,7 @@ class TestPolicyHeaderPredecence003(
resp_headers = {"Referrer-Policy": POLICY_NO_REFERRER_WHEN_DOWNGRADE.title()}
class TestPolicyHeaderPredecence004(
class TestPolicyHeaderPrecedence004(
MixinNoReferrerWhenDowngrade, TestRefererMiddleware
):
"""

View File

@ -15,6 +15,11 @@ class AsyncioTest(TestCase):
)
def test_install_asyncio_reactor(self):
from twisted.internet import reactor as original_reactor
with warnings.catch_warnings(record=True) as w:
install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
self.assertEqual(len(w), 0)
from twisted.internet import reactor
assert original_reactor == reactor

View File

@ -74,7 +74,7 @@ class WarnWhenSubclassedTest(unittest.TestCase):
self.assertIn("foo.NewClass", str(w[1].message))
self.assertIn("bar.OldClass", str(w[1].message))
def test_subclassing_warns_only_on_direct_childs(self):
def test_subclassing_warns_only_on_direct_children(self):
Deprecated = create_deprecated_class(
"Deprecated", NewName, warn_once=False, warn_category=MyWarning
)

View File

@ -7,17 +7,27 @@ from scrapy.utils.display import pformat, pprint
class TestDisplay(TestCase):
object = {"a": 1}
colorized_string = (
"{\x1b[33m'\x1b[39;49;00m\x1b[33ma\x1b[39;49;00m\x1b[33m'"
"\x1b[39;49;00m: \x1b[34m1\x1b[39;49;00m}\n"
)
colorized_strings = {
(
(
"{\x1b[33m'\x1b[39;49;00m\x1b[33ma\x1b[39;49;00m\x1b[33m'"
"\x1b[39;49;00m: \x1b[34m1\x1b[39;49;00m}"
)
+ suffix
)
for suffix in (
# https://github.com/pygments/pygments/issues/2313
"\n", # pygments ≤ 2.13
"\x1b[37m\x1b[39;49;00m\n", # pygments ≥ 2.14
)
}
plain_string = "{'a': 1}"
@mock.patch("sys.platform", "linux")
@mock.patch("sys.stdout.isatty")
def test_pformat(self, isatty):
isatty.return_value = True
self.assertEqual(pformat(self.object), self.colorized_string)
self.assertIn(pformat(self.object), self.colorized_strings)
@mock.patch("sys.stdout.isatty")
def test_pformat_dont_colorize(self, isatty):
@ -33,7 +43,7 @@ class TestDisplay(TestCase):
def test_pformat_old_windows(self, isatty, version):
isatty.return_value = True
version.return_value = "10.0.14392"
self.assertEqual(pformat(self.object), self.colorized_string)
self.assertIn(pformat(self.object), self.colorized_strings)
@mock.patch("sys.platform", "win32")
@mock.patch("scrapy.utils.display._enable_windows_terminal_processing")
@ -55,7 +65,7 @@ class TestDisplay(TestCase):
isatty.return_value = True
version.return_value = "10.0.14393"
terminal_processing.return_value = True
self.assertEqual(pformat(self.object), self.colorized_string)
self.assertIn(pformat(self.object), self.colorized_strings)
@mock.patch("sys.platform", "linux")
@mock.patch("sys.stdout.isatty")

View File

@ -159,7 +159,7 @@ class UtilsPythonTestCase(unittest.TestCase):
b = Obj()
# no attributes given return False
self.assertFalse(equal_attributes(a, b, []))
# not existent attributes
# nonexistent attributes
self.assertFalse(equal_attributes(a, b, ["x", "y"]))
a.x = 1

tox.ini
View File

@ -32,7 +32,7 @@ download = true
commands =
pytest --cov=scrapy --cov-report=xml --cov-report= {posargs:--durations=10 docs scrapy tests}
install_command =
pip install -U -ctests/upper-constraints.txt {opts} {packages}
python -I -m pip install -ctests/upper-constraints.txt {opts} {packages}
[testenv:typing]
basepython = python3
@ -63,8 +63,7 @@ commands =
flake8 {posargs:docs scrapy tests}
[testenv:pylint]
# reppy does not support Python 3.9+
basepython = python3.8
basepython = python3
deps =
{[testenv:extra-deps]deps}
pylint==2.15.6
@ -75,13 +74,14 @@ commands =
basepython = python3
deps =
twine==4.0.1
build==0.9.0
commands =
python setup.py sdist
python -m build --sdist
twine check dist/*
[pinned]
deps =
cryptography==3.3
cryptography==3.4.6
cssselect==0.9.1
h2==3.0
itemadapter==0.1.0
@ -106,7 +106,7 @@ deps =
setenv =
_SCRAPY_PINNED=true
install_command =
pip install -U {opts} {packages}
python -I -m pip install {opts} {packages}
[testenv:pinned]
deps =
@ -126,8 +126,7 @@ setenv =
{[pinned]setenv}
[testenv:extra-deps]
# reppy does not support Python 3.9+
basepython = python3.8
basepython = python3
deps =
{[testenv]deps}
boto
@ -135,7 +134,6 @@ deps =
# Twisted[http2] currently forces old mitmproxy because of h2 version
# restrictions in their deps, so we need to pin old markupsafe here too.
markupsafe < 2.1.0
reppy
robotexclusionrulesparser
Pillow>=4.0.0
Twisted[http2]>=17.9.0