Scrapinghub → Zyte
commit f30f53b3cc (parent 28262d4b24)

AUTHORS | 4 ++--
diff --git a/AUTHORS b/AUTHORS
@@ -1,8 +1,8 @@
 Scrapy was brought to life by Shane Evans while hacking a scraping framework
 prototype for Mydeco (mydeco.com). It soon became maintained, extended and
 improved by Insophia (insophia.com), with the initial sponsorship of Mydeco to
-bootstrap the project. In mid-2011, Scrapinghub became the new official
-maintainer.
+bootstrap the project. In mid-2011, Scrapinghub (now Zyte) became the new
+official maintainer.

 Here is the list of the primary authors & contributors:

diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
@@ -55,7 +55,7 @@ further defined and clarified by project maintainers.
 ## Enforcement

 Instances of abusive, harassing, or otherwise unacceptable behavior may be
-reported by contacting the project team at opensource@scrapinghub.com. All
+reported by contacting the project team at opensource@zyte.com. All
 complaints will be reviewed and investigated and will result in a response that
 is deemed necessary and appropriate to the circumstances. The project team is
 obligated to maintain confidentiality with regard to the reporter of an incident.
diff --git a/README.rst b/README.rst
@@ -42,10 +42,11 @@ Scrapy is a fast high-level web crawling and web scraping framework, used to
 crawl websites and extract structured data from their pages. It can be used for
 a wide range of purposes, from data mining to monitoring and automated testing.

-Scrapy is maintained by `Scrapinghub`_ and `many other contributors`_.
+Scrapy is maintained by Zyte_ (formerly Scrapinghub) and `many other
+contributors`_.

 .. _many other contributors: https://github.com/scrapy/scrapy/graphs/contributors
-.. _Scrapinghub: https://www.scrapinghub.com/
+.. _Zyte: https://www.zyte.com/

 Check the Scrapy homepage at https://scrapy.org for more information,
 including a list of features.
@@ -95,7 +96,7 @@ Please note that this project is released with a Contributor Code of Conduct
 (see https://github.com/scrapy/scrapy/blob/master/CODE_OF_CONDUCT.md).

 By participating in this project you agree to abide by its terms.
-Please report unacceptable behavior to opensource@scrapinghub.com.
+Please report unacceptable behavior to opensource@zyte.com.

 Companies using Scrapy
 ======================
diff --git a/docs/intro/install.rst b/docs/intro/install.rst
@@ -266,7 +266,6 @@ For details, see `Issue #2473 <https://github.com/scrapy/scrapy/issues/2473>`_.
 .. _setuptools: https://pypi.python.org/pypi/setuptools
 .. _homebrew: https://brew.sh/
 .. _zsh: https://www.zsh.org/
-.. _Scrapinghub: https://scrapinghub.com
 .. _Anaconda: https://docs.anaconda.com/anaconda/
 .. _Miniconda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
 .. _conda-forge: https://conda-forge.org/
diff --git a/docs/topics/deploy.rst b/docs/topics/deploy.rst
@@ -14,7 +14,7 @@ spiders come in.
 Popular choices for deploying Scrapy spiders are:

 * :ref:`Scrapyd <deploy-scrapyd>` (open source)
-* :ref:`Scrapy Cloud <deploy-scrapy-cloud>` (cloud-based)
+* :ref:`Zyte Scrapy Cloud <deploy-scrapy-cloud>` (cloud-based)

 .. _deploy-scrapyd:

@@ -32,28 +32,28 @@ Scrapyd is maintained by some of the Scrapy developers.

 .. _deploy-scrapy-cloud:

-Deploying to Scrapy Cloud
-=========================
+Deploying to Zyte Scrapy Cloud
+==============================

-`Scrapy Cloud`_ is a hosted, cloud-based service by `Scrapinghub`_,
-the company behind Scrapy.
+`Zyte Scrapy Cloud`_ is a hosted, cloud-based service by Zyte_, the company
+behind Scrapy.

-Scrapy Cloud removes the need to setup and monitor servers
-and provides a nice UI to manage spiders and review scraped items,
-logs and stats.
+Zyte Scrapy Cloud removes the need to setup and monitor servers and provides a
+nice UI to manage spiders and review scraped items, logs and stats.

-To deploy spiders to Scrapy Cloud you can use the `shub`_ command line tool.
-Please refer to the `Scrapy Cloud documentation`_ for more information.
+To deploy spiders to Zyte Scrapy Cloud you can use the `shub`_ command line
+tool.
+Please refer to the `Zyte Scrapy Cloud documentation`_ for more information.

-Scrapy Cloud is compatible with Scrapyd and one can switch between
+Zyte Scrapy Cloud is compatible with Scrapyd and one can switch between
 them as needed - the configuration is read from the ``scrapy.cfg`` file
 just like ``scrapyd-deploy``.

-.. _Scrapyd: https://github.com/scrapy/scrapyd
 .. _Deploying your project: https://scrapyd.readthedocs.io/en/latest/deploy.html
-.. _Scrapy Cloud: https://scrapinghub.com/scrapy-cloud
+.. _Scrapyd: https://github.com/scrapy/scrapyd
 .. _scrapyd-client: https://github.com/scrapy/scrapyd-client
-.. _shub: https://doc.scrapinghub.com/shub.html
 .. _scrapyd-deploy documentation: https://scrapyd.readthedocs.io/en/latest/deploy.html
-.. _Scrapy Cloud documentation: https://doc.scrapinghub.com/scrapy-cloud.html
-.. _Scrapinghub: https://scrapinghub.com/
+.. _shub: https://shub.readthedocs.io/en/latest/
+.. _Zyte: https://zyte.com/
+.. _Zyte Scrapy Cloud: https://www.zyte.com/scrapy-cloud/
+.. _Zyte Scrapy Cloud documentation: https://docs.zyte.com/scrapy-cloud.html
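The ``scrapy.cfg``-based switching this hunk refers to is driven by a ``[deploy]`` target in that file. A minimal sketch, not part of the commit; the project name and endpoint are placeholders, and the keys follow the documented scrapyd-client conventions::

    [settings]
    default = myproject.settings

    [deploy]
    url = http://localhost:6800/
    project = myproject

With such a file in place, ``scrapyd-deploy`` pushes the project to the configured Scrapyd endpoint; per the paragraph above, the same file is what lets you switch between Scrapyd and Zyte Scrapy Cloud as needed.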
diff --git a/docs/topics/logging.rst b/docs/topics/logging.rst
@@ -101,7 +101,7 @@ instance, which can be accessed and used like this::
     class MySpider(scrapy.Spider):

         name = 'myspider'
-        start_urls = ['https://scrapinghub.com']
+        start_urls = ['https://scrapy.org']

         def parse(self, response):
             self.logger.info('Parse function called on %s', response.url)
@@ -117,7 +117,7 @@ Python logger you want. For example::
     class MySpider(scrapy.Spider):

         name = 'myspider'
-        start_urls = ['https://scrapinghub.com']
+        start_urls = ['https://scrapy.org']

         def parse(self, response):
             logger.info('Parse function called on %s', response.url)
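Assembled from the two documentation fragments above, a self-contained spider that exercises both loggers might look like this (a sketch built from the surrounding docs, with the module-level logger the second hunk assumes)::

    import logging

    import scrapy

    logger = logging.getLogger(__name__)


    class MySpider(scrapy.Spider):

        name = 'myspider'
        start_urls = ['https://scrapy.org']

        def parse(self, response):
            # Built-in logger, created per spider and named after it
            self.logger.info('Parse function called on %s', response.url)
            # Custom module-level logger, as in the second hunk
            logger.info('Parse function called on %s', response.url)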
diff --git a/docs/topics/practices.rst b/docs/topics/practices.rst
@@ -63,7 +63,7 @@ project as example.
     process = CrawlerProcess(get_project_settings())

     # 'followall' is the name of one of the spiders of the project.
-    process.crawl('followall', domain='scrapinghub.com')
+    process.crawl('followall', domain='scrapy.org')
     process.start() # the script will block here until the crawling is finished

 There's another Scrapy utility that provides more control over the crawling
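For reference, the runnable script this hunk is taken from reduces to the following; it assumes it is executed from inside a Scrapy project that defines a ``followall`` spider::

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())

    # 'followall' is the name of one of the spiders of the project.
    process.crawl('followall', domain='scrapy.org')
    process.start()  # the script will block here until the crawling is finished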
@@ -244,7 +244,7 @@ Here are some tips to keep in mind when dealing with these kinds of sites:
   super proxy that you can attach your own proxies to.
 * use a highly distributed downloader that circumvents bans internally, so you
   can just focus on parsing clean pages. One example of such downloaders is
-  `Crawlera`_
+  `Zyte Smart Proxy Manager`_

 If you are still unable to prevent your bot getting banned, consider contacting
 `commercial support`_.
@@ -254,5 +254,5 @@ If you are still unable to prevent your bot getting banned, consider contacting
 .. _ProxyMesh: https://proxymesh.com/
 .. _Google cache: http://www.googleguide.com/cached_pages.html
 .. _testspiders: https://github.com/scrapinghub/testspiders
-.. _Crawlera: https://scrapinghub.com/crawlera
+.. _Zyte Smart Proxy Manager: https://www.zyte.com/smart-proxy-manager/
 .. _scrapoxy: https://scrapoxy.io/
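To make the proxy tips above concrete: Scrapy's built-in ``HttpProxyMiddleware`` picks up a per-request proxy from ``request.meta``, which is one hook that rotating-proxy setups commonly build on. A minimal sketch, with a hypothetical spider name and a placeholder proxy URL::

    import scrapy


    class ProxyExampleSpider(scrapy.Spider):
        name = 'proxy_example'  # hypothetical, for illustration only

        def start_requests(self):
            # HttpProxyMiddleware (enabled by default) routes this
            # request through the proxy given in meta
            yield scrapy.Request(
                'https://scrapy.org',
                meta={'proxy': 'http://proxy.example.com:8080'},
            )

        def parse(self, response):
            self.logger.info('Fetched %s via proxy', response.url)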
diff --git a/docs/topics/selectors.rst b/docs/topics/selectors.rst
@@ -464,10 +464,10 @@ effectively. If you are not much familiar with XPath yet,
 you may want to take a look first at this `XPath tutorial`_.

 .. note::
-    Some of the tips are based on `this post from ScrapingHub's blog`_.
+    Some of the tips are based on `this post from Zyte's blog`_.

 .. _`XPath tutorial`: http://www.zvon.org/comp/r/tut-XPath_1.html
-.. _`this post from ScrapingHub's blog`: https://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/
+.. _this post from Zyte's blog: https://www.zyte.com/blog/xpath-tips-from-the-web-scraping-trenches/


 .. _topics-selectors-relative-xpaths:
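One commonly cited tip of this kind: when matching by class, ``contains()`` is usually safer than an exact ``@class`` comparison, because elements often carry more than one class. A standalone sketch using Scrapy's ``Selector``::

    from scrapy.selector import Selector

    html = '<ul><li class="item first">foo</li><li class="item">bar</li></ul>'
    selector = Selector(text=html)

    # The exact match misses the first <li>, whose class attribute
    # is the two-class string "item first"
    print(selector.xpath('//li[@class="item"]/text()').getall())
    # ['bar']
    print(selector.xpath('//li[contains(@class, "item")]/text()').getall())
    # ['foo', 'bar']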
diff --git a/scrapy/core/downloader/handlers/http11.py b/scrapy/core/downloader/handlers/http11.py
@@ -303,11 +303,14 @@ class ScrapyAgent:
         proxyHost = to_unicode(proxyHost)
         omitConnectTunnel = b'noconnect' in proxyParams
         if omitConnectTunnel:
-            warnings.warn("Using HTTPS proxies in the noconnect mode is deprecated. "
-                          "If you use Crawlera, it doesn't require this mode anymore, "
-                          "so you should update scrapy-crawlera to 1.3.0+ "
-                          "and remove '?noconnect' from the Crawlera URL.",
-                          ScrapyDeprecationWarning)
+            warnings.warn(
+                "Using HTTPS proxies in the noconnect mode is deprecated. "
+                "If you use Zyte Smart Proxy Manager (formerly Crawlera), "
+                "it doesn't require this mode anymore, so you should "
+                "update scrapy-crawlera to 1.3.0+ and remove '?noconnect' "
+                "from the Zyte Smart Proxy Manager URL.",
+                ScrapyDeprecationWarning,
+            )
         if scheme == b'https' and not omitConnectTunnel:
             proxyAuth = request.headers.get(b'Proxy-Authorization', None)
             proxyConf = (proxyHost, proxyPort, proxyAuth)
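The migration the reworded warning points to is a settings-level change; a hedged sketch, with the middleware path and setting names as documented by scrapy-crawlera 1.3.0+ and a placeholder API key::

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_crawlera.CrawleraMiddleware': 610,
    }

    CRAWLERA_ENABLED = True
    CRAWLERA_APIKEY = '<your API key>'

With this in place the middleware manages the proxy connection itself, so the ``?noconnect`` suffix on the proxy URL is no longer needed.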