mirror of https://github.com/scrapy/scrapy.git synced 2025-03-01 16:27:44 +00:00

5111 Commits

Author SHA1 Message Date
Mikhail Korobov
1740fcf1a6 DOC SignalManager docstrings. See GH-713.
This change is not 100% backwards compatible because of *args changes.
Their usage was not documented, so we're not breaking public interface.
2015-06-08 21:05:58 +05:00
Mikhail Korobov
9a787893e3 (backwards-incompatible) allow passing settings=None to configure_logging
* use explicit argument for disabling root handler;
* handle LOG_STDOUT even if install_root_handler is False
2015-06-08 19:54:18 +05:00
Mikhail Korobov
3cbf8a0b2b extract CrawlerRunner._crawl method which always expects Crawler
It provides an extension point where crawler instance is available;
it should make it easier to write alternative CrawlerRunner.crawl
implementations.

See also: https://github.com/scrapy/scrapy/pull/1256
2015-06-08 18:35:44 +05:00
Pawel Miech
e575f44446 [settings/default_settings.py] don't retry 400
As in HTTP specs:

"10.4.1 400 Bad Request

The request could not be understood by the server due to malformed
syntax. The client SHOULD NOT repeat the request without
modifications."

Scrapy should not retry 400 by default.
2015-06-08 10:52:42 +02:00
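The reasoning in the commit above can be illustrated with a minimal sketch. It is not Scrapy's retry middleware: `should_retry` is a hypothetical helper, and the exact contents of the code list are an assumption made for illustration. The only point taken from the commit message is that 400 is left out, because repeating an unmodified malformed request cannot succeed.

```python
# Hypothetical, simplified retry policy; not Scrapy's actual middleware.
# The list below is an assumed example; what matters is that 400 is absent.
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]


def should_retry(status, retry_codes=RETRY_HTTP_CODES):
    """Return True if re-sending the same request unchanged may succeed."""
    return status in retry_codes


# A transient server error is worth retrying; a malformed request is not.
print(should_retry(503))
print(should_retry(400))
```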
Daniel Graña
87293965db Merge pull request #1285 from scrapy/optional-settings-arguments
make it easier to use default settings
2015-06-07 20:29:24 -03:00
Chris Nilsson
61dec83f70 Moved default value of MEMUSAGE_CHECK_INTERVAL_SECONDS to default_settings 2015-06-06 11:19:29 +10:00
Chris Nilsson
0c532baf4c Removed typo, and clarified time unit of setting 2015-06-06 11:18:13 +10:00
Mikhail Korobov
d047665c02 make "settings" argument optional for Crawler, CrawlerRunner and CrawlerProcess 2015-06-06 03:23:13 +05:00
Mikhail Korobov
64399d18d8 Stop reactor on Ctrl-C regardless of 'stop_after_crawl'. Fixes GH-1279. 2015-06-06 02:53:36 +05:00
Mikhail Korobov
33d145e2f5 CrawlerProcess cleanup
* remove unneeded lambda;
* extract _get_dns_resolver method and format code to pep8.
2015-06-06 02:49:39 +05:00
Julia Medina
24d8a85269 Update release notes for 1.0.0rc2
(cherry picked from commit 6e61d54168cf471363be3e7e54d75ad544b9f6e1)
2015-06-05 17:11:40 -03:00
Chris Nilsson
eae25a04d9 Added MEMUSAGE_CHECK_INTERVAL_SECONDS to Memory usage extension options.
Kept the default as it was, at 60.0 seconds. But added a setting to
allow this to be changed as desired.
2015-06-06 00:39:14 +10:00
Daniel Graña
d9bcd48606 Merge pull request #1278 from Curita/remove-tz-aware-logformat
Remove deprecated %z formatting from the default LOG_DATEFORMAT
2015-06-04 13:39:01 -03:00
Julia Medina
367ea81e71 Remove deprecated %z formatting from the default LOG_DATEFORMAT 2015-06-04 04:11:23 +08:00
Mikhail Korobov
f312ffcb54 Merge pull request #1276 from scrapy/fix-spider-settings
Fix Spider.custom_settings
2015-06-03 22:14:04 +05:00
Mikhail Korobov
d42c420a6d fixed spider custom_settings
https://github.com/scrapy/scrapy/pull/1128 moved spidercls.update_settings
call to a later stage; this commit moves it back.
2015-06-03 04:29:10 +05:00
Mikhail Korobov
cc2f3e1b46 TST a test case to show custom_settings doesn't always work 2015-06-03 04:26:20 +05:00
Daniel Graña
d52cf8bb03 Merge pull request #1267 from Curita/fix-1265
Fix #1265
2015-06-01 20:31:46 -03:00
Julia Medina
ffc7b7fd6c Add helper to update deprecated class paths 2015-06-01 17:01:33 -03:00
Ally Weir
bd2fe996aa Spelling correction
incorrect use of "too" instead of "to"
2015-06-01 20:47:22 +05:00
Julia Medina
9d1cf230ed Merge pull request #1268 from scrapy/crawlerprocess-dict-settings
fixed CrawlerProcess when settings are passed as dicts
2015-06-01 12:35:54 -03:00
Marven Sanchez
8771d1f79b Update HTTPCache middleware docs 2015-06-01 18:20:59 +08:00
Marven Sanchez
bb3ebf13f9 Add tests for RFC2616 policy enhancements
Add `scrapy/downloadermiddlewares/httpcache.py` to `tests/py3-ignores.txt`
2015-06-01 18:20:12 +08:00
Jamey Sharp
1991550442 Allow client to bound max-age for revalidation.
Unlike specifying "Cache-Control: no-cache", if the request specifies
"max-age=0", then the cached validators will be used if possible to
avoid re-fetching unchanged pages.

That said, it's still useful to be able to specify "no-cache" on the
request, in cases where the origin server may have changed page contents
without changing validators.
2015-06-01 18:06:36 +08:00
Jamey Sharp
c3b2cabf6c Allow setting RFC2616Policy to cache unconditionally.
A spider may wish to have all responses available in the cache, for
future use with "Cache-Control: max-stale", for instance. The
DummyPolicy caches all responses but never revalidates them, and
sometimes a more nuanced policy is desirable.

This setting still respects "Cache-Control: no-store" directives in
responses. If you don't want that, filter "no-store" out of the
Cache-Control headers in responses you feed to the cache middleware.
2015-06-01 18:06:35 +08:00
Jamey Sharp
e23a381337 Let spiders ignore bogus Cache-Control headers.
Sites often set "no-store", "no-cache", "must-revalidate", etc., but get
upset at the traffic a spider can generate if it respects those
directives.

Allow the spider's author to selectively ignore Cache-Control directives
that are known to be unimportant for the sites being crawled.

We assume that the spider will not issue Cache-Control directives in
requests unless it actually needs them, so directives in requests are
not filtered.
2015-06-01 18:06:35 +08:00
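A rough sketch of the idea in the commit above: selected Cache-Control directives are dropped from a response before the cache policy looks at them. The helper names `parse_cache_control` and `filter_response_directives` are hypothetical and exist only for this illustration; they are not the middleware's real code, and the parsing is deliberately simplistic (it does not handle quoted values containing commas).

```python
def parse_cache_control(header):
    """Parse a Cache-Control header value into a {directive: value} dict."""
    directives = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        # Directives without a value (e.g. "no-store") map to None.
        directives[name.strip().lower()] = value.strip() or None
    return directives


def filter_response_directives(header, ignored):
    """Drop the directives the spider author chose to ignore."""
    ignored = {name.lower() for name in ignored}
    parsed = parse_cache_control(header)
    return {name: value for name, value in parsed.items() if name not in ignored}


# Only response headers are filtered; request directives are left untouched.
print(filter_response_directives("no-store, max-age=3600", ["no-store"]))
```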
Jamey Sharp
dd3a46295c Support "Cache-Control: max-stale" in requests.
This allows spiders to be configured with the full RFC2616 cache policy,
but avoid revalidation on a request-by-request basis, while remaining
conformant with the HTTP spec.
2015-06-01 18:06:35 +08:00
Jamey Sharp
4446baae33 Use cached responses if revalidation errors out. 2015-06-01 18:06:35 +08:00
Mikhail Korobov
aa6a72707d fixed CrawlerProcess when settings are passed as dicts
See https://github.com/scrapy/scrapy/pull/1156
2015-05-30 06:59:15 +05:00
Mikhail Korobov
342cb622f1 DOC fix non-working link (by removing it).
See https://github.com/scrapy/scrapy/pull/1260
2015-05-27 23:04:58 +05:00
Julia Medina
343d20d791 Update 1.0 release notes 2015-05-27 11:53:54 -03:00
Julia Medina
62a6eff218 Merge pull request #1259 from chekunkov/log-counter-handler-is-never-removed
[MRG +1] LogCounterHandler is never removed from root handlers list, fix that
2015-05-27 11:42:19 -03:00
Julia Medina
26f50d3f43 Extend regex for tags that deploy to PyPI to support new release cycle 2015-05-27 09:17:18 -03:00
Alexander Chekunkov
b2765aabd8 LogCounterHandler is never removed from root handlers list, fix that
The lambda is garbage collected because receivers are added as weak references by default, so logging.root.removeHandler is not executed when signals.engine_stopped is fired. Fixed by assigning the lambda to a private attribute rather than by using connect(..., weak=False), because I believe this lambda function should be collected together with the crawler object.
2015-05-27 13:52:47 +07:00
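The weak-reference behaviour described in the commit above can be reproduced with the standard library alone. This is a sketch under assumptions: `Owner` and `make_receiver` are hypothetical stand-ins for the crawler object and its receiver, and the immediate collection shown here relies on CPython's reference counting.

```python
import gc
import weakref


def make_receiver():
    """Return a fresh lambda, standing in for a signal receiver."""
    return lambda: "handler removed"


# Case 1: nothing but a weak reference points at the lambda,
# so it is collected and the receiver can never be called.
weak_only = weakref.ref(make_receiver())
gc.collect()
lost = weak_only() is None


class Owner:
    """Hypothetical stand-in for the object that owns the receiver."""

    def __init__(self):
        # Storing the lambda on the instance keeps it alive
        # for exactly as long as the owner itself is alive.
        self._receiver = make_receiver()


# Case 2: the owner holds a strong reference, the signal side a weak one.
owner = Owner()
held = weakref.ref(owner._receiver)
gc.collect()
alive_with_owner = held() is not None

# Once the owner goes away, the receiver is collected with it.
del owner
gc.collect()
collected_with_owner = held() is None

print(lost, alive_with_owner, collected_with_owner)
```

Keeping the strong reference on the owning object, instead of asking the signal machinery to hold one, ties the receiver's lifetime to its owner, which is the trade-off the commit message argues for.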
Daniel Graña
5ee08865d6 Merge pull request #1258 from chekunkov/crawler-process-stopping-is-no-more
[MRG+1] Remove CrawlerProcess.stopping as it isn't used any more
2015-05-26 15:32:24 -03:00
Alexander Chekunkov
b0ea3e38d1 remove CrawlerProcess.stopping as it isn't used any more 2015-05-26 17:37:16 +07:00
Pablo Hoffman
545c4224f9 update old crawlera link 2015-05-25 16:01:54 -03:00
Daniel Graña
ebe889a663 Unquote request path before passing to FTPClient, it already escapes paths 2015-05-23 20:50:30 -03:00
Daniel Graña
3545468389 Merge branch 'deferdelay' 2015-05-23 18:09:20 -03:00
Daniel Graña
d439c26d76 update docstring and release notes 2015-05-22 20:00:58 -03:00
Alexey Vishnevsky
27ce3225bd Make Scrapy more async by letting the reactor spend another couple of cycles to accomplish its needs. 2015-05-22 17:05:19 -03:00
Julia Medina
4b2763c6f9 Bump version: 1.0.0rc1 → 1.1.0dev1 2015-05-22 13:24:50 -03:00
Julia Medina
de6d232a02 Bump version: 0.25.1 → 1.0.0rc1 1.0.0rc1 2015-05-22 13:24:27 -03:00
Julia Medina
29529e5e8e Merge pull request #1244 from Curita/1.0-release-notes
1.0 release notes
2015-05-22 13:21:17 -03:00
Julia Medina
600164594c New release cycle in .bumpversion.cfg
1.0.0dev1 -> 1.0.0rc1 -> 1.0.0 -> 1.1.0dev1 -> ...
2015-05-22 12:59:21 -03:00
Julia Medina
afcf70cdc6 Add 1.0 release notes 2015-05-22 12:53:11 -03:00
Mikhail Korobov
cc2258b2bb Merge pull request #1145 from bosnj/master
[MRG+1] default return value for extract_first
2015-05-21 22:03:54 +05:00
Daniel Graña
58717472f7 Merge pull request #1250 from chekunkov/scrapy-log-fix-incompatible-change
[MRG+1] Keep level_names in scrapy.log for backwards compatibility
2015-05-21 10:46:39 -03:00
Alexander Chekunkov
795ca3945f keep level_names in scrapy.log for backwards compatibility 2015-05-21 08:56:44 +00:00
Daniel Graña
ee59112480 Merge pull request #1224 from scrapy/fix-empty-feed-export-fields
[MRG] fixed FEED_EXPORT_FIELDS handling (see #1223)
2015-05-19 16:36:05 -03:00