mirror of https://github.com/scrapy/scrapy.git synced 2025-03-01 15:28:29 +00:00

1273 Commits

Author SHA1 Message Date
Pablo Hoffman
e5b99a56c4 Several core changes:
Execution Manager:

* added control_reactor argument to delegate external twisted
  reactor control (for example by twistd)
* now it loads spiders (if not already loaded)
* now it starts the log (if not already started)
* removed *args from configure() method
* removed **opts from runonce and start methods

Execution engine:

* added control_reactor argument to delegate external twisted
  reactor control (for example by twistd)
* changed some functions and method names for clarity
* improved handling of exceptions in st() method
* regrouped close_domain, closed_domain, and _close_domain methods
  for legibility

Scheduler:

* replaced pending_domains_count (dict) by pending_domains (set)
* simplified some doc
2009-06-15 19:44:26 -03:00
Pablo Hoffman
3c919f2562 Several core changes:
Execution Manager:

* added control_reactor argument to delegate external twisted
  reactor control (for example by twistd)
* now it loads spiders (if not already loaded)
* now it starts the log (if not already started)
* removed *args from configure() method
* removed **opts from runonce and start methods

Execution engine:

* added control_reactor argument to delegate external twisted
  reactor control (for example by twistd)
* changed some functions and method names for clarity
* improved handling of exceptions in st() method
* regrouped close_domain, closed_domain, and _close_domain methods
  for legibility

Scheduler:

* replaced pending_domains_count (dict) by pending_domains (set)
* simplified some doc
2009-06-15 19:40:56 -03:00
Pablo Hoffman
5e3ef5a2fd item pipeline: added check for domain not already closed 2009-06-15 18:59:40 -03:00
Pablo Hoffman
aeb9734a80 downloader: made log message visible only when debug_mode is on 2009-06-15 18:58:37 -03:00
Pablo Hoffman
ff76f46d5a removed noisy comment and moved import to the top 2009-06-15 18:55:09 -03:00
Pablo Hoffman
1d8cec63d1 scrapy.log: check if twisted log started before 2009-06-15 18:50:47 -03:00
daniel
a8d430b4dd httpcache: add domain to logging message 2009-06-15 12:35:42 -03:00
Pablo Hoffman
fd0e490157 added StatsMailer extension 2009-06-12 15:38:21 -03:00
Pablo Hoffman
7c2476bb25 fixed a couple of bugs caused by adding priority to Requests (thanks Artem for reporting) 2009-06-12 08:31:30 -03:00
Pablo Hoffman
4a1a01354b Added 'priority' attribute to Requests and removed old 'priority' argument passed through engine, scheduler and scheduler middleware calls 2009-06-11 22:25:47 -03:00
Pablo Hoffman
962dbeba88 fixed typo in docstring 2009-06-11 08:33:01 -03:00
Pablo Hoffman
e55158ebdd Merged olveyra's patch 2009-06-10 18:00:32 -03:00
Pablo Hoffman
635ac1ca64 Simplified domain prioritizers, so that they don't receive domains in the
constructor (domain prioritizers will be refactored later anyway) and
simplified Scrapy Manager code thanks to this.

Added make_request_from_url method to BaseSpider, splitting out the functionality
to create requests from URLs, which was previously all done in start_requests.
2009-06-10 14:21:36 -03:00
Pablo Hoffman
a74b0b1764 additional simplification of OffsiteMiddleware 2009-06-09 13:09:35 -03:00
Pablo Hoffman
eca05c9e12 OffsiteMiddleware: removed logging and simplified implementation 2009-06-09 12:37:15 -03:00
molveyra
6524def4b8 don't check guid in RobustScrapedItem.validate. Instead, raise
NotImplemented.
2009-06-04 10:44:40 -03:00
Daniel Grana
87fbc9c58c spidermw: add domain name to warning about missing callbacks in requests 2009-05-28 21:47:41 -03:00
Daniel Grana
727e67af5e spidermw: ignore and warn about requests without callback returned by spiders 2009-05-28 21:41:02 -03:00
Daniel Grana
cfafa01109 spidermw: check for __iter__ instead of calling iter(), which would let a string pass as an iterable 2009-05-28 21:10:30 -03:00
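Note on cfafa01109: a minimal sketch of the kind of check the message describes, not the actual spider middleware code. The function name `is_result_iterable` is hypothetical. In Python 2, which Scrapy targeted at the time, `str` had no `__iter__` attribute, so an attribute check alone rejected strings; in Python 3 `str` does define `__iter__`, so this sketch excludes strings explicitly.

```python
def is_result_iterable(obj):
    """Return True if obj looks like an iterable of results, but not a string.

    Hypothetical helper for illustration only.
    """
    # Strings are iterable character by character, which is almost never
    # what a spider callback means to return, so reject them up front.
    if isinstance(obj, (str, bytes)):
        return False
    # Lists, tuples, generators and sets all expose __iter__.
    return hasattr(obj, "__iter__")


if __name__ == "__main__":
    for candidate in (["a", "b"], "http://example.com", None):
        print(repr(candidate), is_result_iterable(candidate))
```

Calling `iter(obj)` instead would succeed for a string, which is the failure mode the commit message refers to.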
Pablo Hoffman
0f690b03dc added deprecation warning to ErrorPages downloader middleware 2009-05-28 13:57:25 -03:00
Pablo Hoffman
1aac694343 updated settings doc 2009-05-28 13:52:56 -03:00
Pablo Hoffman
04e7f8f5f6 merged with Daniel's HttpException-removal branch 2009-05-28 13:45:26 -03:00
Daniel Grana
abda5edf09 decompressionmw: don't try to decompress empty responses 2009-05-28 09:31:43 -03:00
Daniel Grana
85dbdf5789 finally remove HttpException
in this changeset:
* remove HttpException from engine and core exceptions
* replace dwmw ErrorPages with spidermw HttpError
* bugfix image pipeline media_to_download method when stat_key returns None
2009-05-28 09:30:31 -03:00
Daniel Grana
0e5bea67fd images: adapt images pipeline to recent changes on HttpException topic 2009-05-28 00:27:42 -03:00
Daniel Grana
7eaa3ed24d stop raising HttpException at download handlers and adapt download middlewares 2009-05-27 16:51:36 -03:00
Daniel Grana
c8827552b6 fix typo in the documented default value of the WEBCONSOLE_ENABLED setting. thanks dzen 2009-05-26 15:48:34 -03:00
Pablo Hoffman
89950af834 cluster: fixed KeyError when crawler process failed to start 2009-05-25 23:45:10 -03:00
Pablo Hoffman
6d1ffa7137 renamed CrawlDebug downloader middleware to DebugMiddleware 2009-05-25 20:14:50 -03:00
Pablo Hoffman
b1dad251ae Deprecated Common Downloader Middleware and added DefaultHeaders Downloader
Middleware
2009-05-25 14:41:06 -03:00
Pablo Hoffman
90d408b04f Some changes to HTTP cache middleware:
* documented
* moved from scrapy.contrib.downloadermiddleware.cache.CacheMiddleware to
  scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware
* settings prefix changed from CACHE2_ to HTTPCACHE_

--HG--
rename : scrapy/contrib/downloadermiddleware/cache.py => scrapy/contrib/downloadermiddleware/httpcache.py
2009-05-24 19:13:06 -03:00
Pablo Hoffman
19f2992b26 applied Patrick's patch: test_storedb: add base class for both mysql tests 2009-05-23 18:31:54 -03:00
Daniel Grana
dae0b1973b aws: missing import 2009-05-22 13:21:46 -03:00
Daniel Grana
4efcf78a4a aws: take AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from environment just like boto does 2009-05-22 13:14:16 -03:00
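Note on 4efcf78a4a: a rough sketch of falling back to the two environment variables named in the message when no credentials are passed explicitly. The function `get_aws_credentials` and its `environ` parameter are assumptions made for illustration; they are not taken from the Scrapy or boto sources.

```python
import os


def get_aws_credentials(access_key=None, secret_key=None, environ=None):
    """Return (access_key, secret_key), falling back to the environment.

    Hypothetical helper: explicit arguments win, otherwise the
    AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY variables are consulted.
    """
    # Allow a mapping to be injected so the lookup is easy to test.
    environ = os.environ if environ is None else environ
    access_key = access_key or environ.get("AWS_ACCESS_KEY_ID")
    secret_key = secret_key or environ.get("AWS_SECRET_ACCESS_KEY")
    return access_key, secret_key


if __name__ == "__main__":
    fake_env = {"AWS_ACCESS_KEY_ID": "id", "AWS_SECRET_ACCESS_KEY": "secret"}
    print(get_aws_credentials(environ=fake_env))
```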
Ismael Carnales
3955844115 Removed FieldValueError in favour of ValueError 2009-05-21 15:01:48 +00:00
Ismael Carnales
c03e246002 Added DateTimeField 2009-05-21 14:57:52 +00:00
Ismael Carnales
d5f0cae776 New implementation of Field and MultiValuedField 2009-05-21 14:55:14 +00:00
Ismael Carnales
0cc289ac84 New and simpler implementation of BooleanField 2009-05-21 14:51:50 +00:00
Ismael Carnales
55d922a4b0 Fixed BooleanField default value 2009-05-21 14:50:35 +00:00
Ismael Carnales
1ffe64dab3 Added test for newitem fields 2009-05-21 14:48:43 +00:00
Pablo Hoffman
48bfd3fe4b renamed old setting 2009-05-20 02:15:31 -03:00
Pablo Hoffman
befd28eef4 docs/tutorial: added reminder about adding pipeline to ITEM_PIPELINES settings (thanks jamie) 2009-05-20 00:57:44 -03:00
Pablo Hoffman
04610a25dc fixed bug in tutorial regarding csv writer pipeline, and other minor corrections 2009-05-19 03:07:08 -03:00
Daniel Grana
abfc52cd17 docs: modify install document to mercurial based installation instructions 2009-05-19 01:50:44 -03:00
Pablo Hoffman
13bb9934f9 moved htmlparser and lxml based link extractors to scrapy.contrib.linkextractors, with the rest of the link extractors 2009-05-18 23:06:27 -03:00
Pablo Hoffman
c161c29e08 simplified some scrapy.log implementation code 2009-05-18 21:32:17 -03:00
Pablo Hoffman
a8a3de17ef removed unused line 2009-05-18 21:11:03 -03:00
Pablo Hoffman
b87734341d fixed docstring 2009-05-18 20:59:26 -03:00
Pablo Hoffman
59e504a003 removed code from scrapy.link to avoid cyclic imports from scrapy.contrib.linkextractors.sgml 2009-05-18 19:27:51 -03:00
Pablo Hoffman
86498abdf1 Sorted out Link Extractors organization by moving all of them to
scrapy.contrib.linkextractors.

The most relevant being:
    scrapy.link.extractors.RegexLinkExtractor

which was moved to:
    scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor

The old location still works but throws a deprecation warning. It will be
removed before the 0.7 release.

Documentation and tests were also updated.

Also, in this changeset, a new regex-based link extractor was added to
scrapy.contrib.linkextractors.regex.

--HG--
rename : scrapy/tests/sample_data/link_extractor/regex_linkextractor.html => scrapy/tests/sample_data/link_extractor/sgml_linkextractor.html
rename : scrapy/tests/test_link.py => scrapy/tests/test_contrib_linkextractors.py
2009-05-18 19:19:37 -03:00
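Note on 86498abdf1: one common way to keep an old name working while warning about it, shown as a sketch only. Both classes below are stand-ins defined in this example (they do not reproduce Scrapy's link extractors), and the warning text is made up for illustration; how the changeset actually emits the warning is not shown in the log.

```python
import warnings


class SgmlLinkExtractor:
    """Stand-in for the class at its new location."""

    def __init__(self, allow=()):
        self.allow = allow


class RegexLinkExtractor(SgmlLinkExtractor):
    """Stand-in for the old name, kept as a thin deprecated alias."""

    def __init__(self, *args, **kwargs):
        # stacklevel=2 attributes the warning to the caller's line,
        # not to this shim.
        warnings.warn(
            "RegexLinkExtractor has moved; use SgmlLinkExtractor instead",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)


if __name__ == "__main__":
    warnings.simplefilter("always")
    extractor = RegexLinkExtractor(allow=("item",))
    print(extractor.allow)
```

Subclassing keeps `isinstance` checks against the new class working for code that still instantiates the old name.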