mirror of https://github.com/scrapy/scrapy.git synced 2025-02-25 10:24:14 +00:00

1975 Commits

Author SHA1 Message Date
Rolando Espinoza La fuente
35a7059636 cleanup and refactor of parse & fetch commands
* removed scrapy.utils.fetch
 * each command schedules its requests and starts the scrapy engine
 * fetch command instantiates BaseSpider if the given url matches no spider or more than one
 * parse command schedules the url if exactly one spider matches
 * parse and fetch don't support multiple urls as parameters
 * the --spider option (force spider behavior) moved from BaseCommand to the fetch, parse and crawl commands only
2010-04-01 17:16:38 -03:00
Rolando Espinoza La fuente
dd477914db spidermanager refactoring
* Implements find/create method in Spider Manager API, removed fromdomain and fromurl

    This method is now in charge of spider resolution, it must return spider object
    from its argument or raise KeyError if no spider is found.

    This method obsoletes from_domain and from_url methods.

    The default implementation of resolve only searches against spider.name, it
    won't use spider.allowed_domains like the old fromdomain. This is why you
    must supply a spider if you want to crawl a url.

    Find methods return only available spider names, not spider instances.
    If no spider is found, they return an empty list.

Affected modules:
    * command.models (force_domain)
        * removed spiders.force_domain
    * each command passes the spider to crawl_* commands
    * command.commands.*
        * crawl
            * set spider from opts.spider if arg is url
            * group urls by spider to instantiate each spider just once
        * genspider
            * use spiders.create() to check spider id
        * parse
            * log error if more than one spider found
    * core.manager
        * on crawl_* log message if multiple spiders found for url or request
    * shell
        * prints "Multiple found" if more than one spider found for url or request
        * populate_vars(): added spider keyword parameter

    * contrib.spidermanager:
        * removed fromdomain() & fromurl()
        * new create(spider_id) -> Spider. Raises KeyError if spider not found
        * new find_by_request(request) -> list(spiders)
2010-04-01 17:16:38 -03:00
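The resolution contract described in commit dd477914db (create() returns a spider or raises KeyError, find methods return a list of spider names, possibly empty) can be illustrated with a minimal sketch. The Spider, Request and SpiderManager classes below are hypothetical stand-ins written for this illustration, not Scrapy's actual classes, and the domain-matching rule inside find_by_request() is an assumption.

```python
from urllib.parse import urlparse


class Spider:
    """Hypothetical minimal spider: a name plus the domains it handles."""

    def __init__(self, name, allowed_domains=()):
        self.name = name
        self.allowed_domains = tuple(allowed_domains)


class Request:
    """Hypothetical minimal request that only carries a URL."""

    def __init__(self, url):
        self.url = url


class SpiderManager:
    """Illustrative stand-in for the spider manager API in the commit."""

    def __init__(self, spiders):
        self._spiders = {spider.name: spider for spider in spiders}

    def create(self, spider_id):
        # Resolve by name only; an unknown name raises KeyError.
        return self._spiders[spider_id]

    def find_by_request(self, request):
        # Return spider names (not instances); empty list if none match.
        # Matching on allowed_domains is an assumption of this sketch.
        host = urlparse(request.url).hostname or ""
        return [
            name
            for name, spider in self._spiders.items()
            if any(host == d or host.endswith("." + d)
                   for d in spider.allowed_domains)
        ]
```

A caller would treat a single-element result as an unambiguous match and log a message when the list holds more than one name, as the shell and core.manager entries above describe.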
Rolando Espinoza La fuente
8db67b17a3 scrapy manager refactor
* ExecutionManager
    * deprecated runonce(*args)
    * changed start() to start(keep_alive=Bool)
    * changed crawl(*args) to crawl(requests, spider=None)
        * if no spider given, tries to resolve spider
          for each request
    * added crawl_url(url, spider=None)
    * added crawl_request(request, spider=None)
    * added crawl_domain(domain)
    * added crawl_spider(spider)
 * updated commands: crawl, runspider, start
 * updated webconsole
 * updated crawler
 * updated tests.test_engine
 * updated utils.fetch
2010-04-01 17:16:38 -03:00
Pablo Hoffman
32f9c5fe68 removed old untested (and probably broken) code 2010-04-01 04:05:53 -03:00
Pablo Hoffman
4dc886e319 Improved comment 2010-03-31 18:26:35 -03:00
Pablo Hoffman
83d5eff0b7 More refactoring to encoding handling in TextResponse and subclasses 2010-03-31 18:21:41 -03:00
Pablo Hoffman
de896fa62d Refactored implementation of Request.replace() and Response.replace() 2010-03-31 16:29:53 -03:00
Pablo Hoffman
2ed8a5bfb5 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-03-27 13:25:06 -03:00
Pablo Hoffman
2f75839e7a Ignore noisy Twisted deprecation warnings 2010-03-27 13:23:13 -03:00
Pablo Hoffman
2299deda66 updated wrong link in doc 2010-03-26 14:02:33 -03:00
Pablo Hoffman
7cf2f87e27 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-03-26 08:29:34 -03:00
Pablo Hoffman
f19c939925 fixed doc typo 2010-03-26 08:28:32 -03:00
Daniel Grana
996a1b3574 fix handling of relative base urls in get_base_url util
--HG--
extra : rebase_source : eb552219e6bf40bc0d2e35968c367105233b6ecc
2010-03-25 15:50:34 -03:00
Pablo Hoffman
1330697c3d Some improvements to Response encoding support:
* added encoding aliases, configurable through a new ENCODING_ALIASES setting
* Response.encoding now returns the real encoding detected for the body
* simplified TextResponse API by removing body_encoding() and
  headers_encoding() methods
* Response.encoding now tries to infer the encoding from the body always (it
  was done before only on HtmlResponse and TextResponse)
* removed scrapy.utils.encoding.add_encoding_alias() function
* updated implementation of scrapy.utils.response function to reflect these API
  changes
* updated documentation to reflect API changes
2010-03-25 15:47:10 -03:00
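The encoding aliases mentioned in commit 1330697c3d can be pictured as a lookup table consulted before the codec is resolved. The sketch below is an assumption about how such a table might be applied. The resolve_encoding() helper and the alias values are hypothetical, and only the ENCODING_ALIASES setting name comes from the commit message.

```python
import codecs

# Hypothetical alias table; the real setting's contents are not shown
# in the commit message.
ENCODING_ALIASES = {
    "gb2312": "gb18030",
    "zh-cn": "gb18030",
}


def resolve_encoding(declared, aliases=ENCODING_ALIASES):
    """Map a declared encoding to a codec name Python knows, or None."""
    name = declared.strip().lower()
    # Apply the alias first, then let the codec registry normalize it.
    name = aliases.get(name, name)
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Unknown encoding: the caller can fall back to body inference.
        return None
```

Returning None for an unknown name leaves room for the caller to infer the encoding from the body instead.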
Daniel Grana
173e94386b Support relative url used in base tag. closes #148
--HG--
extra : rebase_source : 1bff87c127a7e9d8d12c772b3068feb11eb5d97f
2010-03-25 12:38:37 -03:00
Pablo Hoffman
9ddcd1095d sort setting alphabetically 2010-03-25 11:45:06 -03:00
Pablo Hoffman
cb49567ca6 Removed wrong line added in previous commit 2010-03-24 12:15:18 -03:00
Pablo Hoffman
45411926b5 Improved encoding support by explicitly passing encoding to all str_to_unicode() and unicode_to_str() calls 2010-03-24 12:14:07 -03:00
Pablo Hoffman
4fa833c849 Added LOG_ENCODING setting 2010-03-24 12:13:38 -03:00
Pablo Hoffman
87e68e7438 Made MailSender non IO-blocking, and improved MailSender documentation 2010-03-22 13:37:37 -03:00
Pablo Hoffman
1dfc79b5d0 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-03-20 20:48:11 -03:00
Pablo Hoffman
99a876754c Improved "What else?" section of "Scrapy at a glance" overview 2010-03-20 20:24:18 -03:00
Pablo Hoffman
264cd2e035 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-03-19 10:32:42 -03:00
Pablo Hoffman
234fd709ad fixed doc typo (thanks Victor) 2010-03-19 10:32:17 -03:00
Daniel Grana
184cf6684f Remove HttpException references from docs. Since 0.7, scrapy returns non-200 as Response objects and does not raise HttpException anymore 2010-03-18 10:05:33 -03:00
Pablo Hoffman
403a21ec74 removed obsolete scrapy.crawler module 2010-03-12 17:28:33 -02:00
Daniel Grana
17091902f3 Explicitly say where to save item class in "Defining our item" section of tutorial 2010-03-12 14:12:49 -02:00
Pablo Hoffman
54ae2c36d0 better implementation of open_in_browser() tests 2010-03-12 10:19:50 -02:00
Pablo Hoffman
38a296aa2c Added tests to open_in_browser() function 2010-03-12 09:52:39 -02:00
Pablo Hoffman
2ab94d75e2 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-03-12 09:32:35 -02:00
Pablo Hoffman
c5cd8b9d3d Fixed bug in open_in_browser() function with Python 2.5 (closes #145). 2010-03-12 09:31:05 -02:00
Pablo Hoffman
39e4df0cff removed unmaintained (and untested) contrib_exp ShoveItemPipeline 2010-03-10 00:10:36 -02:00
Pablo Hoffman
a505a9d490 minor code refactoring on scrapy.command.cmdline module 2010-03-04 11:09:16 -02:00
Pablo Hoffman
4c1ec0c97e replaced hacky command_executed dict by standard signal 2010-03-04 10:58:18 -02:00
Pablo Hoffman
861f9691c7 removed partly-obsolete module scrapy.contrib.groupsettings 2010-03-04 10:40:41 -02:00
Pablo Hoffman
d12cd22d5e switched default scheduler order to DFO, which consumes less memory by default 2010-03-04 10:15:58 -02:00
Daniel Grana
700be3202b Automated merge with ssh://hg.scrapy.org/scrapy-0.8 2010-02-24 15:44:41 -02:00
Daniel Grana
2322322ee6 Add missing priority and errback arguments to Request.replace method signature 2010-02-24 15:43:09 -02:00
Pablo Hoffman
180c091fb2 Fixed encoding issue (reported in #135) when the encoding declared in the HTTP header is unknown. This is the patch proposed by Rolando, with an update to the Request/Response documentation. 2010-02-24 14:01:29 -02:00
Pablo Hoffman
bbef0fe870 Automated merge with http://hg.scrapy.org/users/rolando/scrapy/ 2010-02-20 11:12:37 -02:00
Rolando Espinoza La fuente
7b1ad321e3 examples/experimental: added imdb top movies spider 2010-02-19 21:31:17 -04:00
Pablo Hoffman
cb99edd153 simplified and improved AUTHORS file 2010-02-19 23:16:55 -02:00
Pablo Hoffman
a3d22c7240 Automated merge with http://hg.scrapy.org/scrapy-0.8/ 2010-02-19 23:11:24 -02:00
Pablo Hoffman
60961e5499 minor documentation fix (refs #135) 2010-02-19 23:09:48 -02:00
Pablo Hoffman
c1f8198639 Added RANDOMIZE_DOWNLOAD_DELAY setting 2010-02-19 21:53:18 -02:00
Rolando Espinoza La fuente
4a053a762f examples/experimental: added gooledir crawler 2010-02-19 18:28:16 -04:00
Rolando Espinoza La fuente
a6a3f085a7 docs: added crawlspider v2 outline documentation
Sign-Off: Rolando Espinoza La fuente
2010-02-19 18:22:38 -04:00
Rolando Espinoza La fuente
17d1543929 contrib_exp: added crawlspider v2 package + tests
Sign-Off: Rolando Espinoza La fuente
2010-02-19 18:19:01 -04:00
Rolando Espinoza La fuente
7ddd4441e3 utils.python: added equal_attributes() to compare arbitrary attributes of two objects
Sign-Off: Rolando Espinoza La fuente
2010-02-19 17:57:48 -04:00
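A helper like the equal_attributes() named in commit 7ddd4441e3 might look roughly like the sketch below. The signature and the handling of missing attributes and empty attribute lists are assumptions made for illustration, not the code from the commit.

```python
def equal_attributes(obj1, obj2, attributes):
    """Return True if both objects agree on every named attribute.

    Assumptions of this sketch: `attributes` is an iterable of attribute
    names, an empty list compares as unequal, and an attribute missing
    from either object counts as a mismatch.
    """
    attributes = list(attributes)
    if not attributes:
        return False

    missing = object()  # sentinel so that a stored None is a valid value
    for attr in attributes:
        value1 = getattr(obj1, attr, missing)
        value2 = getattr(obj2, attr, missing)
        if value1 is missing or value2 is missing or value1 != value2:
            return False
    return True
```

Comparing only a chosen subset of attributes lets two objects count as equivalent even when unrelated fields differ.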
Rolando Espinoza La fuente
7235040936 merged upstream 2010-02-19 17:41:45 -04:00