mirror of https://github.com/scrapy/scrapy.git synced 2025-02-24 23:04:14 +00:00

1987 Commits

Author SHA1 Message Date
Daniel Grana
68a875edb0 update ENCODING_ALIASES setting default value in settings documentation topic 2010-04-07 10:54:54 -03:00
Daniel Grana
8b86e1d008 Minimize effect of http://bugs.python.org/issue8271 on TextResponses by switching the str.decode errors policy to a custom replace-like error handler 2010-04-07 00:29:53 -03:00
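A custom decode error handler of the kind this commit mentions could look roughly like the sketch below. The handler actually added by the commit is not shown in this log, so the handler name `sketch_replace`, the helper `decode_body`, and the one-byte-at-a-time policy are assumptions made for illustration, written against Python 3's `codecs.register_error`.

```python
import codecs


def replace_one_byte(exc):
    """Replace a single offending byte with U+FFFD and resume right after it.

    Assumption: skipping one byte at a time, so that valid bytes following
    an invalid sequence are not swallowed by the replacement.
    """
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    return ("\ufffd", exc.start + 1)


# "sketch_replace" is a hypothetical handler name, not one used by Scrapy.
codecs.register_error("sketch_replace", replace_one_byte)


def decode_body(body, encoding="utf-8"):
    """Decode a response body using the custom error handler."""
    return body.decode(encoding, errors="sketch_replace")


decoded = decode_body(b"ab\xffcd")
```

Registering the handler once at import time keeps call sites down to passing the handler name as the `errors` argument.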
Pablo Hoffman
3fcd69c347 added a couple additional TwistedPluginSpiderManager tests 2010-04-06 10:55:21 -03:00
daniel
2cd591e8a7 add missing dropin.cache file required by default spidermanager tests 2010-04-06 07:22:50 +01:00
Daniel Grana
0b07742adb gb2312 and gbk encodings were superseded by gb18030 2010-04-05 15:07:43 -03:00
Pablo Hoffman
0dfec04439 made Spider name required again (do not default) 2010-04-05 12:34:29 -03:00
Daniel Grana
70ac6642d5 SEP-012: bugfix backward compatibility of Spider.domain_name and Spider.extra_domain_names
--HG--
extra : rebase_source : 66f779cddc6854092951078d443dbf9113f7576a
2010-04-05 12:09:43 -03:00
Pablo Hoffman
77a4d9aba9 use a default name for spiders constructed without names 2010-04-05 11:53:22 -03:00
Pablo Hoffman
c99e1af766 Added support for passing generic arguments to spider constructors (refs #152), extended Spider tests, added unittests for TwistedPluginSpiderManager 2010-04-05 11:27:19 -03:00
Pablo Hoffman
de32612c99 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-04-02 02:49:51 -03:00
Pablo Hoffman
dfdac356af added missing default values to file exporter doc 2010-04-02 02:49:18 -03:00
Rolando Espinoza La fuente
db5c3df679 SEP12 implementation
  * Rename BaseSpider.domain_name to BaseSpider.name

    This patch implements the domain_name to name change in the BaseSpider class and
    changes all spider instantiations to use the new attribute.

  * Add allowed_domains to spider

    This patch implements the merging of spider.domain_name and
    spider.extra_domain_names in spider.allowed_domains for offsite checking
    purposes.

    Note that spider.domain_name is not touched by this patch, only not used.

  * Remove spider.domain_name references from scrapy.stats

    * Rename domain_stats to spider_stats in MemoryStatsCollector
    * Use ``spider`` instead of ``domain`` in SimpledbStatsCollector
    * Rename domain_stats_history table to spider_data_history and rename domain
    field to spider in MysqlStatsCollector

  * Refactor genspider command

    The new signature for genspider is: genspider [options] <domain_name>.

    Genspider uses domain_name for spider name and for the module name.

  * Remove spider.domain_name references

  * Update crawl command signature <spider|url>

  * docs: updated references to domain_name

  * examples/experimental: use spider.name

  * genspider: require <name> <domain>

  * spidermanager: renamed crawl_domain to crawl_spider_name

  * spiderctl: updated references of *domain* to spider

  * added backward compatibility with legacy spider attributes
    'domain_name' and 'extra_domain_names'
2010-04-01 18:27:22 -03:00
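The merging of domain_name and extra_domain_names into allowed_domains, together with the backward compatibility for legacy attributes, can be sketched as follows. This is a minimal illustration and not the code from the commit: `SketchSpider` and `LegacySpider` are hypothetical classes, and falling back to domain_name for the spider name is an assumption.

```python
class SketchSpider:
    """Hypothetical stand-in for a spider base class; not Scrapy's BaseSpider."""

    name = None
    allowed_domains = None

    def __init__(self, name=None):
        if name is not None:
            self.name = name
        # Old-style subclasses may still define the legacy attributes.
        legacy_domain = getattr(self, "domain_name", None)
        legacy_extra = list(getattr(self, "extra_domain_names", []))
        # Assumption: use the legacy domain_name when no name was given.
        if self.name is None and legacy_domain is not None:
            self.name = legacy_domain
        # Merge both legacy attributes into allowed_domains.
        if self.allowed_domains is None:
            merged = [legacy_domain] if legacy_domain else []
            self.allowed_domains = merged + legacy_extra
        if not self.name:
            raise ValueError("a spider name is required")


class LegacySpider(SketchSpider):
    """A spider written against the old attribute names."""

    domain_name = "example.com"
    extra_domain_names = ["example.org"]


legacy = LegacySpider()
print(legacy.name, legacy.allowed_domains)
```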
Rolando Espinoza La fuente
35a7059636 cleanup and refactor of parse & fetch commands
 * removed scrapy.utils.fetch
 * each command schedules requests and starts the scrapy engine
 * fetch command instantiates BaseSpider if the given url does not match any spider or matches more than one
 * parse command schedules the url if one spider matches
 * parse and fetch don't support multiple urls as parameters
 * force spider behavior: --spider moved from BaseCommand to only these commands: fetch, parse, crawl
2010-04-01 17:16:38 -03:00
Rolando Espinoza La fuente
dd477914db spidermanager refactoring
* Implements find/create methods in the Spider Manager API; removed fromdomain and fromurl

    This method is now in charge of spider resolution: it must return the spider
    object for its argument or raise KeyError if no spider is found.

    This method obsoletes the from_domain and from_url methods.

    The default implementation of resolve only searches against spider.name; it
    won't use spider.allowed_domains like the old fromdomain. This is why you
    must supply a spider if you want to crawl a url.

    Find methods return only available spider names, not spider instances.
    If no spider is found, they return an empty list.

Affected modules:
    * command.models (force_domain)
        * removed spiders.force_domain
    * each command passes spider to crawl_* commands
    * command.commands.*
        * crawl
            * set spider from opts.spider if arg is url
            * group urls by spider to instance spider just once
        * genspider
            * use spiders.create() to check spider id
        * parse
            * log error if more than one spider found
    * core.manager
        * on crawl_* log message if multiple spiders found for url or request
    * shell
        * prints "Multiple found" if more than one spider found for url or request
        * populate_vars(): added spider keyword parameter

    * contrib.spidermanager:
        * removed fromdomain() & fromurl()
        * new create(spider_id) -> Spider. Raises KeyError if spider not found
        * new find_by_request(request) -> list(spiders)
2010-04-01 17:16:38 -03:00
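The create/find contract described above (create raises KeyError for an unknown spider, find methods return a list of names that may be empty) can be sketched as below. The class names are hypothetical, the request is simplified to a URL string, and matching the host against allowed_domains is an assumption, since the commit message does not say how find_by_request matches.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class SketchSpider:
    """Hypothetical spider record with just the fields this sketch needs."""

    name: str
    allowed_domains: list = field(default_factory=list)


class SketchSpiderManager:
    """Hypothetical manager; not Scrapy's spider manager implementation."""

    def __init__(self, spiders):
        self._spiders = {spider.name: spider for spider in spiders}

    def create(self, spider_id):
        # A plain dict lookup raises KeyError when the name is unknown.
        return self._spiders[spider_id]

    def find_by_request(self, url):
        # Assumption: a spider matches when the host equals, or is a
        # subdomain of, one of its allowed domains.
        host = urlparse(url).hostname or ""
        return [
            name
            for name, spider in self._spiders.items()
            if any(host == d or host.endswith("." + d)
                   for d in spider.allowed_domains)
        ]


manager = SketchSpiderManager([
    SketchSpider("shop", ["example.com"]),
    SketchSpider("news", ["example.org"]),
])
matches = manager.find_by_request("http://www.example.com/item/1")
```

Returning names rather than instances leaves it to the caller to decide what to do when zero or several spiders match, which is the case the parse command and shell entries above log about.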
Rolando Espinoza La fuente
8db67b17a3 scrapy manager refactor
* ExecutionManager
    * deprecated runonce(*args)
    * changed start() to start(keep_alive=Bool)
    * changed crawl(*args) to crawl(requests, spider=None)
        * if no spider given, tries to resolve spider
          for each request
    * added crawl_url(url, spider=None)
    * added crawl_request(request, spider=None)
    * added crawl_domain(domain)
    * added crawl_spider(spider)
 * updated commands: crawl, runspider, start
 * updated webconsole
 * updated crawler
 * updated tests.test_engine
 * updated utils.fetch
2010-04-01 17:16:38 -03:00
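The crawl(requests, spider=None) behaviour, where a spider is resolved per request when none is given, might be approximated like this. `group_requests` and its `resolve` callback are hypothetical names introduced for the sketch, requests are simplified to strings, and skipping unresolved requests is an assumption.

```python
from collections import defaultdict


def group_requests(requests, spider=None, resolve=None):
    """Group requests by the spider that should handle them.

    If `spider` is given, every request goes to it. Otherwise `resolve`
    (a hypothetical callable) is asked for a spider name per request.
    """
    grouped = defaultdict(list)
    for request in requests:
        target = spider if spider is not None else resolve(request)
        if target is None:
            continue  # Assumption: unresolved requests are skipped.
        grouped[target].append(request)
    return dict(grouped)


def by_host(url):
    # Toy resolver: maps a URL to a spider name by substring match.
    return "shop" if "example.com" in url else None


urls = ["http://example.com/a", "http://other.test/", "http://example.com/b"]
resolved = group_requests(urls, resolve=by_host)
forced = group_requests(urls, spider="manual")
```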
Pablo Hoffman
32f9c5fe68 removed old untested (and probably broken) code 2010-04-01 04:05:53 -03:00
Pablo Hoffman
4dc886e319 Improved comment 2010-03-31 18:26:35 -03:00
Pablo Hoffman
83d5eff0b7 More refactoring to encoding handling in TextResponse and subclasses 2010-03-31 18:21:41 -03:00
Pablo Hoffman
de896fa62d Refactored implementation of Request.replace() and Response.replace() 2010-03-31 16:29:53 -03:00
Pablo Hoffman
2ed8a5bfb5 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-03-27 13:25:06 -03:00
Pablo Hoffman
2f75839e7a Ignore noisy Twisted deprecation warnings 2010-03-27 13:23:13 -03:00
Pablo Hoffman
2299deda66 updated wrong link in doc 2010-03-26 14:02:33 -03:00
Pablo Hoffman
7cf2f87e27 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-03-26 08:29:34 -03:00
Pablo Hoffman
f19c939925 fixed doc typo 2010-03-26 08:28:32 -03:00
Daniel Grana
996a1b3574 fix handling of relative base urls in get_base_url util
--HG--
extra : rebase_source : eb552219e6bf40bc0d2e35968c367105233b6ecc
2010-03-25 15:50:34 -03:00
Pablo Hoffman
1330697c3d Some improvements to Response encoding support:
* added encoding aliases, configurable through a new ENCODING_ALIASES setting
* Response.encoding now returns the real encoding detected for the body
* simplified TextResponse API by removing body_encoding() and
  headers_encoding() methods
* Response.encoding now tries to infer the encoding from the body always (it
  was done before only on HtmlResponse and TextResponse)
* removed scrapy.utils.encoding.add_encoding_alias() function
* updated implementation of scrapy.utils.response function to reflect these API
  changes
* updated documentation to reflect API changes
2010-03-25 15:47:10 -03:00
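An alias table like the one the ENCODING_ALIASES setting introduces can be applied before the codec lookup, roughly as sketched here. The alias values are illustrative (taken from the gb2312/gbk note in a later commit), not the setting's documented default, and `resolve_encoding` is a hypothetical helper.

```python
import codecs

# Illustrative alias table; the real default of ENCODING_ALIASES is not
# shown in this log.
ENCODING_ALIASES = {"gb2312": "gb18030", "gbk": "gb18030"}


def resolve_encoding(name, aliases=ENCODING_ALIASES):
    """Return the canonical codec name for `name`, or None if unknown."""
    key = name.strip().lower()
    target = aliases.get(key, key)
    try:
        return codecs.lookup(target).name
    except LookupError:
        return None


resolved_gbk = resolve_encoding("GBK")
resolved_unknown = resolve_encoding("no-such-codec")
```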
Daniel Grana
173e94386b Support relative url used in base tag. closes #148
--HG--
extra : rebase_source : 1bff87c127a7e9d8d12c772b3068feb11eb5d97f
2010-03-25 12:38:37 -03:00
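Resolving a relative URL found in a base tag against the response URL can be sketched with the standard library. This is not the get_base_url implementation from the commit: `get_base_url_sketch` is a hypothetical function and the regular expression is a simplification that ignores many HTML edge cases.

```python
import re
from urllib.parse import urljoin

# Simplified pattern for <base href="...">; an assumption for this sketch.
BASE_HREF_RE = re.compile(r'<base\s[^>]*href\s*=\s*["\']?([^"\'\s>]+)', re.I)


def get_base_url_sketch(html, response_url):
    """Return the document base URL, resolving a relative <base href>."""
    match = BASE_HREF_RE.search(html)
    if match is None:
        return response_url
    # urljoin leaves absolute hrefs untouched and resolves relative ones.
    return urljoin(response_url, match.group(1))


page = '<html><head><base href="/static/"></head></html>'
base = get_base_url_sketch(page, "http://example.com/a/b.html")
```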
Pablo Hoffman
9ddcd1095d sort setting alphabetically 2010-03-25 11:45:06 -03:00
Pablo Hoffman
cb49567ca6 Removed wrong line added in previous commit 2010-03-24 12:15:18 -03:00
Pablo Hoffman
45411926b5 Improved encoding support by explicitly passing encoding to all str_to_unicode() and unicode_to_str() calls 2010-03-24 12:14:07 -03:00
Pablo Hoffman
4fa833c849 Added LOG_ENCODING setting 2010-03-24 12:13:38 -03:00
Pablo Hoffman
87e68e7438 Made MailSender non IO-blocking, and improved MailSender documentation 2010-03-22 13:37:37 -03:00
Pablo Hoffman
1dfc79b5d0 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-03-20 20:48:11 -03:00
Pablo Hoffman
99a876754c Improved "What else?" section of "Scrapy at a glance" overview 2010-03-20 20:24:18 -03:00
Pablo Hoffman
264cd2e035 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-03-19 10:32:42 -03:00
Pablo Hoffman
234fd709ad fixed doc typo (thanks Victor) 2010-03-19 10:32:17 -03:00
Daniel Grana
184cf6684f Remove HttpException references from docs. Since 0.7, scrapy returns non-200 as Response objects and does not raise HttpException anymore 2010-03-18 10:05:33 -03:00
Pablo Hoffman
403a21ec74 removed obsolete scrapy.crawler module 2010-03-12 17:28:33 -02:00
Daniel Grana
17091902f3 Explicitly say where to save item class in "Defining our item" section of tutorial 2010-03-12 14:12:49 -02:00
Pablo Hoffman
54ae2c36d0 better implementation of open_in_browser() tests 2010-03-12 10:19:50 -02:00
Pablo Hoffman
38a296aa2c Added tests to open_in_browser() function 2010-03-12 09:52:39 -02:00
Pablo Hoffman
2ab94d75e2 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-03-12 09:32:35 -02:00
Pablo Hoffman
c5cd8b9d3d Fixed bug in open_in_browser() function with Python 2.5 (closes #145). 2010-03-12 09:31:05 -02:00
Pablo Hoffman
39e4df0cff removed unmaintained (and untested) contrib_exp ShoveItemPipeline 2010-03-10 00:10:36 -02:00
Pablo Hoffman
a505a9d490 minor code refactoring on scrapy.command.cmdline module 2010-03-04 11:09:16 -02:00
Pablo Hoffman
4c1ec0c97e replaced hacky command_executed dict by standard signal 2010-03-04 10:58:18 -02:00
Pablo Hoffman
861f9691c7 removed partly-obsolete module scrapy.contrib.groupsettings 2010-03-04 10:40:41 -02:00
Pablo Hoffman
d12cd22d5e switched default scheduler order to DFO, which consumes less memory by default 2010-03-04 10:15:58 -02:00
Daniel Grana
700be3202b Automated merge with ssh://hg.scrapy.org/scrapy-0.8 2010-02-24 15:44:41 -02:00
Daniel Grana
2322322ee6 Add missing priority and errback arguments to Request.replace method signature 2010-02-24 15:43:09 -02:00