mirror of https://github.com/scrapy/scrapy.git synced 2025-02-24 23:04:14 +00:00

2007 Commits

Author SHA1 Message Date
Daniel Grana
c925c9e9a0 Notify spider when requests are ignored by HttpErrorMiddleware, and generally when any call to process_spider_input raises an exception 2010-05-12 16:41:06 -03:00
Daniel Grana
d3ab3cf85c url_query_cleaner: cleanup and avoid rejoining key-sep-value to build the query again
--HG--
extra : rebase_source : 7c2648b6dd1c2253f1ec0f11d5e1f2ee25bd1273
2010-05-12 14:09:37 -03:00
Pablo Hoffman
3fb8058016 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-05-11 11:25:24 -03:00
Pablo Hoffman
7a55158fed fixed documentation bug (thanks rhill for reporting) 2010-05-11 11:25:03 -03:00
Pablo Hoffman
1750e233f7 moved import to top 2010-05-11 11:23:56 -03:00
Daniel Grana
ac646a3b47 url_query_cleaner: do not append ? if query is empty 2010-04-30 16:19:59 -03:00
Daniel Grana
3d731ba641 url_query_cleaner: add exclude and non-unique parameters support, also remove untested exception catching code and add missing tests 2010-04-30 09:41:11 -03:00
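The url_query_cleaner behaviour named in these commits (keep or exclude parameters, optional de-duplication, no trailing "?" on an empty query) can be sketched roughly as follows. This is an illustration only, not Scrapy's implementation: the function name `clean_query` and its argument names are hypothetical, and unlike the commit above it rebuilds the query with `urlencode`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def clean_query(url, parameters=(), exclude=False, unique=True):
    """Keep (or, with exclude=True, drop) the named query parameters.

    Hypothetical helper for illustration; not Scrapy's API.
    """
    names = set(parameters)
    parts = urlsplit(url)
    seen = set()
    kept = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if (key in names) == exclude:
            continue  # filtered out by the keep/exclude rule
        if unique and key in seen:
            continue  # keep only the first occurrence of each key
        seen.add(key)
        kept.append((key, value))
    # urlunsplit omits the "?" separator when the query string is empty.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For example, `clean_query("http://example.com/p?id=1&x=2", ["id"])` keeps only `id`, while passing `exclude=True` keeps everything except `id`.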
Daniel Grana
c0d45846b8 Automated merge with ssh://hg.scrapy.org/scrapy-0.8 2010-04-26 22:29:45 -03:00
Steven Almeroth
5d03405cac FormRequest.from_response doc fix. closes #155
--HG--
extra : rebase_source : d54979f6a15e5e997072dcbbc6d43b426189312b
2010-04-26 22:28:07 -03:00
Pablo Hoffman
81f6502e37 Automated merge with http://hg.scrapy.org/scrapy-0.8/ 2010-04-24 18:22:13 -03:00
Pablo Hoffman
2121a30c74 added note about installing Zope.Interface on Windows platforms 2010-04-24 18:19:52 -03:00
Daniel Grana
658e6f15e9 Automated merge with ssh://hg.scrapy.org/scrapy-0.8 2010-04-18 23:44:59 -03:00
Daniel Grana
6c12106803 Remove sphinx warning introduced by shorter title overline 2010-04-18 23:42:56 -03:00
Lucian Ursu
2f8c052484 #154: Language fixes to the documentation 2010-04-18 23:39:54 -03:00
Pablo Hoffman
b94abf36a3 Added scrapy.utils.py26.json to use the python2.6 json module when available, otherwise fall back to the simplejson module or scrapy.xlib.simplejson. This way we can always assume json and avoid conditional code. 2010-04-12 10:44:07 -03:00
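The fallback chain described in this commit can be sketched as a small import helper. This is an assumption about the general pattern, not the actual contents of scrapy.utils.py26; the helper name `first_available` is hypothetical.

```python
import importlib


def first_available(*module_names):
    """Return the first module in module_names that can be imported.

    Hypothetical helper; later names are only tried if earlier ones fail.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of %r could be imported" % (module_names,))


# Callers can then use `json` unconditionally, whichever module backs it.
json = first_available("json", "simplejson", "scrapy.xlib.simplejson")
```

On any interpreter that ships the standard `json` module, the first candidate succeeds and the remaining names are never imported.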
Pablo Hoffman
cd6aa72d7f fixed import 2010-04-12 10:42:07 -03:00
Pablo Hoffman
025b34e122 bugfix for python < 2.6 2010-04-11 07:07:38 -03:00
Pablo Hoffman
650d1c4fbe moved copytree() function from utils.python to utils.py26 2010-04-11 03:47:48 -03:00
Pablo Hoffman
be45acd457 added scrapy.service and scrapy.tac for running from twistd 2010-04-11 03:37:08 -03:00
Daniel Grana
0dbb5d44ae images: avoid signing images based on spider name or request hostname, use request.meta instead 2010-04-09 14:16:00 -03:00
Daniel Grana
68a875edb0 update ENCODING_ALIASES setting default value in settings documentation topic 2010-04-07 10:54:54 -03:00
Daniel Grana
8b86e1d008 Minimize effect of http://bugs.python.org/issue8271 on TextResponses by replacing the str.decode errors policy with a custom replace-like error handler 2010-04-07 00:29:53 -03:00
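A custom decode error handler of the kind this commit mentions can be registered through the standard `codecs` module. The sketch below is not the handler Scrapy shipped: the handler name `sketch_replace` and the policy of emitting one U+FFFD per offending byte are assumptions made for illustration.

```python
import codecs


def replace_one_byte(exc):
    """Emit one U+FFFD and resume decoding one byte after the error start.

    Assumed policy for illustration; only decoding errors are handled.
    """
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    return ("\ufffd", exc.start + 1)


# The name "sketch_replace" is hypothetical; any unused name works.
codecs.register_error("sketch_replace", replace_one_byte)

text = b"ab\xffcd".decode("utf-8", "sketch_replace")
```

Here the single invalid byte is replaced and the surrounding ASCII text is decoded unchanged.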
Pablo Hoffman
3fcd69c347 added a couple additional TwistedPluginSpiderManager tests 2010-04-06 10:55:21 -03:00
daniel
2cd591e8a7 add missing dropin.cache file required by default spidermanager tests 2010-04-06 07:22:50 +01:00
Daniel Grana
0b07742adb gb2312 and gbk encodings were superseded by gb18030 2010-04-05 15:07:43 -03:00
Pablo Hoffman
0dfec04439 made Spider name required again (do not default) 2010-04-05 12:34:29 -03:00
Daniel Grana
70ac6642d5 SEP-012: bugfix backward compatibility of Spider.domain_name and Spider.extra_domain_names
--HG--
extra : rebase_source : 66f779cddc6854092951078d443dbf9113f7576a
2010-04-05 12:09:43 -03:00
Pablo Hoffman
77a4d9aba9 use a default name for spiders constructed without names 2010-04-05 11:53:22 -03:00
Pablo Hoffman
c99e1af766 Added support for passing generic arguments to spider constructors (refs #152), extended Spider tests, added unittests for TwistedPluginSpiderManager 2010-04-05 11:27:19 -03:00
Pablo Hoffman
de32612c99 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-04-02 02:49:51 -03:00
Pablo Hoffman
dfdac356af added missing default values to file exporter doc 2010-04-02 02:49:18 -03:00
Rolando Espinoza La fuente
db5c3df679 SEP12 implementation
* Rename BaseSpider.domain_name to BaseSpider.name

    This patch implements the domain_name to name change in BaseSpider class and
    change all spider instantiations to use the new attribute.

  * Add allowed_domains to spider

    This patch implements the merging of spider.domain_name and
    spider.extra_domain_names in spider.allowed_domains for offsite checking
    purposes.

    Note that spider.domain_name is not touched by this patch, only not used.

  * Remove spider.domain_name references from scrapy.stats

    * Rename domain_stats to spider_stats in MemoryStatsCollector
    * Use ``spider`` instead of ``domain`` in SimpledbStatsCollector
    * Rename domain_stats_history table to spider_data_history and rename domain
    field to spider in MysqlStatsCollector

  * Refactor genspider command

    The new signature for genspider is: genspider [options] <domain_name>.

    Genspider uses domain_name for spider name and for the module name.

  * Remove spider.domain_name references

  * Update crawl command signature <spider|url>

  * docs: updated references to domain_name

  * examples/experimental: use spider.name

  * genspider: require <name> <domain>

  * spidermanager: renamed crawl_domain to crawl_spider_name

  * spiderctl: updated references of *domain* to spider

  * added backward compatibility with legacy spider attributes
    'domain_name' and 'extra_domain_names'
2010-04-01 18:27:22 -03:00
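The backward compatibility described in the SEP12 commit (legacy 'domain_name' and 'extra_domain_names' feeding the new 'name' and 'allowed_domains') might look roughly like the sketch below. The class `SketchSpider` is hypothetical and does not reproduce BaseSpider; the deprecation warning and the ValueError for a missing name are assumptions.

```python
import warnings


class SketchSpider:
    """Hypothetical stand-in for a spider base class, for illustration."""

    name = None
    allowed_domains = None

    def __init__(self, name=None):
        if name is not None:
            self.name = name
        legacy_name = getattr(self, "domain_name", None)
        if self.name is None and legacy_name is not None:
            # Assumed behaviour: accept the legacy attribute but warn.
            warnings.warn("domain_name is deprecated, use name",
                          DeprecationWarning, stacklevel=2)
            self.name = legacy_name
        if self.name is None:
            raise ValueError("spider must have a name")
        if self.allowed_domains is None:
            # Merge both legacy attributes for offsite checking.
            domains = [legacy_name] if legacy_name is not None else []
            domains.extend(getattr(self, "extra_domain_names", None) or ())
            self.allowed_domains = domains
```

A legacy subclass that only sets `domain_name` and `extra_domain_names` would then still expose `name` and a merged `allowed_domains`.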
Rolando Espinoza La fuente
35a7059636 cleanup and refactor of parse & fetch commands
* removed scrapy.utils.fetch
 * each command schedules requests and starts the scrapy engine
 * fetch command instantiates BaseSpider if the given url matches no spider or more than one
 * parse command schedules the url if exactly one spider matches
 * parse and fetch don't support multiple urls as parameters
 * the --spider option that forces spider behavior moved from BaseCommand to the fetch, parse and crawl commands only
2010-04-01 17:16:38 -03:00
Rolando Espinoza La fuente
dd477914db spidermanager refactoring
* Implements find/create method in Spider Manager API, removed fromdomain and fromurl

    This method is now in charge of spider resolution, it must return spider object
    from its argument or raise KeyError if no spider is found.

    This method obsoletes from_domain and from_url methods.

    The default implementation of resolve only searches against spider.name; it
    won't use spider.allowed_domains like the old fromdomain. This is why you
    must supply a spider if you want to crawl a url.

    Find methods return only the names of available spiders, not spider
    instances. If no spider is found, they return an empty list.

Affected modules:
    * command.models (force_domain)
        * removed spiders.force_domain
    * each command pass spider to crawl_* commands
    * command.commands.*
        * crawl
            * set spider from opts.spider if arg is url
            * group urls by spider to instance spider just once
        * genspider
            * use spiders.create() to check spider id
        * parse
            * log error if more than one spider found
    * core.manager
        * on crawl_* log message if multiple spiders found for url or request
    * shell
        * prints "Multiple found" if more than one spider found for url or request
        * populate_vars(): added spider keyword parameter

    * contrib.spidermanager:
        * removed fromdomain() & fromurl()
        * new create(spider_id) -> Spider. Raises KeyError if spider not found
        * new find_by_request(request) -> list(spiders)
2010-04-01 17:16:38 -03:00
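The create/find contract described in the spidermanager commit (create raises KeyError for an unknown spider, find returns a possibly empty list of names) can be illustrated with a toy manager. Everything here is a sketch under assumptions: `SketchSpiderManager` and `find_by_url` are hypothetical names, lookup takes a URL string rather than a request object, and matching on `allowed_domains` is an assumed strategy.

```python
from urllib.parse import urlsplit


class SketchSpiderManager:
    """Toy registry keyed by spider name; not Scrapy's spider manager."""

    def __init__(self, spider_classes):
        self._spiders = {cls.name: cls for cls in spider_classes}

    def create(self, spider_id):
        # A missing spider_id raises KeyError, as the commit describes.
        return self._spiders[spider_id]()

    def find_by_url(self, url):
        # Returns spider names, never instances; empty list if none match.
        host = urlsplit(url).hostname or ""
        return sorted(
            name
            for name, cls in self._spiders.items()
            if any(host == d or host.endswith("." + d)
                   for d in cls.allowed_domains)
        )
```

A caller that gets more than one name back has to decide which spider to use, which matches the "multiple spiders found" log messages listed in the commit.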
Rolando Espinoza La fuente
8db67b17a3 scrapy manager refactor
* ExecutionManager
    * deprecated runonce(*args)
    * changed start() to start(keep_alive=Bool)
    * changed crawl(*args) to crawl(requests, spider=None)
        * if no spider given, tries to resolve spider
          for each request
    * added crawl_url(url, spider=None)
    * added crawl_request(request, spider=None)
    * added crawl_domain(domain)
    * added crawl_spider(spider)
 * updated commands: crawl, runspider, start
 * updated webconsole
 * updated crawler
 * updated tests.test_engine
 * updated utils.fetch
2010-04-01 17:16:38 -03:00
Pablo Hoffman
32f9c5fe68 removed old untested (and probably broken) code 2010-04-01 04:05:53 -03:00
Pablo Hoffman
4dc886e319 Improved comment 2010-03-31 18:26:35 -03:00
Pablo Hoffman
83d5eff0b7 More refactoring to encoding handling in TextResponse and subclasses 2010-03-31 18:21:41 -03:00
Pablo Hoffman
de896fa62d Refactored implementation of Request.replace() and Response.replace() 2010-03-31 16:29:53 -03:00
Pablo Hoffman
2ed8a5bfb5 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-03-27 13:25:06 -03:00
Pablo Hoffman
2f75839e7a Ignore noisy Twisted deprecation warnings 2010-03-27 13:23:13 -03:00
Pablo Hoffman
2299deda66 updated wrong link in doc 2010-03-26 14:02:33 -03:00
Pablo Hoffman
7cf2f87e27 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-03-26 08:29:34 -03:00
Pablo Hoffman
f19c939925 fixed doc typo 2010-03-26 08:28:32 -03:00
Daniel Grana
996a1b3574 fix handling of relative base urls in get_base_url util
--HG--
extra : rebase_source : eb552219e6bf40bc0d2e35968c367105233b6ecc
2010-03-25 15:50:34 -03:00
Pablo Hoffman
1330697c3d Some improvements to Response encoding support:
* added encoding aliases, configurable through a new ENCODING_ALIASES setting
* Response.encoding now returns the real encoding detected for the body
* simplified TextResponse API by removing body_encoding() and
  headers_encoding() methods
* Response.encoding now tries to infer the encoding from the body always (it
  was done before only on HtmlResponse and TextResponse)
* removed scrapy.utils.encoding.add_encoding_alias() function
* updated implementation of scrapy.utils.response function to reflect these API
  changes
* updated documentation to reflect API changes
2010-03-25 15:47:10 -03:00
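Encoding aliases as described in this commit can be pictured as a lookup applied before the codec is resolved. The sketch below is not Scrapy's code: the dictionary name, the helper `resolve_encoding`, and the two alias entries (taken from the later "superseded by gb18030" commit message) are assumptions for illustration.

```python
import codecs

# Assumed alias table for illustration; the real ENCODING_ALIASES default
# is documented in the Scrapy settings topic.
ENCODING_ALIASES_SKETCH = {
    "gb2312": "gb18030",
    "gbk": "gb18030",
}


def resolve_encoding(label):
    """Map a declared encoding label to a codec name, or None if unknown."""
    key = label.strip().lower()
    key = ENCODING_ALIASES_SKETCH.get(key, key)
    try:
        return codecs.lookup(key).name
    except LookupError:
        return None
```

Resolving through an alias first means a page declared as gbk is decoded with the gb18030 codec.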
Daniel Grana
173e94386b Support relative url used in base tag. closes #148
--HG--
extra : rebase_source : 1bff87c127a7e9d8d12c772b3068feb11eb5d97f
2010-03-25 12:38:37 -03:00
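Supporting a relative url in a base tag amounts to resolving the base href against the response url before using it to resolve links. The following is a minimal sketch of that idea using the standard library, not the get_base_url implementation; the function name `effective_base_url` is hypothetical.

```python
from urllib.parse import urljoin


def effective_base_url(response_url, base_href=None):
    """Resolve an optional <base href> value against the response url."""
    if not base_href:
        return response_url
    # urljoin handles absolute, root-relative and path-relative hrefs.
    return urljoin(response_url, base_href)


base = effective_base_url("http://example.com/a/b.html", "/static/")
image_url = urljoin(base, "img.png")
```

Links found in the page are then joined against `base` rather than against the response url directly.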
Pablo Hoffman
9ddcd1095d sort setting alphabetically 2010-03-25 11:45:06 -03:00
Pablo Hoffman
cb49567ca6 Removed wrong line added in previous commit 2010-03-24 12:15:18 -03:00
Pablo Hoffman
45411926b5 Improved encoding support by explicitly passing encoding to all str_to_unicode() and unicode_to_str() calls 2010-03-24 12:14:07 -03:00