mirror of https://github.com/scrapy/scrapy.git synced 2025-02-24 22:43:57 +00:00

63 Commits

Author SHA1 Message Date
grammy-jiang
cb76b88331 fix a mistake in topic spider-middleware.rst 2018-04-04 05:56:05 -04:00
Jesse Bakker
0b14cb44aa Added from_crawler to middleware docs 2017-11-23 15:25:43 +01:00
djunzu
8288f78a39 Add note about request.meta['depth'] in DepthMiddleware 2017-10-16 21:34:37 -02:00
Paul Tremberth
bc200d1155 Rename setting to REFERRER_POLICY (with 2 Rs) 2017-03-01 17:51:23 +01:00
Paul Tremberth
537683f945 Add autoclass directives to document built-in policies 2017-03-01 17:51:23 +01:00
Paul Tremberth
3dc09eeceb Use table for referrer policy options 2017-03-01 17:51:23 +01:00
Paul Tremberth
605935f015 Edit text 2017-03-01 17:51:23 +01:00
Paul Tremberth
eb07285a63 Reword warning on no-referrer-when-downgrade policy 2017-03-01 17:51:23 +01:00
Paul Tremberth
03ff19d188 Update docs for new "referrer_policy" Request.meta key 2017-03-01 17:51:23 +01:00
Paul Tremberth
e249abc32b Update docs 2017-03-01 17:50:39 +01:00
Paul Tremberth
c86f568b9c Update docs with "strict-..." policies 2017-03-01 17:50:39 +01:00
Paul Tremberth
c9c59db489 Update documentation about REFERER_POLICY setting 2017-03-01 17:50:39 +01:00
Takehiro Shiozaki
fcb3daf4fa fix typo 2017-02-06 14:03:41 +09:00
Jose Ricardo
e12e364a40 Add details to the spider middlewares docs
Document the effects of the middleware order in a more detailed way.
2016-10-18 12:29:30 -02:00
nyov
5876b9aa30 Update documentation links 2016-03-03 16:28:33 +00:00
Νικόλαος-Διγενής Καραγιάννης
1cffa99e0d tests+doc for subdomains in offsite middleware 2016-01-26 12:49:43 +02:00
Jakob de Maeyer
e66f649894 Bring back _BASE settings 2015-11-11 17:39:56 +01:00
Jakob de Maeyer
26586ef5a6 Deprecate _BASE settings, unify _BASE backwards-compatibility 2015-10-27 12:43:23 +01:00
Julia Medina
d3f576a816 Move scrapy/spider.py to scrapy/spiders/__init__.py 2015-05-09 04:20:09 -03:00
Julia Medina
180272c092 Move scrapy/contrib/spidermiddleware to scrapy/spidermiddlewares 2015-04-29 21:26:35 -03:00
Pablo Hoffman
bb4c922d85 Merge pull request #1081 from scrapy/dict-items
Allow spiders to return dicts.
2015-03-27 15:19:27 -03:00
Mikhail Korobov
817dbc6cbd DOC mention dicts in documentation; explain better what are Items for 2015-03-19 05:16:14 +05:00
Shadab Zafar
5a58d64131 Fix some redirection links in documentation
Fixes #606
2015-03-18 19:41:26 -03:00
Mikhail Korobov
baf5c59386 Merge pull request #1071 from eliasdorneles/updating-request-meta-special-keys
updating list of Request.meta special keys
2015-03-13 16:38:19 +05:00
Elias Dorneles
f7031c08ff updating list of Request.meta special keys 2015-03-10 22:29:07 -03:00
Mikhail Korobov
283d6a5344 DOC a couple more references are fixed 2015-01-19 22:07:03 +05:00
Mikhail Korobov
73e6b35622 DOC fix a reference 2015-01-19 22:02:46 +05:00
Mikhail Korobov
e435b3e3a3 DOC simplify extension docs 2014-09-21 00:19:24 +06:00
Mikhail Korobov
2d3803672b DOC use top-level shortcuts in docs 2014-04-15 01:09:35 +06:00
Nikolaos-Digenis Karagiannis
4335420f40 SpiderMW doc typo: SWP request, response 2014-03-06 16:09:37 +02:00
Mikhail Korobov
a27d91f0a6 Rename BaseSpider to Spider. See GH-495. 2013-12-30 19:46:41 +06:00
Pablo Hoffman
f87be371a2 better names for HANDLE_* settings, and added doc 2013-11-21 14:33:17 -02:00
Steven Almeroth
f62b6660d4 doc: fix typo in spider middleware 2013-03-02 19:46:31 -06:00
Chris Tilden
aae6aed4fb fixes spelling errors in documentation 2013-01-22 14:52:18 -08:00
Pablo Hoffman
be206ca5ab added process_start_requests method to spider middlewares 2012-08-31 16:41:50 -03:00
Pablo Hoffman
4ec99117d3 fixed minor doc typo 2012-08-30 11:56:30 -03:00
stav
f1802289cd small doc typo change to get the fork rolling 2012-04-11 12:05:39 -05:00
Pablo Hoffman
8933e2f2be added REFERER_ENABLED setting, to control referer middleware 2012-03-22 16:35:14 -03:00
Pablo Hoffman
ce03ccd4ec updated documentation about DEPTH_PRIORITY and DFO/BFO crawls 2011-09-23 13:22:25 -03:00
Daniel Grana
5f1b1c05f8 Do not filter requests with dont_filter attribute set in OffsiteMiddleware 2011-09-08 15:18:10 -03:00
Pablo Hoffman
549725215e Initial support for a persistent scheduler, to support pausing and resuming
crawls.

* requests are serialized (using marshal by default) and stored on disk, using
  one queue per priority
* request priorities must be integers now
* breadth-first and depth-first crawling orders can now be configured
  through a new DEPTH_PRIORITY setting (see doc). backwards compatibility with
  SCHEDULER_ORDER was kept.
* requests that can't be serialized (for example, non serializable callbacks)
  are always kept in memory queues
* adapted crawl spider to work with persistent scheduler
2011-08-02 11:57:55 -03:00
Pablo Hoffman
e6091df551 fixed doc typo 2011-05-30 09:04:31 -03:00
Pablo Hoffman
9599bde3e9 Removed RequestLimitMiddleware 2010-09-22 16:09:13 -03:00
Pablo Hoffman
7f21a6384f Documented handle_httpstatus_list request.meta key 2010-09-09 21:50:40 -03:00
Pablo Hoffman
e9ebebb230 Removed UrlFilterMiddleware from scrapy.contrib - see this snippet for an alternative: http://snippets.scrapy.org/snippets/12/ 2010-09-07 17:51:02 -03:00
Pablo Hoffman
7b9fa7fbaa Don't filter out requests coming from spiders that don't define allowed_domains. Closes #225 2010-09-04 02:23:04 -03:00
Pablo Hoffman
9aefa242d5 Applied documentation patch provided by Lucian Ursu (closes #207) 2010-08-21 01:26:35 -03:00
Daniel Grana
c925c9e9a0 Notify spider when requests are ignored by HttpErrorMiddleware, and generally when any call to process_spider_input raises an exception 2010-05-12 16:41:06 -03:00
Rolando Espinoza La fuente
db5c3df679 SEP12 implementation
* Rename BaseSpider.domain_name to BaseSpider.name

    This patch implements the domain_name to name change in BaseSpider class and
    change all spider instantiations to use the new attribute.

  * Add allowed_domains to spider

    This patch implements the merging of spider.domain_name and
    spider.extra_domain_names in spider.allowed_domains for offsite checking
    purposes.

    Note that spider.domain_name is not touched by this patch, only not used.

  * Remove spider.domain_name references from scrapy.stats

    * Rename domain_stats to spider_stats in MemoryStatsCollector
    * Use ``spider`` instead of ``domain`` in SimpledbStatsCollector
    * Rename domain_stats_history table to spider_data_history and rename domain
    field to spider in MysqlStatsCollector

  * Refactor genspider command

    The new signature for genspider is: genspider [options] <domain_name>.

    Genspider uses domain_name for spider name and for the module name.

  * Remove spider.domain_name references

  * Update crawl command signature <spider|url>

  * docs: updated references to domain_name

  * examples/experimental: use spider.name

  * genspider: require <name> <domain>

  * spidermanager: renamed crawl_domain to crawl_spider_name

  * spiderctl: updated references of *domain* to spider

  * added backward compatibility with legacy spider's attributes
    'domain_name' and 'extra_domain_names'
2010-04-01 18:27:22 -03:00
Pablo Hoffman
415dec4e16 made offsite middleware log messages when filtering out requests 2009-11-12 10:17:21 -02:00