mirror of https://github.com/scrapy/scrapy.git synced 2025-02-23 23:23:44 +00:00

2070 Commits

Author SHA1 Message Date
Pablo Hoffman
247fc26598 moved scrapy.tac to extras/
--HG--
rename : bin/scrapy.tac => extras/scrapy.tac
2010-06-13 23:09:08 -03:00
Pablo Hoffman
09182efaff added scrapy-sqs.py to deployed scripts 2010-06-13 19:17:17 -03:00
Pablo Hoffman
37f71a9957 upstart script: exec twistd and use pidfile 2010-06-13 18:59:52 -03:00
Pablo Hoffman
91e1e0aff3 fixed bug and updated old code in googledir example project 2010-06-13 17:31:33 -03:00
Pablo Hoffman
bd16d1cd48 Added SMTP-AUTH support to scrapy.mail (closes #149) 2010-06-13 17:14:46 -03:00
Pablo Hoffman
495f23dea2 utils.serialize: added support for encoding Deferreds, and to refer spiders by name using 'spider::name' 2010-06-11 18:16:09 -03:00
Pablo Hoffman
1b083911e6 scrapy-ws.py: added stop command 2010-06-11 18:14:01 -03:00
Pablo Hoffman
ed5d7561f9 Added SQS Execution Queue, and example script to add spiders to the queue 2010-06-11 17:22:14 -03:00
olveyra
efe9811d92 Populate annotation metadata with data not used by IBL extractor. 2010-06-11 13:09:56 -03:00
Pablo Hoffman
ea8b5ddfd5 debian package: fix dh_auto_build confusing with Makefile, added scrapy-ws.py to deployed scripts 2010-06-11 12:48:35 -03:00
Pablo Hoffman
03912a6504 Added Ping Yin to AUTHORS 2010-06-11 11:33:02 -03:00
Pablo Hoffman
d13b50a234 Added sources and Makefile for building Debian package 2010-06-11 01:18:16 -03:00
Pablo Hoffman
d76276408e scrapy.service: fixed minor logging bug on win32 platform with different line endings 2010-06-10 14:50:06 -03:00
Pablo Hoffman
a8b80f3e2f scrapy.service: added support for logging stdout/stderr tails of finished processes 2010-06-10 14:08:54 -03:00
Pablo Hoffman
a33e8b507f scrapy.service: fixed bug with process respawning 2010-06-10 13:39:45 -03:00
Pablo Hoffman
075b59f4af some improvements and fixes to scrapy.service 2010-06-10 11:51:46 -03:00
Pablo Hoffman
6a33d6c4d0 * Added Scrapy Web Service with documentation and tests.
* Marked Web Console as deprecated.
* Removed Web Console documentation to discourage its use.
2010-06-09 13:46:22 -03:00
Pablo Hoffman
2499dfee5e removed obsolete test 2010-06-09 13:06:05 -03:00
Daniel Grana
62f5c61a9d fix broken request tests. refs #166 2010-06-09 00:44:18 -03:00
Pablo Hoffman
73305b1eb3 Added support for Requests without callbacks (#166) - the Spider.parse() method
is used in those cases.

Also removed Request.deferred attribute.
2010-06-08 18:18:02 -03:00
Pablo Hoffman
76ed9d442b Relocated some modules:
* scrapy.spider.middleware moved to scrapy.core.spidermw
* scrapy.core.scheduler.schedulers to scrapy.core.scheduler
* scrapy.core.scheduler.middleware to scrapy.core.schedulermw

Also removed dir: scrapy/core/scheduler/

--HG--
rename : scrapy/core/scheduler/schedulers.py => scrapy/core/scheduler.py
rename : scrapy/core/scheduler/middleware.py => scrapy/core/schedulermw.py
rename : scrapy/spider/middleware.py => scrapy/core/spidermw.py
2010-06-07 15:11:25 -03:00
Pablo Hoffman
72df5cb7ef removed unused code 2010-06-03 01:07:40 -03:00
Pablo Hoffman
38b5793152 Some changes to telnet console:
* moved module from scrapy.management.telnet to scrapy.telnet (to minimize
  nested modules)
* added signal for updating telnet console variables (fixes #165)

--HG--
rename : scrapy/management/telnet.py => scrapy/telnet.py
2010-06-02 17:49:18 -03:00
Pablo Hoffman
4595c92cc2 Core logic improvement: wait for Downloader and Scraper to close the spiders before going on and finish closing them 2010-06-01 13:49:01 -03:00
Pablo Hoffman
9523cab25c Fixed bug that was causing the engine to notify the manager of spider closes too early 2010-06-01 11:07:04 -03:00
Ping Yin
fcdc4ee7d9 downloadermiddleware/redirect: always do "HEAD" if origin request method is HEAD
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-05-04 16:11:45 +08:00
Pablo Hoffman
031eb1e5ed removed no longer used SpiderScheduler (obsoleted by ExecutionQueue) 2010-05-28 17:27:15 -03:00
Rolando Espinoza La fuente
e995c5c7ff Skipped IBL tests if nltk/numpy are not available. 2010-05-28 16:53:17 -03:00
Ismael Carnales
a71dc295af Some mail improvements and tests.
* Add mail_sent signal and use it in MailSender
* Add MAIL_DEBUG setting to not send mails when testing
* Add MailSender tests
2010-05-28 16:51:47 -03:00
Pablo Hoffman
dfa7b23959 Fixed SpiderManager tests that failed with dropin.cache write permissions errors in some cases
--HG--
rename : scrapy/tests/test_contrib_spidermanager/spider1.py => scrapy/tests/test_contrib_spidermanager/test_spiders/spider1.py
rename : scrapy/tests/test_contrib_spidermanager/spider2.py => scrapy/tests/test_contrib_spidermanager/test_spiders/spider2.py
2010-05-26 11:58:31 -03:00
Pablo Hoffman
dff763c683 Removed Scrapy engine singleton from scrapy.core.engine.scrapyengine. Now
the engine can only be accessed through the Scrapy Manager 'engine' attribute, i.e.
scrapy.core.manager.engine.
2010-05-26 10:29:32 -03:00
Pablo Hoffman
2d3135603e added scrapy-ctl view command 2010-05-26 10:29:32 -03:00
Pablo Hoffman
2905a2083b moved scrapy.command.models module to scrapy.command 2010-05-26 10:29:32 -03:00
Pablo Hoffman
14bfeabede moved scrapy.command.cmdline module to scrapy.cmdline (keeping backwards compatibility until 0.10)
--HG--
rename : scrapy/command/cmdline.py => scrapy/cmdline.py
2010-05-26 10:29:32 -03:00
Pablo Hoffman
56abafec61 moved scrapy.command.commands module to scrapy.commands
--HG--
rename : scrapy/command/commands/__init__.py => scrapy/commands/__init__.py
rename : scrapy/command/commands/crawl.py => scrapy/commands/crawl.py
rename : scrapy/command/commands/fetch.py => scrapy/commands/fetch.py
rename : scrapy/command/commands/genspider.py => scrapy/commands/genspider.py
rename : scrapy/command/commands/list.py => scrapy/commands/list.py
rename : scrapy/command/commands/parse.py => scrapy/commands/parse.py
rename : scrapy/command/commands/runspider.py => scrapy/commands/runspider.py
rename : scrapy/command/commands/settings.py => scrapy/commands/settings.py
rename : scrapy/command/commands/shell.py => scrapy/commands/shell.py
rename : scrapy/command/commands/start.py => scrapy/commands/start.py
rename : scrapy/command/commands/startproject.py => scrapy/commands/startproject.py
2010-05-26 10:29:32 -03:00
Pablo Hoffman
cae22930c8 Added ExecutionQueue class for feeding spiders and requests to scrape. This
class can (and is meant to) be subclassed by projects that want to use a custom
mechanism for feeding spiders to crawl. For example, a queue that pulls spiders
to scrape from Amazon SQS (an example will be added soon).

Also introduced a rather big core refactoring of Scrapy manager and Scrapy
engine.
2010-05-26 10:29:32 -03:00
Pablo Hoffman
8c1feb7ae4 Ported S3ImagesStore to use boto threads. This simplifies the code and makes
the following things no longer needed:

1. custom spider for S3 requests (ex. _S3AmazonAWSSpider)
2. scrapy.contrib.aws.AWSMiddleware
3. scrapy.utils.aws
2010-05-26 10:29:32 -03:00
Daniel Grana
c8c19a8e53 Automated merge with ssh://hg.scrapy.org/scrapy 2010-05-21 17:54:41 -03:00
Daniel Grana
cce9c4da49 silence HttpError exceptions raised by httperror spidermiddleware if not handled by spider 2010-05-21 17:54:32 -03:00
Ping Yin
f2363afe6f LinkExtractor: split _process_links from _extract_links
Separate the extraction and processing logic, so each is easier to override in a subclass.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-27 14:58:11 +08:00
Ping Yin
6059221716 Compose: stop process on None value by default
By doing this, we can use str.lower as a processor safely without
checking whether the given value is None.

By passing stop_on_none=False as keyword argument, this behaviour can be changed.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-08 10:59:47 +08:00
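The behaviour described in this commit message can be illustrated with a minimal sketch. It is a standalone stand-in written for illustration, not the code from the commit: only the name Compose and the stop_on_none keyword come from the message, and the class body is an assumption.

```python
# Illustrative stand-in for a Compose-style processor (NOT the Scrapy source).
# Only the name `Compose` and the `stop_on_none` keyword come from the commit
# message; the implementation below is assumed.
class Compose:
    def __init__(self, *functions, stop_on_none=True):
        self.functions = functions
        self.stop_on_none = stop_on_none

    def __call__(self, value):
        # Feed the output of each function into the next one.
        for func in self.functions:
            # By default, stop as soon as the running value is None, so
            # functions such as str.lower never receive None.
            if value is None and self.stop_on_none:
                break
            value = func(value)
        return value


lower = Compose(str.lower)
print(lower("ABC"))
print(lower(None))
```

With stop_on_none=False, the same call with None would reach str.lower and raise a TypeError, which is the check the commit message says callers no longer need to write.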
Ping Yin
15b879f845 ItemLoader: Update docs for {add,replace,get}_{value,xpath}
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-05-18 17:54:25 +08:00
Ping Yin
8f53a72306 ItemLoader: add test for adding a dict value
After arg_to_iter is changed to return [arg] if arg is a dict,
the added test will pass.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-24 21:21:12 +08:00
Ping Yin
8497301784 arg_to_iter: return [arg] if arg is a dict
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-24 21:20:23 +08:00
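A rough sketch of the behaviour this commit message describes, assuming arg_to_iter otherwise returns an empty list for None, wraps non-iterables in a list, and passes other iterables through. Those surrounding rules, and the str/bytes handling, are assumptions made for the illustration; only the dict case is stated in the message.

```python
# Illustrative sketch only (NOT the Scrapy source). The dict rule comes from
# the commit message; the None, str/bytes and pass-through rules are assumed.
def arg_to_iter(arg):
    if arg is None:
        return []
    # A dict is iterable, but iterating it yields only its keys, so it is
    # wrapped and treated as a single value.
    if isinstance(arg, (dict, str, bytes)) or not hasattr(arg, "__iter__"):
        return [arg]
    return arg


print(arg_to_iter({"name": "example"}))
print(arg_to_iter([1, 2]))
```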
Ping Yin
bd844f690b {add,replace}_xpath: add processors, kw args and allow field_name to be None
Also add method get_xpath.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:34:55 +08:00
Ping Yin
a6c315552c ItemLoader: Update tests for {add,replace,get}_value
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:49:25 +08:00
Ping Yin
913b5db242 {add,replace,get}_value: accept keyword args (currently only 're')
If re is given, extract data from the given value using this regex.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:45:01 +08:00
Ping Yin
ddfaf6049f {add,replace}_value: add processors args and allow field_name to be None
* value is first processed by processors before being passed to the input
  processor
* if field_name is None, values for multiple fields may be
  added/replaced. The keys of the processed value are used as the field names
* add get_value function for the processor logic

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:42:55 +08:00
Ping Yin
cf35e09d35 ItemLoader: don't limit item to Item object
Now, for example, item can be a dict

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:28:57 +08:00
Pablo Hoffman
bfd9cb42e5 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-05-17 20:11:27 -03:00