Pablo Hoffman
38b5793152
Some changes to telnet console:
* moved module from scrapy.management.telnet to scrapy.telnet (to minimize
nested modules)
* added signal for updating telnet console variables (fixes #165)
--HG--
rename : scrapy/management/telnet.py => scrapy/telnet.py
2010-06-02 17:49:18 -03:00
Pablo Hoffman
4595c92cc2
Core logic improvement: wait for Downloader and Scraper to close the spiders before going on and finish closing them
2010-06-01 13:49:01 -03:00
Pablo Hoffman
9523cab25c
Fixed bug that was causing the engine to notify the manager of spider closes too early
2010-06-01 11:07:04 -03:00
Ping Yin
fcdc4ee7d9
downloadermiddleware/redirect: always do "HEAD" if origin request method is HEAD
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-05-04 16:11:45 +08:00
Pablo Hoffman
031eb1e5ed
removed no longer used SpiderScheduler (obsoleted by ExecutionQueue)
2010-05-28 17:27:15 -03:00
Rolando Espinoza La fuente
e995c5c7ff
Skipped IBL tests if nltk/numpy are not available.
2010-05-28 16:53:17 -03:00
Ismael Carnales
a71dc295af
Some mail improvements and tests.
* Add mail_sent signal and use it in MailSender
* Add MAIL_DEBUG setting to not send mails when testing
* Add MailSender tests
2010-05-28 16:51:47 -03:00
Pablo Hoffman
dfa7b23959
Fixed SpiderManager tests that failed with dropin.cache write permissions errors in some cases
--HG--
rename : scrapy/tests/test_contrib_spidermanager/spider1.py => scrapy/tests/test_contrib_spidermanager/test_spiders/spider1.py
rename : scrapy/tests/test_contrib_spidermanager/spider2.py => scrapy/tests/test_contrib_spidermanager/test_spiders/spider2.py
2010-05-26 11:58:31 -03:00
Pablo Hoffman
dff763c683
Removed Scrapy engine singleton from scrapy.core.engine.scrapyengine. Now
engine can only be accessed through Scrapy Manager 'engine' attribute - i.e.
scrapy.core.manager.engine.
2010-05-26 10:29:32 -03:00
Pablo Hoffman
2d3135603e
added scrapy-ctl view command
2010-05-26 10:29:32 -03:00
Pablo Hoffman
2905a2083b
moved scrapy.command.models module to scrapy.command
2010-05-26 10:29:32 -03:00
Pablo Hoffman
14bfeabede
moved scrapy.command.cmdline module to scrapy.cmdline (keeping backwards compatibility until 0.10)
--HG--
rename : scrapy/command/cmdline.py => scrapy/cmdline.py
2010-05-26 10:29:32 -03:00
Pablo Hoffman
56abafec61
moved scrapy.command.commands module to scrapy.commands
--HG--
rename : scrapy/command/commands/__init__.py => scrapy/commands/__init__.py
rename : scrapy/command/commands/crawl.py => scrapy/commands/crawl.py
rename : scrapy/command/commands/fetch.py => scrapy/commands/fetch.py
rename : scrapy/command/commands/genspider.py => scrapy/commands/genspider.py
rename : scrapy/command/commands/list.py => scrapy/commands/list.py
rename : scrapy/command/commands/parse.py => scrapy/commands/parse.py
rename : scrapy/command/commands/runspider.py => scrapy/commands/runspider.py
rename : scrapy/command/commands/settings.py => scrapy/commands/settings.py
rename : scrapy/command/commands/shell.py => scrapy/commands/shell.py
rename : scrapy/command/commands/start.py => scrapy/commands/start.py
rename : scrapy/command/commands/startproject.py => scrapy/commands/startproject.py
2010-05-26 10:29:32 -03:00
Pablo Hoffman
cae22930c8
Added ExecutionQueue class for feeding spiders and requests to scrape. This
class can (and is meant to) be subclassed by projects that want to use a custom
mechanism for feeding spiders to crawl. For example, a queue that pulls spiders
to scrape from Amazon SQS (an example will be added soon).
Also introduced a rather big core refactoring of Scrapy manager and Scrapy
engine.
2010-05-26 10:29:32 -03:00
Pablo Hoffman
8c1feb7ae4
Ported S3ImagesStore to use boto threads. This simplifies the code and makes
the following things no longer needed:
1. custom spider for S3 requests (ex. _S3AmazonAWSSpider)
2. scrapy.contrib.aws.AWSMiddleware
3. scrapy.utils.aws
2010-05-26 10:29:32 -03:00
Daniel Grana
c8c19a8e53
Automated merge with ssh://hg.scrapy.org/scrapy
2010-05-21 17:54:41 -03:00
Daniel Grana
cce9c4da49
silence HttpError exceptions raised by httperror spidermiddleware if not handled by spider
2010-05-21 17:54:32 -03:00
Ping Yin
f2363afe6f
LinkExtractor: split _process_links from _extract_links
Separate the extraction and process logic, so we can override in subclass easier.
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-27 14:58:11 +08:00
Ping Yin
6059221716
Compose: stop process on None value by default
By doing this, we can use str.lower as a processor safely without
checking whether the given value is None.
By passing stop_on_none=False as keyword argument, this behaviour can be changed.
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-08 10:59:47 +08:00
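The behaviour described in the Compose commit above can be sketched as follows. This is a minimal stand-alone illustration, not Scrapy's actual implementation: the class body is an assumption reconstructed from the commit message, and only the `stop_on_none` keyword name comes from the source.

```python
class Compose:
    """Chain functions: each one receives the previous one's output.

    Illustrative sketch only; not the code from the Scrapy repository.
    """

    def __init__(self, *functions, stop_on_none=True):
        self.functions = functions
        self.stop_on_none = stop_on_none

    def __call__(self, value):
        for func in self.functions:
            # With the default, a None value ends processing early, so
            # processors such as str.lower never receive None.
            if value is None and self.stop_on_none:
                break
            value = func(value)
        return value


lower = Compose(str.lower)
print(lower("ABC"))
print(lower(None))
```

With `stop_on_none=False`, the same `Compose(str.lower)` would pass `None` straight to `str.lower`, which raises `TypeError`.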
Ping Yin
15b879f845
ItemLoader: Update docs for {add,replace,get}_{value,xpath}
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-05-18 17:54:25 +08:00
Ping Yin
8f53a72306
ItemLoader: add test for adding a dict value
After arg_to_iter is changed to return [arg] if arg is a dict,
the added test will pass.
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-24 21:21:12 +08:00
Ping Yin
8497301784
arg_to_iter: return [arg] if arg is a dict
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-24 21:20:23 +08:00
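A rough sketch of what the arg_to_iter change above implies: a dict is treated as a single value rather than iterated over its keys. The handling of `None`, strings, and other iterables here is an assumption made for illustration and may not match the real helper.

```python
def arg_to_iter(arg):
    """Return an iterable for `arg` (illustrative sketch, not Scrapy's code)."""
    if arg is None:
        return []
    # A dict is wrapped whole, so callers iterate over one item (the dict)
    # instead of over its keys.
    if isinstance(arg, dict):
        return [arg]
    # Assumption: strings and bytes are also treated as single values.
    if hasattr(arg, "__iter__") and not isinstance(arg, (str, bytes)):
        return arg
    return [arg]


print(arg_to_iter({"name": "foo"}))
print(arg_to_iter(["a", "b"]))
```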
Ping Yin
bd844f690b
{add,replace}_xpath: add processors, kw args and allow field_name to be None
Also add method get_xpath.
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:34:55 +08:00
Ping Yin
a6c315552c
ItemLoader: Update tests for {add,replace,get}_value
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:49:25 +08:00
Ping Yin
913b5db242
{add,replace,get}_value: accept keyword args, now only 're'
if re given, extract data from the given value by this regex
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:45:01 +08:00
Ping Yin
ddfaf6049f
{add,replace}_value: add processors args and allow field_name to be None
* value is first processed by processors before passing to input
processor
* if field_name is None, values for multiple fields may be
added/replaced. The keys of the processed value are used as the field names
* add get_value function for the processor logic
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:42:55 +08:00
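The processor and `field_name=None` behaviour listed in the commit above might look roughly like this. `SketchLoader` is a hypothetical stand-in, not the real ItemLoader: it stores values in a plain dict and omits input and output processors entirely.

```python
class SketchLoader:
    """Hypothetical, simplified loader used only to illustrate the commit."""

    def __init__(self):
        self.values = {}

    def get_value(self, value, *processors):
        # Run the value through each processor in order.
        for proc in processors:
            if value is None:
                break
            value = proc(value)
        return value

    def add_value(self, field_name, value, *processors):
        value = self.get_value(value, *processors)
        if field_name is None:
            # The processed value is a mapping: its keys name the fields.
            for name, field_value in value.items():
                self.values.setdefault(name, []).append(field_value)
        else:
            self.values.setdefault(field_name, []).append(value)


loader = SketchLoader()
loader.add_value("name", "  Foo ", str.strip)
loader.add_value(None, "price=10", lambda s: dict([s.split("=")]))
print(loader.values)
```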
Ping Yin
cf35e09d35
ItemLoader: don't limit item to Item object
Now, for example, item can be a dict
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:28:57 +08:00
Pablo Hoffman
bfd9cb42e5
Automated merge with http://hg.scrapy.org/scrapy-0.8
2010-05-17 20:11:27 -03:00
Pablo Hoffman
076cdfd585
Added documentation about contributing to Scrapy
2010-05-17 20:10:46 -03:00
Pablo Hoffman
7a55158fed
fixed documentation bug (thanks rhill for reporting)
2010-05-11 11:25:03 -03:00
Steven Almeroth
5d03405cac
FormRequest.from_response doc fix. closes #155
--HG--
extra : rebase_source : d54979f6a15e5e997072dcbbc6d43b426189312b
2010-04-26 22:28:07 -03:00
Pablo Hoffman
2121a30c74
added note about installing Zope.Interface in windows platforms
2010-04-24 18:19:52 -03:00
Daniel Grana
6c12106803
Remove Sphinx warning introduced by shorter title overline
2010-04-18 23:42:56 -03:00
Lucian Ursu
2f8c052484
#154: Language fixes to the documentation
2010-04-18 23:39:54 -03:00
Ping Yin
d42e5fdbac
linkextractor: unique after urljoin_rfc
Now, '/foo.html' and 'http://example.org/foo.html' are considered
as the same and only one is kept.
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-02 19:45:30 +08:00
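The idea in the linkextractor commit above, deduplicating links only after they have been made absolute, can be sketched like this. The standard library's `urllib.parse.urljoin` stands in for `urljoin_rfc`, and `unique_links` is a hypothetical helper name, not part of Scrapy.

```python
from urllib.parse import urljoin


def unique_links(base_url, hrefs):
    """Resolve each href against base_url, then drop duplicates (sketch)."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        # Compare absolute URLs, so a relative and an absolute form of
        # the same link count as one.
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = unique_links(
    "http://example.org/index.html",
    ["/foo.html", "http://example.org/foo.html"],
)
print(links)
```

Deduplicating before joining would keep both entries, since the raw strings differ.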
Pablo Hoffman
1868ede549
bumped embedded pydispatch to 2.0.1
2010-05-14 16:38:04 -03:00
Pablo Hoffman
02b7ca7e8c
bumped embedded BeautifulSoup to 3.0.8.1
2010-05-14 16:30:50 -03:00
Daniel Grana
e528a77fa3
Automated merge with ssh://hg.scrapy.org/scrapy
2010-05-14 20:09:29 +01:00
Daniel Grana
b2f58207a4
avoid different behaviour in urljoin between python2.5 and python2.6+. see http://bugs.python.org/issue1432
2010-05-14 20:09:07 +01:00
Pablo Hoffman
c87a29eb9e
improved docstring
2010-05-14 14:48:34 -03:00
Pablo Hoffman
31843316bc
Added new instance based learning extraction library in scrapy.contrib.ibl. Documentation and tools will be added later.
2010-05-14 14:33:26 -03:00
Ping Yin
0b3bf5c6f6
downloader_handler: test HEAD method
2010-05-04 15:50:26 +08:00
Ping Yin
0aaa74d2bd
extract_regex: encoding arg defaults to 'utf-8'
Sometimes it is not necessary to pass the encoding argument. For
example, when the text argument is unicode. So set a default encoding.
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-22 23:43:34 +08:00
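A simplified sketch of the default-encoding idea from the extract_regex commit above. The function body is an assumption written for Python 3: only the function name and the `'utf-8'` default come from the commit message, and the real helper may do more.

```python
import re


def extract_regex(regex, text, encoding="utf-8"):
    """Return all regex matches in text (illustrative sketch only)."""
    if isinstance(regex, str):
        regex = re.compile(regex)
    # The encoding only matters for byte strings; text that is already
    # unicode is used as-is, which is why a default is convenient.
    if isinstance(text, bytes):
        text = text.decode(encoding)
    return regex.findall(text)


print(extract_regex(r"\d+", "item 12, item 34"))
print(extract_regex(r"\d+", b"item 12, item 34"))
```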
Pablo Hoffman
dfdac356af
added missing default values to file exporter doc
2010-04-02 02:49:18 -03:00
Pablo Hoffman
2f75839e7a
Ignore noisy Twisted deprecation warnings
2010-03-27 13:23:13 -03:00
Pablo Hoffman
f19c939925
fixed doc typo
2010-03-26 08:28:32 -03:00
Pablo Hoffman
99a876754c
Improved "What else?" section of "Scrapy at a glance" overview
2010-03-20 20:24:18 -03:00
Pablo Hoffman
234fd709ad
fixed doc typo (thanks Victor)
2010-03-19 10:32:17 -03:00
Daniel Grana
184cf6684f
Remove HttpException references from docs. Since 0.7, scrapy returns non-200 as Response objects and does not raise HttpException anymore
2010-03-18 10:05:33 -03:00
Daniel Grana
17091902f3
Explicitly say where to save item class in "Defining our item" section of tutorial
2010-03-12 14:12:49 -02:00