mirror of https://github.com/scrapy/scrapy.git synced 2025-02-24 13:23:59 +00:00

2048 Commits

Author SHA1 Message Date
Pablo Hoffman
38b5793152 Some changes to telnet console:
* moved module from scrapy.management.telnet to scrapy.telnet (to minimize
  nested modules)
* added signal for updating telnet console variables (fixes #165)

--HG--
rename : scrapy/management/telnet.py => scrapy/telnet.py
2010-06-02 17:49:18 -03:00
Pablo Hoffman
4595c92cc2 Core logic improvement: wait for Downloader and Scraper to close the spiders before going on and finish closing them 2010-06-01 13:49:01 -03:00
Pablo Hoffman
9523cab25c Fixed bug that was causing the engine to notify the manager of spider closes too early 2010-06-01 11:07:04 -03:00
Ping Yin
fcdc4ee7d9 downloadermiddleware/redirect: always do "HEAD" if origin request method is HEAD
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-05-04 16:11:45 +08:00
Pablo Hoffman
031eb1e5ed removed no longer used SpiderScheduler (obsoleted by ExecutionQueue) 2010-05-28 17:27:15 -03:00
Rolando Espinoza La fuente
e995c5c7ff Skipped IBL tests if nltk/numpy are not available. 2010-05-28 16:53:17 -03:00
Ismael Carnales
a71dc295af Some mail improvements and tests.
* Add mail_sent signal and use it in MailSender
* Add MAIL_DEBUG setting to not send mails when testing
* Add MailSender tests
2010-05-28 16:51:47 -03:00
Pablo Hoffman
dfa7b23959 Fixed SpiderManager tests that failed with dropin.cache write permissions errors in some cases
--HG--
rename : scrapy/tests/test_contrib_spidermanager/spider1.py => scrapy/tests/test_contrib_spidermanager/test_spiders/spider1.py
rename : scrapy/tests/test_contrib_spidermanager/spider2.py => scrapy/tests/test_contrib_spidermanager/test_spiders/spider2.py
2010-05-26 11:58:31 -03:00
Pablo Hoffman
dff763c683 Removed Scrapy engine singleton from scrapy.core.engine.scrapyengine. Now
engine can only be accessed through Scrapy Manager 'engine' attribute - i.e.
scrapy.core.manager.engine.
2010-05-26 10:29:32 -03:00
Pablo Hoffman
2d3135603e added scrapy-ctl view command 2010-05-26 10:29:32 -03:00
Pablo Hoffman
2905a2083b moved scrapy.command.models module to scrapy.command 2010-05-26 10:29:32 -03:00
Pablo Hoffman
14bfeabede moved scrapy.command.cmdline module to scrapy.cmdline (keeping backwards compatibility until 0.10)
--HG--
rename : scrapy/command/cmdline.py => scrapy/cmdline.py
2010-05-26 10:29:32 -03:00
Pablo Hoffman
56abafec61 moved scrapy.command.commands module to scrapy.commands
--HG--
rename : scrapy/command/commands/__init__.py => scrapy/commands/__init__.py
rename : scrapy/command/commands/crawl.py => scrapy/commands/crawl.py
rename : scrapy/command/commands/fetch.py => scrapy/commands/fetch.py
rename : scrapy/command/commands/genspider.py => scrapy/commands/genspider.py
rename : scrapy/command/commands/list.py => scrapy/commands/list.py
rename : scrapy/command/commands/parse.py => scrapy/commands/parse.py
rename : scrapy/command/commands/runspider.py => scrapy/commands/runspider.py
rename : scrapy/command/commands/settings.py => scrapy/commands/settings.py
rename : scrapy/command/commands/shell.py => scrapy/commands/shell.py
rename : scrapy/command/commands/start.py => scrapy/commands/start.py
rename : scrapy/command/commands/startproject.py => scrapy/commands/startproject.py
2010-05-26 10:29:32 -03:00
Pablo Hoffman
cae22930c8 Added ExecutionQueue class for feeding spiders and requests to scrape. This
class can (and is meant to) be subclassed by projects that want to use a custom
mechanism for feeding spiders to crawl. For example, a queue that pulls spiders
to scrape from Amazon SQS (an example will be added soon).

Also introduced a rather big core refactoring of Scrapy manager and Scrapy
engine.
2010-05-26 10:29:32 -03:00
Pablo Hoffman
8c1feb7ae4 Ported S3ImagesStore to use boto threads. This simplifies the code and makes
the following things no longer needed:

1. custom spider for S3 requests (ex. _S3AmazonAWSSpider)
2. scrapy.contrib.aws.AWSMiddleware
3. scrapy.utils.aws
2010-05-26 10:29:32 -03:00
Daniel Grana
c8c19a8e53 Automated merge with ssh://hg.scrapy.org/scrapy 2010-05-21 17:54:41 -03:00
Daniel Grana
cce9c4da49 silence HttpError exceptions raised by httperror spidermiddleware if not handled by spider 2010-05-21 17:54:32 -03:00
Ping Yin
f2363afe6f LinkExtractor: split _process_links from _extract_links
Separate the extraction and processing logic, so each can be overridden more easily in a subclass.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-27 14:58:11 +08:00
Ping Yin
6059221716 Compose: stop process on None value by default
By doing this, we can use str.lower as a processor safely without
checking whether the given value is None.

By passing stop_on_none=False as keyword argument, this behaviour can be changed.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-08 10:59:47 +08:00
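
The stop-on-None behaviour described in the commit above can be sketched roughly as follows. This is a minimal illustration, not Scrapy's actual Compose code: the name compose_sketch is hypothetical, and only the stop_on_none keyword is taken from the commit message.

```python
def compose_sketch(*functions, stop_on_none=True):
    """Chain functions left to right (hypothetical helper, not Scrapy's Compose)."""
    def run(value):
        for func in functions:
            # With stop_on_none enabled, a None value ends the chain early,
            # so processors such as str.lower never receive None.
            if value is None and stop_on_none:
                break
            value = func(value)
        return value
    return run


lower = compose_sketch(str.strip, str.lower)
print(lower("  ABC "))
print(lower(None))
```

With stop_on_none=False the chain keeps going, so a processor such as str.lower would be called with None and raise a TypeError.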
Ping Yin
15b879f845 ItemLoader: Update docs for {add,replace,get}_{value,xpath}
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-05-18 17:54:25 +08:00
Ping Yin
8f53a72306 ItemLoader: add test for adding a dict value
After arg_to_iter is changed to return [arg] if arg is a dict,
the added test will pass.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-24 21:21:12 +08:00
Ping Yin
8497301784 arg_to_iter: return [arg] if arg is a dict
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-24 21:20:23 +08:00
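
A rough sketch of the dict handling described in the commit above, assuming a modern Python 3 interpreter. The name arg_to_iter_sketch and the treatment of str and bytes are assumptions made for this illustration and are not taken from the Scrapy source.

```python
def arg_to_iter_sketch(arg):
    """Normalize an argument into something iterable (hypothetical helper)."""
    if arg is None:
        return []
    # A dict is treated as one value, not as an iterable of its keys.
    # Treating str/bytes as single values is an assumption for Python 3.
    if isinstance(arg, (dict, str, bytes)) or not hasattr(arg, "__iter__"):
        return [arg]
    return arg


print(arg_to_iter_sketch({"name": "foo"}))
print(arg_to_iter_sketch(["a", "b"]))
```

Wrapping the dict in a list is what lets a loader add the whole dict as one value instead of iterating over its keys.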
Ping Yin
bd844f690b {add,replace}_xpath: add processors, kw args and allow field_name to be None
Also add method get_xpath.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:34:55 +08:00
Ping Yin
a6c315552c ItemLoader: Update tests for {add,replace,get}_value
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:49:25 +08:00
Ping Yin
913b5db242 {add,replace,get}_value: accept keyword args, now only 're'
If re is given, extract data from the given value using this regex.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:45:01 +08:00
Ping Yin
ddfaf6049f {add,replace}_value: add processors args and allow field_name to be None
  * value is first processed by processors before being passed to the input
    processor
  * if field_name is None, values for multiple fields may be
    added/replaced. The keys of the processed value are used as the field names
  * add get_value function for the processor logic

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:42:55 +08:00
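
The behaviour listed in the commit above might look roughly like this. MiniLoader is a hypothetical stand-in written for this illustration, not Scrapy's ItemLoader; it omits input and output processors and only mirrors the two ideas from the message (extra processors, and field_name=None spreading a dict over several fields).

```python
class MiniLoader:
    """Hypothetical, heavily simplified loader (not Scrapy's ItemLoader)."""

    def __init__(self):
        self.values = {}

    def get_value(self, value, *processors):
        # Run the value through each extra processor in order.
        for proc in processors:
            value = proc(value)
        return value

    def add_value(self, field_name, value, *processors):
        value = self.get_value(value, *processors)
        if field_name is None:
            # The processed value is expected to be a dict here:
            # its keys are used as the field names.
            for name, field_value in value.items():
                self.values.setdefault(name, []).append(field_value)
        else:
            self.values.setdefault(field_name, []).append(value)


loader = MiniLoader()
loader.add_value("name", "  Foo ", str.strip)
loader.add_value(None, {"name": "bar", "price": 10})
print(loader.values)
```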
Ping Yin
cf35e09d35 ItemLoader: don't limit item to Item object
Now, for example, item can be a dict

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:28:57 +08:00
Pablo Hoffman
bfd9cb42e5 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-05-17 20:11:27 -03:00
Pablo Hoffman
076cdfd585 Added documentation about contributing to Scrapy 2010-05-17 20:10:46 -03:00
Pablo Hoffman
7a55158fed fixed documentation bug (thanks rhill for reporting) 2010-05-11 11:25:03 -03:00
Steven Almeroth
5d03405cac FormRequest.from_response doc fix. closes #155
--HG--
extra : rebase_source : d54979f6a15e5e997072dcbbc6d43b426189312b
2010-04-26 22:28:07 -03:00
Pablo Hoffman
2121a30c74 added note about installing Zope.Interface in windows platforms 2010-04-24 18:19:52 -03:00
Daniel Grana
6c12106803 Remove Sphinx warning introduced by shorter title overline 2010-04-18 23:42:56 -03:00
Lucian Ursu
2f8c052484 #154: Language fixes to the documentation 2010-04-18 23:39:54 -03:00
Ping Yin
d42e5fdbac linkextractor: unique after urljoin_rfc
Now, '/foo.html' and 'http://example.org/foo.html' are considered
as the same and only one is kept.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-02 19:45:30 +08:00
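
The idea in the commit above, deduplicating only after links are made absolute, can be illustrated with the standard library. This sketch uses urllib.parse.urljoin in place of Scrapy's urljoin_rfc, and unique_links is a hypothetical name.

```python
from urllib.parse import urljoin


def unique_links(base_url, hrefs):
    """Resolve hrefs against base_url, then drop duplicates (order preserved)."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        # Compare absolute URLs, so relative and absolute forms of the
        # same link collapse into a single entry.
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = unique_links(
    "http://example.org/index.html",
    ["/foo.html", "http://example.org/foo.html"],
)
print(links)  # ['http://example.org/foo.html']
```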
Pablo Hoffman
1868ede549 bumped embedded pydispatch to 2.0.1 2010-05-14 16:38:04 -03:00
Pablo Hoffman
02b7ca7e8c bumped embedded BeautifulSoup to 3.0.8.1 2010-05-14 16:30:50 -03:00
Daniel Grana
e528a77fa3 Automated merge with ssh://hg.scrapy.org/scrapy 2010-05-14 20:09:29 +01:00
Daniel Grana
b2f58207a4 avoid different behaviour in urljoin between python2.5 and python2.6+. see http://bugs.python.org/issue1432 2010-05-14 20:09:07 +01:00
Pablo Hoffman
c87a29eb9e improved docstring 2010-05-14 14:48:34 -03:00
Pablo Hoffman
31843316bc Added new instance based learning extraction library in scrapy.contrib.ibl. Documentation and tools will be added later. 2010-05-14 14:33:26 -03:00
Ping Yin
0b3bf5c6f6 downloader_handler: test HEAD method 2010-05-04 15:50:26 +08:00
Ping Yin
0aaa74d2bd extract_regex: encoding arg defaults to 'utf-8'
Sometimes it is not necessary to pass the encoding argument. For
example, when the text argument is unicode. So set a default encoding.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-22 23:43:34 +08:00
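
A minimal sketch of why a default encoding is convenient, assuming Python 3 where text may arrive as str or bytes. extract_regex_sketch is a hypothetical helper; the real extract_regex in Scrapy does more than this.

```python
import re


def extract_regex_sketch(regex, text, encoding="utf-8"):
    """Return all regex matches in text (hypothetical helper)."""
    # The encoding is only needed when raw bytes are passed in;
    # already-decoded text is used as-is.
    if isinstance(text, bytes):
        text = text.decode(encoding)
    return re.findall(regex, text)


print(extract_regex_sketch(r"\d+", "item 12, item 7"))
print(extract_regex_sketch(r"\d+", b"item 12, item 7"))
```

Callers that already hold decoded text can leave out the encoding argument entirely.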
Pablo Hoffman
dfdac356af added missing default values to file exporter doc 2010-04-02 02:49:18 -03:00
Pablo Hoffman
2f75839e7a Ignore noisy Twisted deprecation warnings 2010-03-27 13:23:13 -03:00
Pablo Hoffman
f19c939925 fixed doc typo 2010-03-26 08:28:32 -03:00
Pablo Hoffman
99a876754c Improved "What else?" section of "Scrapy at a glance" overview 2010-03-20 20:24:18 -03:00
Pablo Hoffman
234fd709ad fixed doc typo (thanks Victor) 2010-03-19 10:32:17 -03:00
Daniel Grana
184cf6684f Remove HttpException references from docs. Since 0.7, scrapy returns non-200 as Response objects and does not raise HttpException anymore 2010-03-18 10:05:33 -03:00
Daniel Grana
17091902f3 Explicitly say where to save item class in "Defining our item" section of tutorial 2010-03-12 14:12:49 -02:00