mirror of https://github.com/scrapy/scrapy.git synced 2025-02-25 08:24:05 +00:00

2133 Commits

Author SHA1 Message Date
Daniel Grana
c8c19a8e53 Automated merge with ssh://hg.scrapy.org/scrapy 2010-05-21 17:54:41 -03:00
Daniel Grana
cce9c4da49 silence HttpError exceptions raised by httperror spidermiddleware if not handled by spider 2010-05-21 17:54:32 -03:00
Ping Yin
f2363afe6f LinkExtractor: split _process_links from _extract_links
Separate the extraction and process logic, so we can override in subclass easier.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-27 14:58:11 +08:00
Ping Yin
6059221716 Compose: stop process on None value by default
By doing this, we can use str.lower as a processor safely without
checking whether the given value is None.

By passing stop_on_none=False as keyword argument, this behaviour can be changed.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-08 10:59:47 +08:00
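
The behaviour this commit describes can be sketched roughly as follows. This is a minimal illustration in modern Python, not the actual Scrapy source; only the `Compose` name and the `stop_on_none` keyword come from the commit message, the rest is assumed.

```python
class Compose:
    """Chain functions; each one receives the previous one's result."""

    def __init__(self, *functions, stop_on_none=True):
        self.functions = functions
        self.stop_on_none = stop_on_none

    def __call__(self, value):
        for func in self.functions:
            # Stop early so functions such as str.lower never see None.
            if value is None and self.stop_on_none:
                break
            value = func(value)
        return value


lower = Compose(str.strip, str.lower)
result_for_text = lower("  ABC ")
result_for_none = lower(None)
```

With `stop_on_none=False`, the same call with `None` would reach `str.strip` and raise a `TypeError`.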
Ping Yin
15b879f845 ItemLoader: Update docs for {add,replace,get}_{value,xpath}
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-05-18 17:54:25 +08:00
Ping Yin
8f53a72306 ItemLoader: add test for adding a dict value
After arg_to_iter is changed to return [arg] if arg is a dict,
the added test will pass.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-24 21:21:12 +08:00
Ping Yin
8497301784 arg_to_iter: return [arg] if arg is a dict
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-24 21:20:23 +08:00
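
A rough sketch of the idea behind this change, assuming a helper that normalises any argument into something iterable. The body below is an illustration written for modern Python, not the code from the commit.

```python
def arg_to_iter(arg):
    """Return an iterable for arg, wrapping single values in a list."""
    if arg is None:
        return []
    # A dict (like a string) is treated as one value, not as a sequence
    # of keys, so it is wrapped instead of being iterated.
    if isinstance(arg, (dict, str, bytes)) or not hasattr(arg, "__iter__"):
        return [arg]
    return arg


wrapped = arg_to_iter({"name": "foo"})
untouched = arg_to_iter([1, 2])
```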
Ping Yin
bd844f690b {add,replace}_xpath: add processors, kw args and allow field_name to be None
Also add method get_xpath.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:34:55 +08:00
Ping Yin
a6c315552c ItemLoader: Update tests for {add,replace,get}_value
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:49:25 +08:00
Ping Yin
913b5db242 {add,replace,get}_value: accept keyword args, now only 're'
if re given, extract data from the given value by this regex

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:45:01 +08:00
Ping Yin
ddfaf6049f {add,replace}_value: add processors args and allow field_name to be None
  * value is first processed by processors before passing to input
    processor
  * if field_name is None, values for multiple fields may be
    added/replaced. The keys of the processed value are used as the field names
  * add get_value function for the processor logic

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:42:55 +08:00
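
The three points in this commit message might look like the following in miniature. `TinyLoader` and its `values` attribute are hypothetical names made up for this sketch; it leaves out input processors and is not the real ItemLoader.

```python
class TinyLoader:
    """Hypothetical stand-in for a loader that collects field values."""

    def __init__(self):
        self.values = {}

    def get_value(self, value, *processors):
        # Run the value through each processor in order.
        for proc in processors:
            value = proc(value)
        return value

    def add_value(self, field_name, value, *processors):
        value = self.get_value(value, *processors)
        if field_name is None:
            # The processed value is a dict: its keys name the fields.
            for name, item in value.items():
                self.values.setdefault(name, []).append(item)
        else:
            self.values.setdefault(field_name, []).append(value)


loader = TinyLoader()
loader.add_value("name", " Foo ", str.strip, str.lower)
loader.add_value(None, "price=10", lambda s: dict([s.split("=")]))
```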
Ping Yin
cf35e09d35 ItemLoader: don't limit item to Item object
Now, for example, item can be a dict

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-23 01:28:57 +08:00
Pablo Hoffman
bfd9cb42e5 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-05-17 20:11:27 -03:00
Pablo Hoffman
076cdfd585 Added documentation about contributing to Scrapy 2010-05-17 20:10:46 -03:00
Pablo Hoffman
7a55158fed fixed documentation bug (thanks rhill for reporting) 2010-05-11 11:25:03 -03:00
Steven Almeroth
5d03405cac FormRequest.from_response doc fix. closes #155
--HG--
extra : rebase_source : d54979f6a15e5e997072dcbbc6d43b426189312b
2010-04-26 22:28:07 -03:00
Pablo Hoffman
2121a30c74 added note about installing Zope.Interface in windows platforms 2010-04-24 18:19:52 -03:00
Daniel Grana
6c12106803 Remove Sphinx warning introduced by shorter title overline 2010-04-18 23:42:56 -03:00
Lucian Ursu
2f8c052484 #154: Language fixes to the documentation 2010-04-18 23:39:54 -03:00
Ping Yin
d42e5fdbac linkextractor: unique after urljoin_rfc
Now, '/foo.html' and 'http://example.org/foo.html' are considered
as the same and only one is kept.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-02 19:45:30 +08:00
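
The effect described here can be illustrated with the standard library. In this sketch `urllib.parse.urljoin` stands in for Scrapy's `urljoin_rfc`, and `unique_links` is a hypothetical helper, not the link extractor's real code.

```python
from urllib.parse import urljoin


def unique_links(base_url, hrefs):
    """Resolve each href against base_url first, then drop duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = unique_links(
    "http://example.org/index.html",
    ["/foo.html", "http://example.org/foo.html"],
)
```

Deduplicating before joining would keep both entries, because the raw strings differ.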
Pablo Hoffman
1868ede549 bumped embedded pydispatch to 2.0.1 2010-05-14 16:38:04 -03:00
Pablo Hoffman
02b7ca7e8c bumped embedded BeautifulSoup to 3.0.8.1 2010-05-14 16:30:50 -03:00
Daniel Grana
e528a77fa3 Automated merge with ssh://hg.scrapy.org/scrapy 2010-05-14 20:09:29 +01:00
Daniel Grana
b2f58207a4 avoid different behaviour in urljoin between python2.5 and python2.6+. see http://bugs.python.org/issue1432 2010-05-14 20:09:07 +01:00
Pablo Hoffman
c87a29eb9e improved docstring 2010-05-14 14:48:34 -03:00
Pablo Hoffman
31843316bc Added new instance based learning extraction library in scrapy.contrib.ibl. Documentation and tools will be added later. 2010-05-14 14:33:26 -03:00
Ping Yin
0b3bf5c6f6 downloader_handler: test HEAD method 2010-05-04 15:50:26 +08:00
Ping Yin
0aaa74d2bd extract_regex: encoding arg defaults to 'utf-8'
Sometimes it is not necessary to pass the encoding argument. For
example, when the text argument is unicode. So set a default encoding.

Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-22 23:43:34 +08:00
Pablo Hoffman
dfdac356af added missing default values to file exporter doc 2010-04-02 02:49:18 -03:00
Pablo Hoffman
2f75839e7a Ignore noisy Twisted deprecation warnings 2010-03-27 13:23:13 -03:00
Pablo Hoffman
f19c939925 fixed doc typo 2010-03-26 08:28:32 -03:00
Pablo Hoffman
99a876754c Improved "What else?" section of "Scrapy at a glance" overview 2010-03-20 20:24:18 -03:00
Pablo Hoffman
234fd709ad fixed doc typo (thanks Victor) 2010-03-19 10:32:17 -03:00
Daniel Grana
184cf6684f Remove HttpException references from docs. Since 0.7, scrapy returns non-200 as Response objects and does not raise HttpException anymore 2010-03-18 10:05:33 -03:00
Daniel Grana
17091902f3 Explicitly say where to save item class in "Defining our item" section of tutorial 2010-03-12 14:12:49 -02:00
Pablo Hoffman
c5cd8b9d3d Fixed bug in open_in_browser() function with Python 2.5 (closes #145). 2010-03-12 09:31:05 -02:00
Ping Yin
90fef3cbcd ImagePipeline: show http code when failing to download
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-02-27 18:09:50 +08:00
Ping Yin
5c60ef69ab remove_tags: add keep argument
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-24 19:08:01 +08:00
Ping Yin
94e6acebab Fix remove_tags-like functions failing to remove empty tags such as <br/>
Signed-off-by: Ping Yin <pkufranky@gmail.com>
2010-04-24 18:18:43 +08:00
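
The two remove_tags commits above (the `keep` argument and the empty-tag fix) can be pictured with a small regex-based sketch. `strip_tags` is a hypothetical function written for illustration; the pattern is an assumption and not the one used in Scrapy.

```python
import re

# Matches opening, closing and self-closing tags such as <p>, </p>, <br/>.
TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def strip_tags(text, keep=()):
    """Remove every tag from text except those named in keep."""
    kept = {name.lower() for name in keep}

    def replace(match):
        return match.group(0) if match.group(1).lower() in kept else ""

    return TAG_RE.sub(replace, text)


plain = strip_tags("<p>a<br/>b</p>")
with_paragraph = strip_tags("<p>a<br/>b</p>", keep=("p",))
```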
Daniel Grana
c925c9e9a0 Notify spider when requests are ignored by HttpErrorMiddleware, and generally when any call to process_spider_input raises an exception 2010-05-12 16:41:06 -03:00
Daniel Grana
d3ab3cf85c url_query_cleaner: cleanup and avoid rejoining key-sep-value to build the query again
--HG--
extra : rebase_source : 7c2648b6dd1c2253f1ec0f11d5e1f2ee25bd1273
2010-05-12 14:09:37 -03:00
Pablo Hoffman
3fb8058016 Automated merge with http://hg.scrapy.org/scrapy-0.8 2010-05-11 11:25:24 -03:00
Pablo Hoffman
1750e233f7 moved import to top 2010-05-11 11:23:56 -03:00
Daniel Grana
ac646a3b47 url_query_cleaner: do not append ? if query is empty 2010-04-30 16:19:59 -03:00
Daniel Grana
3d731ba641 url_query_cleaner: add exclude and non-unique parameters support, also remove untested exception catching code and add missing tests 2010-04-30 09:41:11 -03:00
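
A rough picture of the url_query_cleaner behaviour mentioned in this commit and in ac646a3b47, built on the standard library. `clean_query` and its `names`, `exclude` and `unique` parameters are assumptions made for this sketch; the real function's signature may differ, and this version re-encodes the query rather than preserving it verbatim.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def clean_query(url, names, exclude=False, unique=True):
    """Keep (or, with exclude=True, drop) the query parameters in names."""
    parts = urlsplit(url)
    seen = set()
    kept = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if (key in names) == exclude:
            continue  # filtered out by the keep/exclude rule
        if unique and key in seen:
            continue  # only the first occurrence of a key survives
        seen.add(key)
        kept.append((key, value))
    # An empty query yields a URL without a trailing "?".
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = "http://example.com/p?a=1&b=2&a=3"
only_a = clean_query(url, ["a"])
all_a = clean_query(url, ["a"], unique=False)
without_a = clean_query(url, ["a"], exclude=True)
no_query = clean_query("http://example.com/p?b=2", ["a"])
```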
Daniel Grana
c0d45846b8 Automated merge with ssh://hg.scrapy.org/scrapy-0.8 2010-04-26 22:29:45 -03:00
Pablo Hoffman
81f6502e37 Automated merge with http://hg.scrapy.org/scrapy-0.8/ 2010-04-24 18:22:13 -03:00
Daniel Grana
658e6f15e9 Automated merge with ssh://hg.scrapy.org/scrapy-0.8 2010-04-18 23:44:59 -03:00
Pablo Hoffman
b94abf36a3 Added scrapy.utils.py26.json to use python2.6 json module when available, otherwise fall back to simplejson module or scrapy.xlib.simplejson. This way we can always assume json and avoid conditional code. 2010-04-12 10:44:07 -03:00
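
The import pattern this commit describes is commonly written along these lines. This is only a sketch of the general idea: it shows the standard library and third-party `simplejson` steps and leaves out the bundled `scrapy.xlib.simplejson` copy.

```python
try:
    import json  # in the standard library since Python 2.6
except ImportError:
    # Assumed fallback for older interpreters: the third-party package.
    import simplejson as json

# Callers use one name and never need conditional code themselves.
encoded = json.dumps({"a": 1})
decoded = json.loads(encoded)
```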
Pablo Hoffman
cd6aa72d7f fixed import 2010-04-12 10:42:07 -03:00