mirror of https://github.com/scrapy/scrapy.git synced 2025-02-26 15:24:12 +00:00

893 Commits

Author SHA1 Message Date
Daniel Grana
8d5222f1db core: add missing imports
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40893
2009-02-20 11:31:09 +00:00
Daniel Grana
4bf196fc99 core: add scheduler middleware and move duplicate filter there
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40892
2009-02-20 11:28:37 +00:00
Daniel Grana
549c30d8d5 redirect: remove stupid set
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40891
2009-02-20 11:27:46 +00:00
Daniel Grana
071aa71b68 add filters module
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40890
2009-02-19 19:53:30 +00:00
Daniel Grana
0e98526daa duplicatesfilter: add a singleton duplicates filter and adapt current middleware and redirection middleware
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40889
2009-02-19 19:43:24 +00:00
Daniel Grana
ac735a185e cleanup redirection middleware
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40888
2009-02-19 19:42:42 +00:00
Ismael Carnales
65e8be46bc add typecheck of funcs in ExtractorField.__init__
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40887
2009-02-19 19:05:45 +00:00
Ismael Carnales
edf5dfb264 added newitem.extractors based on old adaptors
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40886
2009-02-19 18:56:59 +00:00
Daniel Grana
756cda3873 duplicatefilter: dont raise ignorerequest, but add request to filter on spider input
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40885
2009-02-19 16:49:03 +00:00
Daniel Grana
fe7935005c duplicatefilter: filter request prior to reach spider
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40884
2009-02-19 16:37:33 +00:00
Daniel Grana
5338a48db8 remove missing import
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40883
2009-02-19 16:17:08 +00:00
Ismael Carnales
26bcc826e3 renamed ItemField to Field for upcoming ItemExtractor and FieldExtractor
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40882
2009-02-19 14:46:23 +00:00
Daniel Grana
7fc8d590f3 guid: cleanup guid attribute references including removal of replays and deprecated commands
--HG--
rename : scrapy/trunk/scrapy/contrib/history/__init__.py => scrapy/trunk/scrapy/contrib_exp/history/__init__.py
rename : scrapy/trunk/scrapy/contrib/history/history.py => scrapy/trunk/scrapy/contrib_exp/history/history.py
rename : scrapy/trunk/scrapy/contrib/history/middleware.py => scrapy/trunk/scrapy/contrib_exp/history/middleware.py
rename : scrapy/trunk/scrapy/contrib/history/scheduler.py => scrapy/trunk/scrapy/contrib_exp/history/scheduler.py
rename : scrapy/trunk/scrapy/contrib/history/store.py => scrapy/trunk/scrapy/contrib_exp/history/store.py
rename : scrapy/trunk/scrapy/contrib/pipeline/shoveitem.py => scrapy/trunk/scrapy/contrib_exp/pipeline/shoveitem.py
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40881
2009-02-19 14:22:23 +00:00
Ismael Carnales
81ce9bd458 calculate ItemField.default on __init__
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40880
2009-02-19 13:57:40 +00:00
Ismael Carnales
db4ccaf78f add __all__ to fields.py in newitem
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40879
2009-02-19 13:34:02 +00:00
Ismael Carnales
5c45ecb7ff turned default_value() of ItemField into a property
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40878
2009-02-19 12:45:54 +00:00
Ismael Carnales
76c84bda38 always deiter on ItemField value assignation
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40877
2009-02-19 12:39:58 +00:00
Daniel Grana
22549161a5 duplicatefilter: fix unittest to filter a request with same url as start_requests. rel r868. ref #49
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40876
2009-02-19 11:55:09 +00:00
Ismael Carnales
1017f12163 don't allow setting attributes that aren't fields, and return field default values on newitem
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40875
2009-02-19 11:40:21 +00:00
Ismael Carnales
d95878c0f7 corrected import paths in newitem
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40874
2009-02-19 11:26:01 +00:00
Pablo Hoffman
c11e5ad3dd fixed grammar error. thanks Fabio
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40873
2009-02-19 11:07:41 +00:00
samus_
c546585af0 conflict solved, reverted r869 and applied changes for r868
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40872
2009-02-19 05:12:55 +00:00
samus_
fefccfaa31 conflict among tests, reverting
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40871
2009-02-19 04:59:12 +00:00
samus_
14900a7452 fix to default parameter (must be tuple not string)
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40870
2009-02-19 04:21:49 +00:00
samus_
de13bbe7f5 fix to filter start_urls too
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40869
2009-02-19 04:19:05 +00:00
Ismael Carnales
9739f9320e move newitem from contrib to contrib.exp
--HG--
rename : scrapy/trunk/scrapy/contrib/newitem/__init__.py => scrapy/trunk/scrapy/contrib_exp/newitem/__init__.py
rename : scrapy/trunk/scrapy/contrib/newitem/fields.py => scrapy/trunk/scrapy/contrib_exp/newitem/fields.py
rename : scrapy/trunk/scrapy/contrib/newitem/models.py => scrapy/trunk/scrapy/contrib_exp/newitem/models.py
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40868
2009-02-18 18:57:26 +00:00
Ismael Carnales
3039453148 added newitem with new item model and fields
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40867
2009-02-18 18:38:26 +00:00
samus_
918908db8e removed empty dir
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40866
2009-02-18 15:22:15 +00:00
Ismael Carnales
f8affd0c3d added fields.py with ItemField class and subclasses based on Django's for new items
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40865
2009-02-18 13:36:47 +00:00
Ismael Carnales
b20ac057b3 added rule shorthand, for creating CrawlSpider rules
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40864
2009-02-18 11:18:11 +00:00
Daniel Grana
84f63b146b cluster: merge spider settings prior to run spider, not when scheduling
also reformat some long lines.

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40863
2009-02-17 14:20:19 +00:00
Pablo Hoffman
714c3e20e4 minor typo fix
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40862
2009-02-16 23:45:14 +00:00
Pablo Hoffman
489e86bd16 moved copytree function to scrapy.utils.python (more appropriate location) and fixed minor bug ('Error' not defined). also removed unused var 'project_module_path'
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40861
2009-02-16 23:43:43 +00:00
Ismael Carnales
e6458d057b completed first version of basic tutorial
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40860
2009-02-16 16:42:35 +00:00
Ismael Carnales
730575a53f define WindowsError (also in shutil from 2.6)
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40859
2009-02-16 13:30:20 +00:00
Ismael Carnales
5c1dd25284 removed spiders cache from example project, thxs Michael
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40858
2009-02-16 13:15:55 +00:00
Ismael Carnales
3e8bcafd8b added copytree from python 2.6 to utils.misc and make startproject use it to ignore .svn and .pyc files
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40857
2009-02-16 12:50:00 +00:00
Ismael Carnales
522fe78c2b remove old project example
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40856
2009-02-16 11:52:49 +00:00
Ismael Carnales
bb9a732edf updated tutorial with new googledir project from r853
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40855
2009-02-16 11:51:07 +00:00
Ismael Carnales
eb1e62a28c add new googledir example project with new structure
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40854
2009-02-16 11:49:44 +00:00
Daniel Grana
755235b568 utils: renamed load_class function as load_object
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40853
2009-02-13 17:21:50 +00:00
Daniel Grana
3876a1827f cluster: move branched cluster as experimental contrib
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40852
2009-02-13 17:20:10 +00:00
Pablo Hoffman
47cf45c916 removed redundant part of Scrapy introduction to make it simpler. thanks Ismael for pointing that out
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40851
2009-02-12 20:58:42 +00:00
Daniel Grana
8557f6bc57 duplicatefilter: lower log level of skipped requests message
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40850
2009-02-12 15:55:11 +00:00
Daniel Grana
2b14251510 cache: read metadata only when looking for cached items. refs #61
thanks Patrick Mezard for patch.

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40849
2009-02-12 08:00:07 +00:00
Daniel Grana
d68646422d storedb: gracefully fail test if mysql is not installed.
thanks Patrick Mezard for patch.

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40848
2009-02-12 07:59:40 +00:00
Daniel Grana
3e4bc6141a core: remove obsolete groupfilter code
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40847
2009-02-12 07:39:57 +00:00
Daniel Grana
c9f2865c83 core: Get rid of duplicate filtering as a scheduler builtin feature. closes #49.
Implements a DuplicatesFilterMiddleware as a spidermiddleware: a wrapper
around a filtering class with a minimal defined API, configurable through
settings.

Enabling this middleware doesn't give us the same functionality as the
scheduler's builtin duplicate filter, but it filters the most important
source of duplicate requests: the spiders.

What requests aren't filtered by the new middleware? The ones originating
from any part of scrapy outside of spiders, like S3 image requests or
any other request manually scheduled using the ``scrapyengine.schedule()``
method.

Previously, we usually added dont_filter=True to requests created
outside of spiders to avoid collisions when downloading the same pages as
a spider. Now this is no longer required, because the new middleware
filters only the spider-generated requests.

There is a caveat: as usual, downloadmiddlewares can return a Request
object at any point of the chain, and that request is scheduled and
downloaded as usual too. One of the downloadmiddlewares using this
feature is RedirectMiddleware, which relies on the scheduler's builtin
filtering to avoid redirection loops. I think we can implement a request
time-to-live counter: add it to the request's ``meta`` attribute with
a default value if not present, and decrement it each time the request is
redirected.
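
The time-to-live idea proposed above could look roughly like the sketch below. It is an illustration only, not code from this commit: ``FakeRequest``, ``follow_redirect``, the ``redirect_ttl`` meta key and the default of 20 are all hypothetical names and values chosen for the example, and nothing here imports scrapy.

```python
# Hypothetical default; the commit message does not specify a value.
DEFAULT_REDIRECT_TTL = 20


class FakeRequest:
    """Stand-in for a request object carrying a url and a ``meta`` dict."""

    def __init__(self, url, meta=None):
        self.url = url
        self.meta = dict(meta or {})


def follow_redirect(request, location, default_ttl=DEFAULT_REDIRECT_TTL):
    """Return a new request for ``location``, or None once the TTL runs out.

    The counter lives in ``request.meta`` under the (assumed) key
    ``redirect_ttl``. If the key is missing, ``default_ttl`` is used.
    """
    ttl = request.meta.get("redirect_ttl", default_ttl)
    if ttl <= 0:
        # Budget exhausted: stop following, which breaks redirect loops.
        return None
    meta = dict(request.meta)
    meta["redirect_ttl"] = ttl - 1
    return FakeRequest(location, meta)


def count_followed(start_url, redirect_map, default_ttl=DEFAULT_REDIRECT_TTL):
    """Follow redirects listed in ``redirect_map`` and count the hops taken."""
    request = FakeRequest(start_url)
    hops = 0
    while request.url in redirect_map:
        nxt = follow_redirect(request, redirect_map[request.url], default_ttl)
        if nxt is None:
            break
        request = nxt
        hops += 1
    return hops
```

With a looping map such as ``{"a": "b", "b": "a"}`` the hop count stops at the TTL instead of running forever, without needing a duplicate filter in the scheduler.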

--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40846
2009-02-12 07:07:13 +00:00
Daniel Grana
0aba276a64 tutorial: indent class docstring
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40845
2009-02-12 04:39:49 +00:00
Daniel Grana
73d8177ecc tutorial: lot of line wrapping and changes to double backticks instead of emphasized words
--HG--
extra : convert_revision : svn%3Ab85faa78-f9eb-468e-a121-7cced6da292c%40844
2009-02-12 04:38:13 +00:00