mirror of https://github.com/scrapy/scrapy.git synced 2025-03-04 10:17:47 +00:00

3091 Commits

Author SHA1 Message Date
Pablo Hoffman
2fb5e62c39 doc: update overview page to point to the genspider command. refs #107 2012-04-19 02:37:22 -03:00
Pablo Hoffman
1c5294bee1 update docstring in project template to avoid confusion with genspider command, which may be considered as an advanced feature. refs #107 2012-04-19 02:35:48 -03:00
Pablo Hoffman
d567d8efbe added note to docs/topics/firebug.rst about google directory being shut down 2012-04-19 01:34:20 -03:00
Daniel Graña
21e03729a3 lxml is the new default selector backend. closes #120 2012-04-19 00:28:27 -03:00
Daniel Graña
6bb40fe5a8 use xpath to match img tags in one shot. #119 2012-04-19 00:03:04 -03:00
Andrés Moreira
e24107feb8 fix HTMLImageLinkExtractor to work with libxml2 and lxml selectors 2012-04-18 18:05:44 -03:00
Pablo Hoffman
30ddbf624e mention about some scrapy.xlib modules removed in the release notes 2012-04-17 12:31:18 -03:00
Pablo Hoffman
d99ee6deb9 added some missing entries to release notes 2012-04-17 12:29:48 -03:00
Daniel Graña
60ee5d6213 Merge branch 'lxml-formrequest' 2012-04-15 00:47:39 -03:00
Daniel Graña
5b0df465e7 SELECT matched as form inputs but has no type attribute. #111 2012-04-15 00:47:30 -03:00
Daniel Graña
150e5734d3 Merge pull request #111 from LucianU/lxml-formrequest
Lxml formrequest
2012-04-13 12:29:04 -07:00
Daniel Graña
39395eb4f7 iteritems returns tuple elements duh!. #111 2012-04-13 16:22:25 -03:00
Daniel Graña
b88bdd05c4 do not add clickable if it is in formdata. #111 2012-04-13 16:20:35 -03:00
Daniel Graña
51e5aadd1d simplify formdata type inferring. #111 2012-04-13 16:07:27 -03:00
Daniel Graña
a11ef7fba7 reuse LxmlDocument in FormRequest. #111 2012-04-13 15:41:58 -03:00
Daniel Graña
9e10abcc43 Merge branch 'master' into lxml-formrequest 2012-04-13 14:50:49 -03:00
Daniel Graña
1789a55f28 Merge pull request #117 from dangra/lxml-document
Lxml document
2012-04-13 10:49:35 -07:00
Daniel Graña
4c7d29b7f7 Cache response's element trees using LxmlDocument similar to Libxml2Document 2012-04-13 14:43:30 -03:00
Daniel Graña
ac4f6cc17c unify libxml2 document and factories 2012-04-13 13:54:48 -03:00
Daniel Graña
2904dc2dc0 no need for MultipleElementsFound exception. #111 2012-04-13 13:28:16 -03:00
Daniel Graña
ee1f7847a4 more lxml form fixes and test cases. #111
* Do not treat "coord" attribute specially, just pass "NN,NN" as clickdata value
* Raise explicit ValueError if no clickable is found
* Fix bug looking for clickables through xpath when there is more than one form
* Test from_response with multiple clickdata
2012-04-13 13:09:21 -03:00
Daniel Graña
32b9f788be lxml form request cleanup. #111
* remove unused _nons function copied from lxml.html
* compute clickables only if dont_click is False
* less _get_clickables function branch nesting
2012-04-13 12:50:47 -03:00
Daniel Graña
e4d22cb16a reuse form_values() method from lxml to avoid copying code. #111 2012-04-13 10:31:59 -03:00
Daniel Graña
18a35a9fd6 Merge branch 'lxml-selectors' 2012-04-13 09:32:51 -03:00
Daniel Graña
3dbe211d29 lxml boolean results fix. oops #116 2012-04-13 09:29:56 -03:00
Daniel Graña
a338c29287 Merge pull request #116 from dangra/lxml-selectors
more fixes to lxml selector incompatibilities
2012-04-13 04:38:45 -07:00
Daniel Graña
b9efa5ee73 more fixes to lxml selector incompatibilities
* Do not fail parsing empty bodies
* Do not fail parsing bodies with null bytes
* Recode to utf8 using response.body_as_unicode() to avoid decoding bugs
* Return empty results with unevaluable nodes like text or attribute nodes
* Return u'1' and u'0' for boolean xpaths
2012-04-13 00:58:31 -03:00
Daniel Graña
d8ebf16fe5 Merge pull request #114 from stav/master
Scrapy DOC changes
2012-04-11 12:08:45 -07:00
Pablo Hoffman
7cca916ed5 added release notes to official documentation, including all release notes since Scrapy 0.7 2012-04-11 15:53:23 -03:00
Lucian Ursu
c760cc5cd8 Copied lxml.html._nons to not rely on that module's private interface and took out check out of the for loop because it can be done only once 2012-04-11 21:40:00 +03:00
Lucian Ursu
f13a547203 Removed unnecessary iteration of formdata items 2012-04-11 21:08:43 +03:00
stav
f1802289cd small doc typo change to get the fork rolling 2012-04-11 12:05:39 -05:00
Daniel Graña
02833e3265 fix typo in module description. closes #112 2012-04-11 10:44:31 -03:00
Daniel Graña
a0a1a5026b do formdata encoding and serialization in one place. refs #111 2012-04-11 10:07:56 -03:00
Pablo Hoffman
4f28ffcb2c removed no longer needed dependency on simplejson 2012-04-10 16:01:36 -03:00
Pablo Hoffman
6e8edbd72e switched default selectors backend to lxml 2012-04-10 15:52:14 -03:00
Daniel Graña
af0e1c40f5 Avoid logging useless error messages about ignored requests in robots.txt 2012-04-10 13:37:32 -03:00
Lucian Ursu
4be6c22c4d Removed ClientForm with its patch and tests, and BeautifulSoup 2012-04-10 10:24:30 +03:00
Lucian Ursu
df2e795278 Added test case to make sure that ambiguous clickdata is not allowed 2012-04-10 10:19:59 +03:00
Lucian Ursu
eb47849c05 Replaced ClientForm-based FormRequest with a lxml-based implementation 2012-04-10 10:18:54 +03:00
Daniel Graña
97e4003a56 do not fail handling unicode xpaths in libxml2 backed selectors 2012-04-04 17:18:31 -03:00
Pablo Hoffman
ab4dd928ee Merge pull request #108 from kalessin/throttleslot
Fix autothrottle in order to modify also inactive downloader slots, so c...
2012-04-03 17:02:31 -07:00
olveyra
e6d7afa13b Fix autothrottle in order to modify also inactive downloader slots, so cases fixed by inactive slots patch will work ok also when using autothrottle 2012-04-03 23:14:00 +00:00
Pablo Hoffman
c27f7eb7e9 Merge pull request #106 from kalessin/downloader2
dont discard slot when empty, just save in another dict in order to recycle if needed again
2012-04-02 14:39:38 -07:00
olveyra
b39cb22d83 dont discard slot when empty, just save in another dict in order to recycle if needed again.
This fix avoids continuously creating new slots in certain cases, a bug that prevents download_delay and max_concurrent_requests from working properly.

The problem arises when the slot for a given domain becomes empty, but the spider has not yet created further requests for that domain. This is typical when the spider creates requests one by one, or when it makes requests to multiple domains and requests for one or more of them are created slowly enough that the slot is empty each time a response is fetched.

The effect is that a new slot is created for each request under such conditions, so download_delay and max_concurrent_requests do not take effect (because applying them depends on an already existing slot for that domain).
2012-04-02 20:34:57 +00:00
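The slot-recycling idea described in the commit message above can be sketched roughly as follows. This is a minimal illustration only, not Scrapy's actual downloader code: the names Slot, SlotRegistry, acquire, seconds_to_wait, start and finish are hypothetical, and only the download_delay side of the fix is modelled.

```python
import time


class Slot:
    """Per-domain download state (hypothetical, heavily simplified)."""

    def __init__(self, download_delay):
        self.download_delay = download_delay
        self.in_flight = 0      # requests currently being downloaded
        self.last_seen = None   # clock value when the last download started


class SlotRegistry:
    """Keeps empty slots in a second dict instead of discarding them."""

    def __init__(self, download_delay, clock=time.monotonic):
        self.download_delay = download_delay
        self.clock = clock
        self.active = {}    # domain -> slot with requests in flight
        self.inactive = {}  # domain -> empty slot kept for reuse

    def acquire(self, domain):
        slot = self.active.get(domain)
        if slot is None:
            # Recycle a parked slot so its timing state survives;
            # only create a fresh one for a never-seen domain.
            slot = self.inactive.pop(domain, None)
            if slot is None:
                slot = Slot(self.download_delay)
            self.active[domain] = slot
        return slot

    def seconds_to_wait(self, domain):
        slot = self.acquire(domain)
        if slot.last_seen is None:
            return 0.0
        elapsed = self.clock() - slot.last_seen
        return max(0.0, slot.download_delay - elapsed)

    def start(self, domain):
        slot = self.acquire(domain)
        slot.in_flight += 1
        slot.last_seen = self.clock()

    def finish(self, domain):
        slot = self.active[domain]
        slot.in_flight -= 1
        if slot.in_flight == 0:
            # Park the empty slot rather than deleting it.
            self.inactive[domain] = self.active.pop(domain)
```

If finish() deleted the empty slot instead of parking it, the next request for that domain would get a brand-new slot with no last_seen value, and the delay would never apply. That is the failure mode the commit message describes.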
Pablo Hoffman
e9184def35 make selector re() method use re.UNICODE flag to compile regexes 2012-04-01 00:41:03 -03:00
Pablo Hoffman
27018fced7 changed default user agent to Scrapy/0.15 (+http://scrapy.org) and removed no longer needed BOT_VERSION setting 2012-03-23 13:45:21 -03:00
Pablo Hoffman
731c569b5c fixed test-scrapyd.sh script after changes on insophia website 2012-03-22 16:38:28 -03:00
Pablo Hoffman
8933e2f2be added REFERER_ENABLED setting, to control referer middleware 2012-03-22 16:35:14 -03:00
Pablo Hoffman
eed34e88cd Merge pull request #103 from jsyeo/patch-1
fixed minor mistake in Request objects documentation
2012-03-20 19:49:31 -07:00