scrapy

mirror of https://github.com/scrapy/scrapy.git synced 2025-03-04 10:17:47 +00:00

Author	SHA1	Message	Date
Pablo Hoffman	2fb5e62c39	doc: update overview page to point to the genspider command. refs #107	2012-04-19 02:37:22 -03:00
Pablo Hoffman	1c5294bee1	update docstring in project template to avoid confusion with genspider command, which may be considered as an advanced feature. refs #107	2012-04-19 02:35:48 -03:00
Pablo Hoffman	d567d8efbe	added note to docs/topics/firebug.rst about google directory being shut down	2012-04-19 01:34:20 -03:00
Daniel Graña	21e03729a3	lxml is the new default selector backend. closes #120	2012-04-19 00:28:27 -03:00
Daniel Graña	6bb40fe5a8	use xpath to match img tags in one shot. #119	2012-04-19 00:03:04 -03:00
Andrés Moreira	e24107feb8	fix HTMLImageLinkExtractor to work with libxml2 and lxml selectors	2012-04-18 18:05:44 -03:00
Pablo Hoffman	30ddbf624e	mention about some scrapy.xlib modules removed in the release notes	2012-04-17 12:31:18 -03:00
Pablo Hoffman	d99ee6deb9	added some missing entries to release notes	2012-04-17 12:29:48 -03:00
Daniel Graña	60ee5d6213	Merge branch 'lxml-formrequest'	2012-04-15 00:47:39 -03:00
Daniel Graña	5b0df465e7	SELECT is matched as a form input but has no type attribute. #111	2012-04-15 00:47:30 -03:00
Daniel Graña	150e5734d3	Merge pull request #111 from LucianU/lxml-formrequest Lxml formrequest	2012-04-13 12:29:04 -07:00
Daniel Graña	39395eb4f7	iteritems returns tuple elements duh!. #111	2012-04-13 16:22:25 -03:00
Daniel Graña	b88bdd05c4	do not add clickable if it is in formdata. #111	2012-04-13 16:20:35 -03:00
Daniel Graña	51e5aadd1d	simplify formdata type inferring. #111	2012-04-13 16:07:27 -03:00
Daniel Graña	a11ef7fba7	reuse LxmlDocument in FormRequest. #111	2012-04-13 15:41:58 -03:00
Daniel Graña	9e10abcc43	Merge branch 'master' into lxml-formrequest	2012-04-13 14:50:49 -03:00
Daniel Graña	1789a55f28	Merge pull request #117 from dangra/lxml-document Lxml document	2012-04-13 10:49:35 -07:00
Daniel Graña	4c7d29b7f7	Cache response's element trees using LxmlDocument similar to Libxml2Document	2012-04-13 14:43:30 -03:00
Daniel Graña	ac4f6cc17c	unify libxml2 document and factories	2012-04-13 13:54:48 -03:00
Daniel Graña	2904dc2dc0	no need for MultipleElementsFound exception. #111	2012-04-13 13:28:16 -03:00
Daniel Graña	ee1f7847a4	more lxml form fixes and test cases. #111 * Do not treat "coord" attribute specially, just pass "NN,NN" as clickdata value * Raise explicit ValueError if no clickable is found * Fix bug looking for clickables through xpath when there is more than one form * Test from_response with multiple clickdata	2012-04-13 13:09:21 -03:00
Daniel Graña	32b9f788be	lxml form request cleanup. #111 * remove unused _nons function copied from lxml.html * compute clickables only if dont_click is False * less _get_clickables function branch nesting	2012-04-13 12:50:47 -03:00
Daniel Graña	e4d22cb16a	reuse form_values() method from lxml to avoid copying code. #111	2012-04-13 10:31:59 -03:00
Daniel Graña	18a35a9fd6	Merge branch 'lxml-selectors'	2012-04-13 09:32:51 -03:00
Daniel Graña	3dbe211d29	lxml boolean results fix. oops #116	2012-04-13 09:29:56 -03:00
Daniel Graña	a338c29287	Merge pull request #116 from dangra/lxml-selectors more fixes to lxml selector incompatibilities	2012-04-13 04:38:45 -07:00
Daniel Graña	b9efa5ee73	more fixes to lxml selector incompatibilities * Do not fail parsing empty bodies * Do not fail parsing bodies with null bytes * Recode to utf8 using response.body_as_unicode() to avoid decoding bugs * Return empty results with unevaluable nodes like text or attribute nodes * Return u'1' and u'0' for boolean xpaths	2012-04-13 00:58:31 -03:00
Daniel Graña	d8ebf16fe5	Merge pull request #114 from stav/master Scrapy DOC changes	2012-04-11 12:08:45 -07:00
Pablo Hoffman	7cca916ed5	added release notes to official documentation, including all release notes since Scrapy 0.7	2012-04-11 15:53:23 -03:00
Lucian Ursu	c760cc5cd8	Copied lxml.html._nons to not rely on that module's private interface and took out check out of the for loop because it can be done only once	2012-04-11 21:40:00 +03:00
Lucian Ursu	f13a547203	Removed unnecessary iteration of formdata items	2012-04-11 21:08:43 +03:00
stav	f1802289cd	small doc typo change to get the fork rolling	2012-04-11 12:05:39 -05:00
Daniel Graña	02833e3265	fix typo in module description. closes #112	2012-04-11 10:44:31 -03:00
Daniel Graña	a0a1a5026b	do formdata encoding and serialization in one place. refs #111	2012-04-11 10:07:56 -03:00
Pablo Hoffman	4f28ffcb2c	removed no longer needed dependency on simplejson	2012-04-10 16:01:36 -03:00
Pablo Hoffman	6e8edbd72e	switched default selectors backend to lxml	2012-04-10 15:52:14 -03:00
Daniel Graña	af0e1c40f5	Avoid logging useless error messages about ignored requests in robots.txt	2012-04-10 13:37:32 -03:00
Lucian Ursu	4be6c22c4d	Removed ClientForm with its patch and tests, and BeautifulSoup	2012-04-10 10:24:30 +03:00
Lucian Ursu	df2e795278	Added test case to make sure that ambiguous clickdata is not allowed	2012-04-10 10:19:59 +03:00
Lucian Ursu	eb47849c05	Replaced ClientForm-based FormRequest with a lxml-based implementation	2012-04-10 10:18:54 +03:00
Daniel Graña	97e4003a56	do not fail handling unicode xpaths in libxml2 backed selectors	2012-04-04 17:18:31 -03:00
Pablo Hoffman	ab4dd928ee	Merge pull request #108 from kalessin/throttleslot Fix autothrottle in order to modify also inactive downloader slots, so c...	2012-04-03 17:02:31 -07:00
olveyra	e6d7afa13b	Fix autothrottle so it also modifies inactive downloader slots, so that cases fixed by the inactive slots patch also work when using autothrottle	2012-04-03 23:14:00 +00:00
Pablo Hoffman	c27f7eb7e9	Merge pull request #106 from kalessin/downloader2 dont discard slot when empty, just save in another dict in order to recycle if needed again	2012-04-02 14:39:38 -07:00
olveyra	b39cb22d83	don't discard a slot when it is empty, just save it in another dict so it can be recycled if needed again. This fix avoids continuously creating new slots in certain cases, a bug that prevents download_delay and max_concurrent_requests from working properly. The problem arises when the slot for a given domain becomes empty but the spider has not yet created further requests for that domain. This is typical when the spider creates requests one by one, or when it makes requests to multiple domains and one or more of them are created slowly enough that the slot is empty each time a response is fetched. The effect is that a new slot is created for each request under such conditions, so download_delay and max_concurrent_requests do not take effect (because they depend on an already existing slot for that domain).	2012-04-02 20:34:57 +00:00
Pablo Hoffman	e9184def35	make selector re() method use re.UNICODE flag to compile regexes	2012-04-01 00:41:03 -03:00
Pablo Hoffman	27018fced7	changed default user agent to Scrapy/0.15 (+http://scrapy.org ) and removed no longer needed BOT_VERSION setting	2012-03-23 13:45:21 -03:00
Pablo Hoffman	731c569b5c	fixed test-scrapyd.sh script after changes on insophia website	2012-03-22 16:38:28 -03:00
Pablo Hoffman	8933e2f2be	added REFERER_ENABLED setting, to control referer middleware	2012-03-22 16:35:14 -03:00
Pablo Hoffman	eed34e88cd	Merge pull request #103 from jsyeo/patch-1 fixed minor mistake in Request objects documentation	2012-03-20 19:49:31 -07:00


3091 Commits