.. _topics-link-extractors:

===============
Link Extractors
===============
Link extractors are objects whose only purpose is to extract links from web
pages (:class:`scrapy.http.Response` objects) which will eventually be
followed.

There is ``scrapy.linkextractors.LinkExtractor`` available
in Scrapy, but you can create your own custom Link Extractors to suit your
needs by implementing a simple interface.

The only public method that every link extractor has is ``extract_links``,
which receives a :class:`~scrapy.http.Response` object and returns a list
of :class:`scrapy.link.Link` objects. Link extractors are meant to be
instantiated once and their ``extract_links`` method called several times
with different responses to extract links to follow.

Link extractors are used in the :class:`~scrapy.contrib.spiders.CrawlSpider`
class (available in Scrapy), through a set of rules, but you can also use
them in your spiders, even if you don't subclass from
:class:`~scrapy.contrib.spiders.CrawlSpider`, as their purpose is very
simple: to extract links.
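
For instance, here is a minimal sketch of using a link extractor directly
from a plain spider callback (the spider name and start URL below are
hypothetical, chosen only for illustration)::

    import scrapy
    from scrapy.linkextractors import LinkExtractor

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['http://www.example.com']

        # instantiate the extractor once and reuse it for every response
        link_extractor = LinkExtractor()

        def parse(self, response):
            # extract_links() returns scrapy.link.Link objects
            for link in self.link_extractor.extract_links(response):
                yield scrapy.Request(link.url, callback=self.parse)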

.. _topics-link-extractors-ref:

Built-in link extractors reference
==================================

.. module:: scrapy.linkextractors
   :synopsis: Link extractor classes

Link extractor classes bundled with Scrapy are provided in the
:mod:`scrapy.linkextractors` module.

The default link extractor is ``LinkExtractor``, which is the same as
:class:`~.LxmlLinkExtractor`::

    from scrapy.linkextractors import LinkExtractor

There used to be other link extractor classes in previous Scrapy versions,
but they are deprecated now.

LxmlLinkExtractor
-----------------

.. module:: scrapy.linkextractors.lxmlhtml
   :synopsis: lxml's HTMLParser-based link extractors

.. class:: LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)

    LxmlLinkExtractor is the recommended link extractor with handy filtering
    options. It is implemented using lxml's robust HTMLParser.

    :param allow: a single regular expression (or list of regular expressions)
        that the (absolute) urls must match in order to be extracted. If not
        given (or empty), it will match all links.
    :type allow: a regular expression (or list of)

    :param deny: a single regular expression (or list of regular expressions)
        that the (absolute) urls must match in order to be excluded (i.e. not
        extracted). It has precedence over the ``allow`` parameter. If not
        given (or empty) it won't exclude any links.
    :type deny: a regular expression (or list of)

    :param allow_domains: a single value or a list of strings containing
        domains which will be considered for extracting the links.
    :type allow_domains: str or list

    :param deny_domains: a single value or a list of strings containing
        domains which won't be considered for extracting the links.
    :type deny_domains: str or list

    :param deny_extensions: a single value or list of strings containing
        extensions that should be ignored when extracting links.
        If not given, it will default to the ``IGNORED_EXTENSIONS`` list
        defined in the `scrapy.linkextractor`_ module.
    :type deny_extensions: list

    :param restrict_xpaths: an XPath (or list of XPaths) which defines
        regions inside the response where links should be extracted from.
        If given, only the text selected by those XPaths will be scanned for
        links. See examples below.
    :type restrict_xpaths: str or list

    :param restrict_css: a CSS selector (or list of selectors) which defines
        regions inside the response where links should be extracted from.
        Has the same behaviour as ``restrict_xpaths``.
    :type restrict_css: str or list

    :param tags: a tag or a list of tags to consider when extracting links.
        Defaults to ``('a', 'area')``.
    :type tags: str or list

    :param attrs: an attribute or list of attributes which should be
        considered when looking for links to extract (only for those tags
        specified in the ``tags`` parameter). Defaults to ``('href',)``.
    :type attrs: list

    :param canonicalize: canonicalize each extracted url (using
        ``scrapy.utils.url.canonicalize_url``). Defaults to ``True``.
    :type canonicalize: boolean

    :param unique: whether duplicate filtering should be applied to extracted
        links.
    :type unique: boolean

    :param process_value: a function which receives each value extracted from
        the tag and attributes scanned and can modify the value and return a
        new one, or return ``None`` to ignore the link altogether. If not
        given, ``process_value`` defaults to ``lambda x: x``.

        .. highlight:: html

        For example, to extract links from this code::

            <a href="javascript:goToPage('../other/page.html'); return false">Link text</a>

        .. highlight:: python

        You can use the following function in ``process_value``::

            import re

            def process_value(value):
                m = re.search(r"javascript:goToPage\('(.*?)'", value)
                if m:
                    return m.group(1)

    :type process_value: callable

.. _scrapy.linkextractor: https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractor.py
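
For instance, here is a short sketch combining several of
:class:`LxmlLinkExtractor`'s filtering options (the regular expressions and
the XPath below are hypothetical, chosen only for illustration)::

    from scrapy.linkextractors import LinkExtractor

    # only extract links to category pages, skip logout links, and only
    # look for links inside the page's main content area
    link_extractor = LinkExtractor(
        allow=(r'/category/',),
        deny=(r'/logout',),
        restrict_xpaths=('//div[@id="content"]',),
    )

    # inside a spider callback, ``response`` is the page being parsed
    links = link_extractor.extract_links(response)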