.. _topics-link-extractors:

===============
Link Extractors
===============
Link extractors are objects whose only purpose is to extract links from web
pages (:class:`scrapy.http.Response` objects) which will eventually be
followed.

There is ``scrapy.linkextractors.LinkExtractor`` available
in Scrapy, but you can create your own custom Link Extractors to suit your
needs by implementing a simple interface.

The only public method that every link extractor has is ``extract_links``,
which receives a :class:`~scrapy.http.Response` object and returns a list
of :class:`scrapy.link.Link` objects. Link extractors are meant to be
instantiated once and their ``extract_links`` method called several times
with different responses to extract links to follow.

Link extractors are used in the :class:`~scrapy.contrib.spiders.CrawlSpider`
class (available in Scrapy), through a set of rules, but you can also use
them in your spiders, even if you don't subclass from
:class:`~scrapy.contrib.spiders.CrawlSpider`, as their purpose is very
simple: to extract links.
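
For instance, here is a minimal sketch of using a link extractor directly
from a plain spider callback (the spider name and start URL below are
hypothetical, chosen only for illustration)::

    import scrapy
    from scrapy.linkextractors import LinkExtractor

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['http://www.example.com']

        # instantiate the extractor once and reuse it for every response
        link_extractor = LinkExtractor()

        def parse(self, response):
            # extract_links() returns scrapy.link.Link objects
            for link in self.link_extractor.extract_links(response):
                yield scrapy.Request(link.url, callback=self.parse)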

.. _topics-link-extractors-ref:

Built-in link extractors reference
==================================

.. module:: scrapy.linkextractors
   :synopsis: Link extractor classes

Link extractor classes bundled with Scrapy are provided in the
:mod:`scrapy.linkextractors` module.

The default link extractor is ``LinkExtractor``, which is the same as
:class:`~.LxmlLinkExtractor`::

    from scrapy.linkextractors import LinkExtractor

There used to be other link extractor classes in previous Scrapy versions,
but they are deprecated now.

LxmlLinkExtractor
-----------------

.. module:: scrapy.linkextractors.lxmlhtml
   :synopsis: lxml's HTMLParser-based link extractors

.. class:: LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)

    LxmlLinkExtractor is the recommended link extractor with handy filtering
    options. It is implemented using lxml's robust HTMLParser.

    :param allow: a single regular expression (or list of regular expressions)
        that the (absolute) urls must match in order to be extracted. If not
        given (or empty), it will match all links.
    :type allow: a regular expression (or list of)

    :param deny: a single regular expression (or list of regular expressions)
        that the (absolute) urls must match in order to be excluded (i.e. not
        extracted). It has precedence over the ``allow`` parameter. If not
        given (or empty) it won't exclude any links.
    :type deny: a regular expression (or list of)

    :param allow_domains: a single value or a list of strings containing
        domains which will be considered for extracting the links.
    :type allow_domains: str or list

    :param deny_domains: a single value or a list of strings containing
        domains which won't be considered for extracting the links.
    :type deny_domains: str or list

    :param deny_extensions: a single value or list of strings containing
        extensions that should be ignored when extracting links.
        If not given, it will default to the ``IGNORED_EXTENSIONS`` list
        defined in the `scrapy.linkextractor`_ module.
    :type deny_extensions: list

    :param restrict_xpaths: an XPath (or list of XPaths) which defines
        regions inside the response where links should be extracted from.
        If given, only the text selected by those XPaths will be scanned for
        links. See examples below.
    :type restrict_xpaths: str or list

    :param restrict_css: a CSS selector (or list of selectors) which defines
        regions inside the response where links should be extracted from.
        Has the same behaviour as ``restrict_xpaths``.
    :type restrict_css: str or list

    :param tags: a tag or a list of tags to consider when extracting links.
        Defaults to ``('a', 'area')``.
    :type tags: str or list

    :param attrs: an attribute or list of attributes which should be
        considered when looking for links to extract (only for those tags
        specified in the ``tags`` parameter). Defaults to ``('href',)``.
    :type attrs: list

    :param canonicalize: canonicalize each extracted url (using
        ``scrapy.utils.url.canonicalize_url``). Defaults to ``True``.
    :type canonicalize: boolean

    :param unique: whether duplicate filtering should be applied to extracted
        links.
    :type unique: boolean

    :param process_value: a function which receives each value extracted from
        the tag and attributes scanned and can modify the value and return a
        new one, or return ``None`` to ignore the link altogether. If not
        given, ``process_value`` defaults to ``lambda x: x``.

        .. highlight:: html

        For example, to extract links from this code::

            <a href="javascript:goToPage('../other/page.html'); return false">Link text</a>

        .. highlight:: python

        You can use the following function in ``process_value``::

            import re

            def process_value(value):
                m = re.search(r"javascript:goToPage\('(.*?)'", value)
                if m:
                    return m.group(1)

    :type process_value: callable

.. _scrapy.linkextractor: https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractor.py
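
For instance, here is a short sketch combining several of
:class:`LxmlLinkExtractor`'s filtering options (the regular expressions and
the XPath below are hypothetical, chosen only for illustration)::

    from scrapy.linkextractors import LinkExtractor

    # only extract links to category pages, skip logout links, and only
    # look for links inside the page's main content area
    link_extractor = LinkExtractor(
        allow=(r'/category/',),
        deny=(r'/logout',),
        restrict_xpaths=('//div[@id="content"]',),
    )

    # inside a spider callback, ``response`` is the page being parsed
    links = link_extractor.extract_links(response)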