=======  =========================================
SEP      06
Title    Extractors
Author   Ismael Carnales and a bunch of rabid mice
Created  2009-07-28
Status   Obsolete (discarded)
=======  =========================================

==========================================
SEP-006: Rename of Selectors to Extractors
==========================================

This SEP proposes a more meaningful name for XPathSelectors (or "Selectors") and
their ``x`` method.

Motivation
==========

When you use Selectors in Scrapy, your final goal is to "extract" the data that
you've selected, as the `XPath Selectors documentation
<https://docs.scrapy.org/en/latest/topics/selectors.html>`_ says (emphasis mine):

    When you’re scraping web pages, the most common task you need to perform is
    to **extract** data from the HTML source.

..

    Scrapy comes with its own mechanism for **extracting** data. They’re called
    ``XPath`` selectors (or just “selectors”, for short) because they “select”
    certain parts of the HTML document specified by ``XPath`` expressions.

..

    To actually **extract** the textual data you must call the selector
    ``extract()`` method, as follows

..

    Selectors also have a ``re()`` method for **extracting** data using regular
    expressions.

..

    For example, suppose you want to **extract** all <p> elements inside <div>
    elements. First you would get all <div> elements

Rationale
=========

Since there is no ``Extractor`` object in Scrapy, and what you ultimately want to
do with ``Selectors`` is extract data, we propose renaming ``Selectors`` to
``Extractors``. (Saying that in Scrapy you use selectors for extracting is
really weird :) )

Additional changes
==================

As the name of the method for performing selection (the ``x`` method) is neither
descriptive nor mnemonic enough and clearly clashes with the ``extract`` method
(``x`` sounds like short for "extract" in English), we propose renaming it to
``select``, ``sel`` (if brevity is required), or ``xpath``, after `lxml's
<http://lxml.de/xpathxslt.html>`_ ``xpath`` method.
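
To make the proposed naming concrete, here is a minimal sketch of how calling
code might read after the rename. It is only an illustration: the ``Extractor``
class, its ``from_text`` helper and the sample markup are hypothetical, and it is
built on the standard library's ``xml.etree.ElementTree`` (which supports only a
subset of XPath) rather than on Scrapy's or lxml's actual implementation.

```python
import xml.etree.ElementTree as ET


class Extractor:
    """Hypothetical stand-in for the proposed ``Extractor`` (not Scrapy's API)."""

    def __init__(self, elements):
        # The wrapped nodes; kept as a list so chained queries stay uniform.
        self.elements = list(elements)

    @classmethod
    def from_text(cls, text):
        # Assumes well-formed markup; real-world HTML needs a lenient parser.
        return cls([ET.fromstring(text)])

    def xpath(self, query):
        # Proposed replacement name for ``x``: narrow the selection only.
        found = []
        for element in self.elements:
            found.extend(element.findall(query))
        return Extractor(found)

    def extract(self):
        # Pull the data out of the current selection as serialized markup.
        return [ET.tostring(e, encoding="unicode").strip() for e in self.elements]


page = Extractor.from_text(
    "<html><body><div><p>one</p><p>two</p></div></body></html>"
)
# Select with ``xpath``, then extract: the two names no longer look alike.
paragraphs = page.xpath(".//div").xpath("./p").extract()
print(paragraphs)  # ['<p>one</p>', '<p>two</p>']
```

With ``xpath`` used for narrowing the selection and ``extract`` for retrieving
the data, the two steps are easy to tell apart at a glance, which is the point
of the rename.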

Bonus (ItemBuilder)
===================

After this renaming we also propose renaming ``ItemBuilder`` to ``ItemExtractor``,
because the ``ItemBuilder``/``ItemExtractor`` will act as a bridge between a set of
``Extractors`` and an ``Item``, and because it will literally "extract" an item
from a web page or set of pages.

References
==========

1. XPath Selectors (https://docs.scrapy.org/topics/selectors.html)
2. XPath and XSLT with lxml (http://lxml.de/xpathxslt.html)