Trafilatura: A Web Scraping Library and Command-Line Tool
for Text Discovery and Extraction
Adrien Barbaresi
Center for Digital Lexicography of German (ZDL)
Berlin-Brandenburg Academy of Sciences (BBAW)
Jägerstr. 22-23, 10117 Berlin, Germany
barbaresi@bbaw.de
Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 122–131, August 1st - August 6th, 2021.
©2021 Association for Computational Linguistics

Abstract

An essential operation in web corpus construction consists in retaining the desired content while discarding the rest. Another challenge is finding one’s way through websites. This article introduces a text discovery and extraction tool published under open-source license. Its installation and use are straightforward, notably from Python and on the command-line. The software allows for main text, comments and metadata extraction, while also providing building blocks for web crawling tasks. A comparative evaluation on real-world data also shows its interest as well as the performance of other available solutions.

The contributions of this paper are threefold: it references the software, features a benchmark, and provides a meaningful baseline for similar tasks. The tool performs significantly better than other open-source solutions in this evaluation and in external benchmarks.

1 Introduction

1.1 Gathering texts from the Web

As useful monolingual text corpora across languages are highly relevant for the NLP community (Caswell et al., 2020), web corpora seem to be a natural way to gather language data. Corpus construction usually involves “crawling, downloading, ‘cleaning’ and de-duplicating the data, then linguistically annotating it and loading it into a corpus query tool” (Kilgarriff, 2007). However, although text is ubiquitous on the Web, drawing accurate information from web pages can be difficult. In addition, the vastly increasing variety of corpora, text types and use cases makes it more and more difficult to assess the usefulness and appropriateness of certain web texts for given research objectives. As a result, content adequacy, focus and quality need to be evaluated after the downloads (Baroni et al., 2009).

A significant challenge lies in the ability to extract and pre-process web data to meet scientific expectations with respect to text quality. An essential operation in corpus construction consists in retaining the desired content while discarding the rest, a task carrying various names referring to specific subtasks or to pre-processing as a whole: web scraping, boilerplate removal, web page segmentation, web page cleaning, template extraction, or content extraction. This step is sometimes overlooked although it involves a series of design decisions and turning points in data processing. Depending on the purpose of data collection, adequate filtering and quality assessment can be crucial. It has a significant impact on a wide range of downstream applications like text analysis, information retrieval, link analysis, page adaptation to other terminals and screens, and especially natural language processing pipelines.

Another challenge is how to find one’s way through the Web, notably as linguistic data are gathered by running targeted web crawlers (Scannell, 2007). As web crawling involves discarding much of the downloaded content (Olston and Najork, 2010), link filtering and prioritization in particular can prove tricky in contexts where data collection is just the first step of a project, so that time resources for this task are scarce. Data collection approaches using the CommonCrawl¹ have flourished as they allow for faster download and processing by skipping (or more precisely outsourcing) the crawling phase. Apart from the fact that finding one’s “own” way through the Web can be preferable, such data should not be used without forethought and exhaustive filtering. Beside the discovery of relevant websites, a major issue consists in selecting appropriate content after download and processing (Schäfer et al., 2013), which can be complex due to unexpected machine-generated flaws and biases.

Finally, depending on the project’s jurisdiction, legal aspects of retrieving and granting access to web documents can be unclear or restrictive. Boundaries of copyright law are not clear when it comes to corpus building (De Clercq and Perez, 2010) so that some corpus infrastructure projects leave it to users to decide what to do from a copyright standpoint (Benko, 2016). Copyright and intellectual property rights usually do not apply to resources such as language models or n-grams (Buck et al., 2014), nor to shuffled sentences (Biemann et al., 2007). Web corpora focusing on manually selected sources under Creative Commons licenses have been built (Brunello, 2009; Lyding et al., 2014), although only a very small proportion of websites use them (Barbaresi and Würzner, 2014). Corpora based on machine-checked licenses have also been developed (Habernal et al., 2016), as well as systems to merge annotation with web parts from the CommonCrawl (Schäfer, 2016). Considering the progress of annotation tools, it can be easier to retrieve documents directly from the Web or from archives and to process them to one’s taste.

1.2 Research context

This effort is part of methods to derive information from web documents in order to build text databases for a lexicographic information platform (Geyken et al., 2017). Extracting and pre-processing web texts to the exacting standards of scientific research turned out to be a substantial challenge where existing open-source solutions were not entirely convincing in terms of accuracy, versatility, and ease of use. The current tool follows from earlier work on news and blog articles extraction (Barbaresi, 2015, 2016). Its packaging into a directly re-usable format generalizes the process and makes it available to the community; with thorough testing, it has also become much more robust and versatile.

1.3 Contributions

Distinguishing between a whole page and the page’s essential parts can help to alleviate many quality problems related to web text processing, notably by dealing with the noise caused by recurring elements (headers and footers, ads, links/blogroll, etc.). This can be particularly useful to de-duplicate recurring language samples. Tasks related to content extraction and language modeling also benefit from a cleaner text base. In the concrete case of linguistic and lexicographic research, it allows for content queries on meaningful parts of the documents.

The remainder of this article introduces a text extraction and web navigation tool published under open-source license. Its installation and use are straightforward, notably from Python and on the command-line. The software makes it easier to extract the main text, comments and metadata, while also providing building blocks for text discovery tasks such as web crawling. The following also includes a comparative evaluation of text extraction on real-world data. The contributions of this paper are thus threefold as it references the software, features a benchmark, and provides a fast, meaningful baseline for similar tasks.

2 State of the art

2.1 “A difficult IE problem”

Even before the “Web 2.0” paradigm with web pages assembling information from and for a variety of sources (notably the advertising industry), web pages have been known for their lack of focus on directly usable text content. Despite the quantity of pages following an article format where there is a main text to be found, web pages now accessible through archives cannot be expected to be easy to process: “Articles published on the WWW often contain extraneous clutter. Most articles consist of a main body which constitutes the relevant part of the particular page. [...] Identifying the main body of a web page in a general robust manner is a difficult information extraction problem.” (Finn et al., 2001)

Web pages come in different shapes and sizes mostly because of the wide variety of platforms and content management systems, and not least because of varying reasons to publish and diverging goals followed during web publication. Web page structure is also constantly evolving from the perspective of standards. HTML 5 was first released in 2008 to provide support for multimedia and graphical elements. This standard streamlined syntax while retaining backward-compatibility. Web content extraction is also an active field of research in user experience, resulting from the need for higher download and rendering speeds as well as from a growing amount of “Web bloat” requiring the development of “reader modes” and “distillers”² for web browsers (Ghasemisharif et al., 2019).

2.2 Wrappers

Data extraction was first based on “wrappers” (now called “scrapers”), which mostly relied on manual design and tended to be brittle and hard to maintain (Crescenzi et al., 2001). These extraction procedures were also used early on by blog search engines (Glance et al., 2004). Since the genre of “web diaries” was established before the blogs in Japan, there have been attempts to target not only blog software but also regular pages (Nanno et al., 2004), in which the extraction of metadata also allows for a distinction based on heuristics. Regarding metadata extraction for pages in article form and blogs in particular, common targets include the title of the entry, the date, the author, the content, the number of comments, the archived link, and the trackback link (Glance et al., 2004); they can also aim at comments specifically (Mishne and Glance, 2006).

2.3 Generic web content extraction

Generic extraction techniques are grounded on Document Object Model (DOM) examination. An earlier, language-independent approach uses entropy measures applied to features, links, and content in order to discriminate among parts of a web page (Kao et al., 2004). Another notable technique, Visual Page Segmentation, applies heuristics to find visually grouped blocks (Cai et al., 2003). Other methods are based on style tree induction, that is detection of similarities of DOM trees on site-level (Yi et al., 2003; Vieira et al., 2006). Overall, efforts made to automatically generate wrappers have been centered on three different approaches (Guo et al., 2010): wrapper induction (e.g. building a grammar to parse a web page), sequence labeling (e.g. labeled examples or a schema of data in the page), and statistical analysis. This approach combined with the inspection of DOM tree characteristics (Wang et al., 2009; Guo et al., 2010) is a common ground to the information retrieval and computational linguistics communities, with the categorization of HTML elements and linguistic features (Ziegler and Skubacz, 2007) for the former and boilerplate removal for the latter.

The DOM considers a given HTML document as a tree structure whose nodes represent parts of the document to be operated on. Text, tag and/or link density have proven to be good indicators in order to select or discard content nodes, using the cumulative distribution of tags (Finn et al., 2001), or with approaches such as the content extraction via tag ratios (Weninger et al., 2010) and the content extraction via text density algorithms (Sun et al., 2011). Statistical selection of informative nodes through a combination of both methods proved more efficient on comparable datasets (Qureshi and Memon, 2012). The large majority of DOM-based approaches try to leverage semantic information conveyed by HTML tags, notably paragraphs (p) on which text-to-tag ratios are calculated (Carey and Manic, 2016), or tag ratios and semantic features from id and class attributes (Peters and Lecocq, 2013).

Machine learning approaches have also been used, whose interest generally consists in leveraging advances in classification tasks by treating an HTML document as a series of blocks to be classified. Relevant algorithms include conditional random fields learning header, text, and noisy blocks with markup-based, content-based, and document-related features (Spousta et al., 2008), support vector machines trained on linguistic, structural and visual features (Bauer et al., 2007), Naive Bayes (Pasternack and Roth, 2009), multi-layer perceptron based on paragraph-level features (Schäfer and Bildhauer, 2012), or logistic regressions (Peters and Lecocq, 2013). More recently, deep learning has also been used for similar classifications, e.g. the Web2Text system is based on convolutional neural networks learning combinations of DOM-based features (Vogels et al., 2018).

Despite the number of articles on this topic, very few systems are open-source or freely available (Alarte et al., 2019).

2.4 Corpus linguistics and NLP

There are few comparable projects coming from the linguistics or natural language processing communities and focused on making software publicly available and usable. Boilerpipe uses shallow text features like word counts and link density with decision tree and SVM classifiers (Kohlschütter et al., 2010). JusText is based on length heuristics as well as link and stop word densities (Pomikálek, 2011). Both algorithms have been prevalent since their release and are now mostly used through their subsequent forks, as software needs to be kept up-to-date. More recent initiatives explicitly targeting corpus creation feature the Corpus Crawler³ or Texrex⁴ (Schäfer, 2017), neither of which appears to be actively maintained.

An evaluation and discussion following from the Cleaneval initiative (Baroni et al., 2008) would put the topic back into focus, as content processing on the Web is affected by both time and geography. This benchmark could be elaborated on: results are not consistent across languages, and metrics sometimes fail to capture the variable influence of extractors on downstream modules (Lejeune and Zhu, 2018). Often, tools are developed with particular page styles in mind, mostly from the English-speaking world (Barbaresi and Lejeune, 2020). For certain projects, customized scrapers which are adjusted to each website remain feasible (Krasselt et al., 2020). A generic approach can really save human time and resources, albeit at a certain cost in terms of accuracy depending on the context.

3 Introducing the Trafilatura tool

3.1 Features

Trafilatura is a web scraping tool for text discovery and retrieval which seamlessly downloads, parses, and scrapes web page data. It can crawl and discover texts within a website and process them accordingly. The extractor focuses on metadata, main body text and comments while preserving parts of the text formatting and page structure. It aims to be precise enough in order not to miss texts or to discard valid documents, as it must be robust but also reasonably fast. With these objectives in mind, Trafilatura is designed to run in production on millions of web documents.

The software features parallel online and offline processing: URLs, HTML files or parsed HTML trees can be used as input. Although straight output of Python variables is possible, conversion to various common output formats makes the software more versatile: plain text (minimal formatting), JSON (with metadata), XML and XML-TEI (for metadata and structure). The latter support for TEI format (following the recommendations of the Text Encoding Initiative) also includes a validator for Python which can be used apart from the extraction. The scraping and conversion parts also work with existing archives: raw HTML documents can be retrieved from sources such as the CommonCrawl⁵ or the Internet Archive⁶.

In addition, download utilities are included, notably using a multi-threaded but “polite” processing of URL queues, i.e. time restrictions based on domain names. Persistent connections are managed by a connection pool, thus maintaining connections with websites to be scraped. The tool also includes web crawling capacities which provide accessible and fail-safe ways to gather data based on a series of target sites. First, it supports sitemaps (XML and TXT formats) according to the sitemap protocol. Second, it supports web feeds (ATOM, RDF and RSS formats), which make it possible to build a seamless news crawler. Third, it provides crawling components to discover content. It can also manipulate URL lists, including filtering and prioritization based on site characteristics or language-aware heuristics based on internationalization.

The package provides a relatively light-weight and modular architecture, letting users choose the components they wish to include. It has been tested on Linux, MacOS and Windows, and can be used with Python, on the command-line, with R (using the reticulate adapter package), and through a graphical user interface. The package documentation⁷ also acts as a manual on web text collection.

3.2 Extraction process

The extraction combines two acknowledged libraries, readability-lxml⁸ and jusText⁹, which are used as safety nets and fallbacks. Trafilatura’s own extraction algorithm is based on a cascade of rule-based filters and content heuristics:

(1) Content delimitation is performed by XPath expressions targeting common HTML elements and attributes as well as idiosyncrasies of main content management systems, first in a negative perspective with the exclusion of unwanted parts of the HTML code (e.g. ) and next by centering on the desirable content (e.g. ). The same operations are performed for comments in case they are part of the extraction. The selected nodes of the HTML tree are then processed, i.e. checked for relevance (notably by element type, text length and link density) and simplified as to their HTML structure.

¹ https://commoncrawl.org
² https://chromium.googlesource.com/chromium/dom-distiller
³ https://github.com/google/corpuscrawler
⁴ https://github.com/rsling/texrex
⁵ https://commoncrawl.org/
⁶ https://archive.org/
⁷ https://trafilatura.readthedocs.io/
⁸ https://github.com/buriy/python-readability
⁹ https://github.com/miso-belica/jusText
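Step (1) of the extraction process in Section 3.2 can be illustrated with a minimal sketch that uses only the Python standard library. It is not Trafilatura's implementation: the path lists `UNWANTED` and `WANTED`, the two thresholds and all function names are assumptions chosen for illustration, and the input is expected to be well-formed XHTML because `xml.etree.ElementTree` only supports a subset of XPath on XML trees.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule sets and thresholds, for illustration only.
UNWANTED = [".//nav", ".//footer", ".//aside", ".//div[@class='sidebar']"]
WANTED = [".//article", ".//main", ".//div[@id='content']"]
MIN_TEXT_LENGTH = 25
MAX_LINK_DENSITY = 0.5


def normalized_text(element):
    """Text below an element with whitespace runs collapsed to single spaces."""
    return " ".join("".join(element.itertext()).split())


def link_density(element):
    """Share of the element's characters that sit inside <a> elements."""
    total = len(normalized_text(element))
    if total == 0:
        return 1.0
    linked = sum(len(normalized_text(a)) for a in element.iter("a"))
    return linked / total


def prune(tree, paths):
    """Negative step: remove every element matched by one of the paths."""
    parents = {child: parent for parent in tree.iter() for child in parent}
    for path in paths:
        for node in tree.findall(path):
            parent = parents.get(node)
            if parent is not None:
                parent.remove(node)


def extract_paragraphs(xhtml):
    """Prune unwanted parts, focus on a content node, then filter paragraphs."""
    tree = ET.fromstring(xhtml)
    prune(tree, UNWANTED)
    roots = [tree]  # fall back to the whole document if no rule matches
    for path in WANTED:
        found = tree.findall(path)
        if found:
            roots = found
            break
    kept = []
    for root in roots:
        for block in root.iter("p"):
            text = normalized_text(block)
            if len(text) >= MIN_TEXT_LENGTH and link_density(block) <= MAX_LINK_DENSITY:
                kept.append(text)
    return kept


SAMPLE = """<html><body>
<nav><p>Home | About | Contact and a few more navigation words</p></nav>
<article>
  <p>This paragraph carries the main text of the page and is long enough.</p>
  <p><a href="/a">Related story one</a> <a href="/b">Related story two</a></p>
  <p>Short.</p>
</article>
<footer><p>Copyright notice that is long enough to pass the length check.</p></footer>
</body></html>"""

print(extract_paragraphs(SAMPLE))
```

In this sample the navigation and footer are removed by the negative step, the list of related links is rejected for its link density, and the short paragraph is rejected for its length. Real-world HTML is rarely well-formed, so a production setting would need an HTML-tolerant parser.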
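The “polite” processing of URL queues mentioned in Section 3.1, i.e. time restrictions based on domain names, can be sketched as a small planning function. This is a simplified illustration under stated assumptions rather than the tool's actual scheduler: the function name `plan_downloads` and the `min_delay` parameter are hypothetical, and the function only computes start times instead of downloading anything.

```python
from urllib.parse import urlsplit


def plan_downloads(urls, min_delay=5.0):
    """Assign each URL the earliest start time (in seconds) that keeps
    requests to the same domain at least min_delay seconds apart."""
    next_slot = {}  # domain -> earliest time the next request may start
    plan = []
    for url in urls:
        domain = urlsplit(url).netloc
        start = next_slot.get(domain, 0.0)
        plan.append((start, url))
        next_slot[domain] = start + min_delay
    # sorted() is stable, so URLs sharing a start time keep their input order.
    return sorted(plan, key=lambda item: item[0])


queue = [
    "https://a.org/1",
    "https://a.org/2",
    "https://b.org/1",
    "https://a.org/3",
]
for start, url in plan_downloads(queue):
    print(f"{start:5.1f}s  {url}")
```

Requests to different domains can start at the same time, which is what leaves room for multi-threading, while requests to the same domain are spaced out. A worker pool could consume such a plan by waiting until each start time before issuing the request.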
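The sitemap support described in Section 3.1 (XML and TXT formats) can be illustrated by a short link-gathering sketch based on the standard library. The function name `links_from_sitemap` and the sample data are assumptions made for this example; the namespace is the one defined by the sitemap protocol, and nothing here reflects how Trafilatura itself parses sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol, in ElementTree's {uri}tag notation.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def links_from_sitemap(content):
    """Return the URLs listed in a sitemap given as XML or as plain text."""
    content = content.strip()
    if content.startswith("<"):
        root = ET.fromstring(content)
        # <loc> elements occur in both <urlset> and <sitemapindex> documents.
        return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
    # TXT format: one URL per line, anything else is skipped.
    return [
        line.strip()
        for line in content.splitlines()
        if line.strip().startswith(("http://", "https://"))
    ]


XML_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/page-1</loc></url>
  <url><loc> https://example.org/page-2 </loc></url>
</urlset>"""

TXT_SITEMAP = "https://example.org/a\n\nnot a link\nhttps://example.org/b\n"

print(links_from_sitemap(XML_SITEMAP))
print(links_from_sitemap(TXT_SITEMAP))
```

The resulting URL lists are the kind of input that subsequent filtering and prioritization steps would operate on.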