A Dataset of Latin Etymologies Extracted from Wiktionary

Javier de Torres; Marco Passarotti; Giovanni Moretti; Francesco Mambrini; Matteo Pellegrini

Conference proceeding

Open access

Peer reviewed

Proceedings of the 11th Edition of the Swiss Text Analytics Conference, pp.226-233

10/06/2026

Abstract

We present a curated resource of Latin etymologies automatically extracted from Wiktionary, enriched with links to the LiLa Knowledge Base of Latin and modelled as RDF triples using the LemonEty ontology. We also present the Python pipeline used to generate the data, which can be reused to extract Wiktionary's etymologies for other languages. The etymology chains cover Latin words and their attested or reconstructed ancestors in languages such as Proto-Indo-European, Proto-Italic, Ancient Greek, Hebrew, and Egyptian. To address the structural noise and editorial heterogeneity of Wiktionary etymology data, we introduced strict rule-based filters throughout the pipeline, especially in the curation stage. After validation, the resulting dataset contains etymological chains for 9,684 lemmas, which can support research in Historical Linguistics, Computational Etymology and language learning, among other applications.
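The kind of rule-based curation mentioned in the abstract could look roughly like the sketch below. It is only an illustration under stated assumptions: the `Etymon` class, the `clean_chain` function, and the three rules are hypothetical and are not taken from the authors' pipeline, which the abstract does not describe in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Etymon:
    """One step in an etymology chain (hypothetical structure)."""
    language: str  # e.g. "Proto-Italic"
    form: str      # e.g. "*patēr"


def clean_chain(chain):
    """Apply three illustrative (assumed) rule-based filters to one chain."""
    cleaned = []
    for etymon in chain:
        form = etymon.form.strip()
        # Rule 1 (assumed): drop steps that carry no usable form.
        if not form or form == "-":
            continue
        # Rule 2 (assumed): reconstructed proto-language forms get an asterisk.
        if etymon.language.startswith("Proto-") and not form.startswith("*"):
            form = "*" + form
        candidate = Etymon(etymon.language, form)
        # Rule 3 (assumed): skip a step identical to the one just kept.
        if cleaned and cleaned[-1] == candidate:
            continue
        cleaned.append(candidate)
    return cleaned


# A noisy chain of the sort heterogeneous editing might produce.
chain = [
    Etymon("Latin", "pater"),
    Etymon("Proto-Italic", "patēr"),        # asterisk missing
    Etymon("Proto-Italic", "*patēr"),       # duplicate once normalised
    Etymon("Proto-Indo-European", ""),      # empty form
    Etymon("Proto-Indo-European", "*ph₂tḗr"),
]

for step in clean_chain(chain):
    print(step.language, step.form)
```

Keeping each rule as a separate, commented check makes it easy to audit which filter discarded or altered a given step, which matters when the input is as editorially uneven as the abstract suggests.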

Files and links (2)

pdf: De Torres et al. (2026), 175.48 kB, Author's Accepted Manuscript, Open Access

url: https://www.swisstext.org/ (conference website)

Details

Title: A Dataset of Latin Etymologies Extracted from Wiktionary
Creators: Javier de Torres (Author)
Marco Passarotti (Author) - Università Cattolica del Sacro Cuore
Giovanni Moretti (Author) - Università Cattolica del Sacro Cuore
Francesco Mambrini (Author) - Università Cattolica del Sacro Cuore
Matteo Pellegrini (Author) - University of Surrey, Literature and Languages
Publication Details: Proceedings of the 11th Edition of the Swiss Text Analytics Conference, pp.226-233
Publisher: Association for Computational Linguistics (ACL)
Identifiers: 991140196402346
Academic Unit: Literature and Languages
Language: English
Resource Type: Conference proceeding
