A Dataset of Latin Etymologies Extracted from Wiktionary
Conference proceeding, open access, peer reviewed

Javier de Torres, Marco Passarotti, Giovanni Moretti, Francesco Mambrini and Matteo Pellegrini
Proceedings of the 11th Edition of the Swiss Text Analytics Conference, pp. 226-233
10/06/2026

Abstract

We present a curated resource of Latin etymologies automatically extracted from Wiktionary, enriched with links to the LiLa Knowledge Base of Latin and modelled as RDF triples using the LemonEty ontology. We also present the Python pipeline used to generate the data, which can be reused to extract Wiktionary etymologies for other languages. The etymology chains cover Latin words and their attested or reconstructed ancestors in languages such as Proto-Indo-European, Proto-Italic, Ancient Greek, Hebrew, and Egyptian. To address the structural noise and editorial heterogeneity of Wiktionary etymology data, we introduced strict rule-based filters throughout the pipeline, especially in the curation stage. After validation, the resulting dataset contains etymological chains for 9,684 lemmas, which can support research in historical linguistics and computational etymology, language learning, and other applications.
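As a rough illustration of the kind of extraction and rule-based filtering the abstract describes, the sketch below pulls etymology links out of a Wiktionary-style wikitext snippet and orders them into a chain. It is not the authors' pipeline: the function names `extract_links` and `build_chain`, the filter rules, and the sample wikitext are all assumptions made for this example, and it handles only the `inh`, `der` and `bor` templates. Linking to LiLa and modelling the result as RDF with LemonEty are not covered.

```python
import re

# Matches Wiktionary etymology templates such as {{inh|la|itc-pro|*form}}.
# Only inh (inherited), der (derived) and bor (borrowed) are handled here.
TEMPLATE_RE = re.compile(r"\{\{(inh|der|bor)\|([^{}]*)\}\}")


def extract_links(wikitext, target_lang="la"):
    """Return (relation, source_lang, source_form) tuples found in wikitext."""
    links = []
    for relation, body in TEMPLATE_RE.findall(wikitext):
        # Drop named parameters (e.g. t=father); keep positional ones only.
        args = [a.strip() for a in body.split("|") if "=" not in a]
        # Illustrative filter: require the expected target language,
        # a source language, and a non-empty source form.
        if len(args) < 3 or args[0] != target_lang or args[2] in ("", "-"):
            continue
        links.append((relation, args[1], args[2]))
    return links


def build_chain(lemma, wikitext, target_lang="la"):
    """Chain the lemma to its ancestors in the order they appear in the text."""
    chain = [(target_lang, lemma)]
    for _, lang, form in extract_links(wikitext, target_lang):
        if (lang, form) not in chain:  # skip repeated mentions
            chain.append((lang, form))
    return chain


# Hypothetical wikitext, written for this example only.
sample = (
    "From {{inh|la|itc-pro|*patēr}}, from {{inh|la|ine-pro|*ph₂tḗr|t=father}}. "
    "Compare {{cog|grc|πατήρ}}."
)
chain = build_chain("pater", sample)
for lang, form in chain:
    print(lang, form)
```

Here the `cog` (cognate) template is ignored because it is not one of the three handled relations, and templates with a missing form or a different target language are dropped by the filter. A production pipeline would need far more rules than this to cope with the editorial variation the abstract mentions.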
PDF: De Torres et al. (2026), Author's Accepted Manuscript, Open Access (175.48 kB)
Conference website: https://www.swisstext.org/
