Lexibench: Towards an Improved Collection of Benchmark Data for Computational Historical Linguistics

Authors

DOI:

https://doi.org/10.15475/calcip.2025.1.2

Keywords:

dataset, computational historical linguistics, benchmark data, cognate detection

Abstract

Computational approaches in historical linguistics have made great progress during the past two decades. It is now much more common to propose subgroupings based on phylogenetic analyses than on traditional arguments from shared innovations. We have also seen a drastic increase in openly available datasets that share cognate judgments for various language families. Thanks to new standardization efforts that facilitate access to several dozen comparative wordlists, the time seems right to work on improved benchmarks of manually annotated cognates in computational historical linguistics. This study undertakes a first effort of this kind by presenting Lexibench, a preliminary gold standard for computational historical linguistics. Lexibench builds on the Lexibank repository to extract 63 multilingual wordlists, all manually annotated for cognacy, that can be used to assess the quality of cognate detection and phylogenetic reconstruction methods.

Published

2025-02-24

How to Cite

Häuser, L., & List, J.-M. (2025). Lexibench: Towards an Improved Collection of Benchmark Data for Computational Historical Linguistics. Computer-Assisted Language Comparison in Practice, 8(1). https://doi.org/10.15475/calcip.2025.1.2