Computer-Assisted Language Comparison in Practice: Tutorials on Computational Approaches to the History and Diversity of Languages

Towards a Specification for the Formal Notation of Sound Laws

Johann-Mattis List — 2026-07-15

Formal notations for the handling of sound laws in historical linguistics have been employed by linguists for a long time. So far, however, linguists have barely tried to describe their notation practices formally. As a result, a lot of variation can be found in the practice of sound law notation. This specification can be seen as a first attempt to provide an exhaustive documentation for a particular sound law notation scheme. This notation scheme accompanies an online tool in which sound laws can be applied to actual words in order to test the consequences of sound change in practice.

Integrating Croatian into Concepticon: a Corpus-Based Frequency Mapping of Croatian Vocabulary

Anja Krišto — 2026-06-08

This study presents a Croatian frequency-derived wordlist mapped to Concepticon concept sets, based on the most frequent nouns, verbs, and adjectives extracted from the hrWaC web corpus. The resulting dataset connects corpus-based Croatian vocabulary to Concepticon's cross-linguistic framework and includes lexicalizations from nine additional languages for each mapped item.

How To Visualize Language Polygons with QGIS (How to Do X in Linguistics 15)

Frederic Blum — 2026-05-04

This tutorial shows how to build beautiful maps using polygon data of language distributions and the geospatial software QGIS. For this purpose, I show in the first part how to extract a list of Glottocodes from CLDF datasets and how to use command-line tools to extract the polygon data from Glottography datasets. The second part focuses on the visualization of the extracted data with QGIS. In this part, I also show how to include non-linguistic data like archaeological sites and how to prepare the map for printing. The tutorial is fully implemented with free and open software and intends to make map-making accessible for linguists and other researchers.

CLICS⁴ as CLLD Web Application

Annika Tjuka — 2026-04-29

The fourth version of the Database of Cross-Linguistic Colexifications (https://clics.clld.org) was published last year. We have now launched the accompanying web application that presents the data for interactive inspection and exploration. The study presents the application in brief and also discusses the development of cross-linguistic colexification databases in the future.

Foundations of Formal Etymological Analysis

Johann-Mattis List — 2026-03-24

This study gives a brief overview of formal aspects of etymological analysis by providing a modified workflow for the classical comparative method in historical language comparison. This workflow is contrasted with the current state of the art in computational historical linguistics, pointing out where computational methods and interactive tools for annotation are lacking, and where they are available already.

Computing Detailed Colexifications with Missing Data Information from the CLICS⁴ Collection

David Snee — 2026-02-23

CLICS⁴ offers a refined structural representation of cross-linguistic colexification patterns but retains an implicit representation of missing data. This obscures whether the lack of a colexification in a language for paired concepts is due to its true absence in the language, or due to missing data on the concept or word form level. We introduce a straightforward workflow that can be applied to individual datasets from CLICS⁴ to identify cases of colexification via a three-way attestation scheme. Our approach captures the presence or absence of a colexification in CLICS⁴, but it also explicitly encodes the presence or absence of data at the level of the original questionnaire, or the individual language, elicited with the help of the questionnaire.
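
The distinction between an absent colexification and missing data can be sketched in a few lines of Python. This is a minimal illustration only, not the workflow of the study: the function name, the three labels, and the toy forms are assumptions made for this example, and the sketch does not separate questionnaire-level gaps from language-level gaps.

```python
def attestation(forms_by_concept, concept_a, concept_b):
    """Label a concept pair in one language (hypothetical three-way scheme).

    'present': both concepts have forms and share at least one form
    'absent':  both concepts have forms, but none is shared
    'missing': at least one concept has no form in the data
    """
    forms_a = forms_by_concept.get(concept_a)
    forms_b = forms_by_concept.get(concept_b)
    if not forms_a or not forms_b:
        return "missing"
    return "present" if set(forms_a) & set(forms_b) else "absent"


# Invented toy wordlist for a single language (not real data).
language = {"TREE": ["taka"], "WOOD": ["taka", "puri"], "FIRE": ["sel"]}

print(attestation(language, "TREE", "WOOD"))   # shared form
print(attestation(language, "TREE", "FIRE"))   # both attested, no shared form
print(attestation(language, "TREE", "WATER"))  # WATER has no form in the data
```

Keeping "missing" apart from "absent" means that later counts of non-colexification only include pairs for which both concepts are actually attested.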

Transparent Application of Text Generation Tools in Scientific Research

Johann-Mattis List — 2026-01-26

In this opinion piece, I share my view on the application of language models and text generation services in scientific research. In my opinion, scientific research that lives up to the promises of open science must provide full documentation of all prompts and exchanges that were used to create a given study. A mere mention that AI tools have been used in study design, writing, or coding is not enough.

Towards a Unified Conversion Table for Semitic Transcriptions and Transliterations

Carlo Meloni — 2025-12-17

In this study we present a preliminary conversion table that can be used for transcriptions and transliterations across different Semitic languages. We introduce the basic idea behind the table, show how it can be used, and explain how we hope to expand it in the future.

Standardizing Phonetic Transcriptions for Kitchen et al.'s Comparative Wordlist on Semitic Languages with Language-Specific Orthography Profile

Ben Sapirstein — 2025-11-10

Comparative wordlists are a fundamental tool for tracing language history, allowing us to see how languages are related, much like biologists use DNA to infer phylogenies of species. When linguists compile data from different sources, scholars often code lexical data differently, using individual transcription systems that cannot be directly compared with each other. In order to make such data comparable, individual transcription systems must be unified to reflect a common standard. This study illustrates how such unification can be done by taking a particularly diverse dataset on Semitic languages as an example and illustrating how transcriptions for individual language varieties can be harmonized as part of the general standardization workflow proposed by the Cross-Linguistic Data Formats initiative.

Manipulating Lexical Forms with the PyLexibank FormSpec

Johann-Mattis List — 2025-10-28

Multilingual lexical data is typically stored in a wide variety of forms, based on many idiosyncratic decisions that vary from dataset to dataset. Here, a simple but efficient solution for the manipulation of lexical data in multilingual wordlists will be introduced. This solution, the PyLexibank FormSpec, was originally developed for the conversion of various kinds of lexical data to Cross-Linguistic Data Formats, but it can also be used as a standalone. This study offers a basic tutorial that illustrates how the FormSpec can be put to concrete use.

Integrating Semantic Embeddings into NoRaRe

Arne Rubehn — 2025-09-17

This study illustrates how semantic embeddings can be added to and retrieved from NoRaRe. By that, it provides a template for handling vector data and makes popular methodology in semantic modeling available for cross-linguistic comparison.

Illustrating Data Curation in NoRaRe with the Help of Templates

Johann-Mattis List — 2025-08-25

This study introduces a collection of templates that can be used to contribute data to the Database of Norms, Ratings, and Relations (NoRaRe) of words and concepts. The templates are intended to facilitate the process of dataset conversion and serve as a starting point for those who are interested in contributing data to the catalog. A first template structure with two sample datasets is introduced and discussed in more detail, pointing to those aspects of data curation that may lead to confusion among users who contribute to the NoRaRe database for the first time.

Digitizing Legacy Lexical Data of Muishaung for Computer-Assisted Language Comparison

Kellen Parker van Dam — 2025-07-23

This study describes the process of digitizing legacy materials into a computer-readable format for the purposes of computational typology and computer-assisted historical reconstruction. It presents a comparative wordlist that is made available in the formats recommended by the Cross-Linguistic Data Formats initiative.

Handling Non-Standard Datasets in NoRaRe: A Practical Guide

Mira Ahmedović — 2025-03-12

NoRaRe, the Database of Cross-Linguistic Norms, Ratings, and Relations, is a resource that curates multiple datasets containing information on various properties of words and concepts. When researchers contribute their data, the format and structure can vary widely, presenting challenges for seamless integration. Here, I offer practical guidance for addressing common issues such as data being placed in different sheets, headers in unexpected rows, or datasets contained within zip-files. The strategies shared here offer a foundational approach to understanding and adapting NoRaRe’s flexibility to accommodate the idiosyncrasy of each dataset.

Extracting Transparent Compounds from Lexibank

Johann-Mattis List — 2025-05-26

Many languages make use of transparent compounding processes in order to express certain words in their lexicon. With time, these processes can lose their transparency, making them hard to detect automatically. With large data collections, simple tests can be designed to detect transparent compounds and investigate their distribution. This study illustrates how a very rudimentary analysis of cross-linguistically recurring transparent compound patterns can be applied to Lexibank data with a few lines of Python code.
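
One very simple test of this kind can be sketched as follows. The sketch is an assumption about what such a test might look like, not the analysis of the study: it treats a word as a transparent compound if its form is the exact concatenation of the forms for two other concepts in the same language. The function name and the toy wordlist are invented, and the sketch does not read Lexibank data.

```python
from itertools import permutations


def transparent_compounds(wordlist):
    """Return (compound, first_part, second_part) concept triples.

    `wordlist` maps concepts to one form each. A concept counts as a
    transparent compound if its form equals the forms of two other
    concepts joined together (a deliberately naive criterion).
    """
    found = []
    for (c1, f1), (c2, f2) in permutations(wordlist.items(), 2):
        for concept, form in wordlist.items():
            if concept not in (c1, c2) and form == f1 + f2:
                found.append((concept, c1, c2))
    return found


# Invented toy wordlist for one language (not real data).
toy = {"EYE": "mi", "WATER": "na", "TEAR": "mina", "STONE": "kot"}

print(transparent_compounds(toy))  # [('TEAR', 'EYE', 'WATER')]
```

Counting how often the same triple recurs across many languages would then point to cross-linguistically recurring compound patterns.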

PyLexibench — Generating Data for Lexibench with a Python Package

Luise Häuser — 2025-04-22

With PyLexibench we introduce a small Python package that can be used to populate the Lexibench benchmark for computational historical linguistics with benchmark data. Here, we introduce the package and show how it helps to access and expand Lexibench. We also introduce new data for character matrices in various forms and formats and lay out how we intend to use the package to manage Lexibench releases in the future.

How to Run EDICTOR 3 Locally

Frederic Blum — 2025-01-27

EDICTOR3 offers many ways of comparing language data with computer-assisted methods. This study offers a short overview of how to run EDICTOR3 locally, without the need for uploading the data to a server or being connected to the internet, while maintaining all the functionalities. In a first step, we will show how one can download a Lexibank dataset and create different types of files that one can use with EDICTOR. We will then proceed to present the possibility of running an EDICTOR server locally and to edit the dataset that one has downloaded.

Lexibench: Towards an Improved Collection of Benchmark Data for Computational Historical Linguistics

Luise Häuser — 2025-02-24

Computational approaches in historical linguistics have made great progress during the past two decades. As of now, it is much more common to propose subgroupings based on phylogenetic analyses than on traditional considerations using shared innovations. We have also seen a drastic increase in openly available datasets that share cognate judgments for various language families. Thanks to new standardization efforts providing facilitated access to several dozen comparative wordlists, it seems about time to work on improved benchmarks of manually annotated cognates in computational historical linguistics. In this study, a first effort of this kind is undertaken, by presenting Lexibench, a preliminary gold standard for computational historical linguistics. Lexibench builds on the Lexibank repository to extract 63 multilingual wordlists, all manually annotated for cognacy, that can be used to assess the quality of cognate detection and phylogenetic reconstruction methods in computational historical linguistics.

Making a Lexibank Dataset from Lee’s “Phonological Features of Caijia” from 2023

Johann-Mattis List — 2025-06-25

Caijia is a very interesting Sino-Tibetan language variety. It has been documented only recently and seems to belong to the Sinitic branch of Sino-Tibetan, but it shows some archaic features that have led to some controversies among scholars regarding its proper affiliation. Detailed comparative analyses of the language in comparison with other Sino-Tibetan languages are still in their infancy. This little study demonstrates how a first published wordlist of Caijia (Lee 2023) can be prepared for inclusion in the Lexibank repository.

Typing Special Characters as a Key Skill for Linguists

Johann-Mattis List — 2024-11-04

Most linguists regularly have to type special characters that are not available on an ordinary keyboard. Reflecting on the general problems involved in typing special characters, I review different solutions and argue that linguists should not only be able to type special characters on their computers, but that they should also have some basic knowledge about their technical aspects and know how to expand and customize them. In order to improve the training of young scholars, it is important to discuss special character typing more openly in linguistics, especially in the classroom and with doctoral students, sharing individual solutions openly.

Generating Phonological Feature Vectors with SoundVectors and CLTS

Arne Rubehn — 2024-08-05

The recently published Python library soundvectors offers a simple and robust method to derive phonological feature vectors for any valid IPA sound via its canonical description. It is designed to interact neatly with the Cross-Linguistic Transcription Systems reference catalog (CLTS), which dynamically parses valid strings in phonetic transcription to describe speech sounds. This study illustrates how both systems can be used together to generate phonological feature vectors for all kinds of sounds without relying on a previously defined lookup table. Additionally, it compares the generated feature vectors with those obtained from two other prominent databases, PanPhon and PHOIBLE, showing how those systems can be accessed from the CLTS data via its Python API pyclts.

Preparing Acoustic Pitch Data for Computational Analysis and Presentation

Kellen Parker van Dam — 2024-10-07

Pitch plays an important role in many linguistic systems. It is the primary set of features which determine vowel quality distinctions as well as forming the basis for intonation and contrastive tone systems. Unfortunately, much of the literature has relied on approaches to presenting and analysing pitch data that can result in a lack of data transparency, reproducibility, and analytical robustness. These issues are easily solved through the selection of a more appropriate scale for pitch values. This study presents the issues with using raw pitch data as Hertz values, some historical efforts to resolve these issues, and two more appropriate solutions than some of the more widely used systems, with a way to easily calculate these alternative systems in a short Python script.
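
The abstract does not name the two alternative scales. As an illustration of what a rescaling of Hertz values can look like, the following sketch converts frequencies to semitones relative to a reference frequency, a common logarithmic pitch scale. The choice of scale, the function name, and the default reference of 100 Hz are assumptions made for this example and may differ from the solutions presented in the study.

```python
import math


def hertz_to_semitones(f_hz, ref_hz=100.0):
    """Convert a frequency in Hertz to semitones relative to `ref_hz`.

    One octave (a doubling of frequency) corresponds to 12 semitones.
    The 100 Hz default is an arbitrary reference chosen for illustration.
    """
    return 12.0 * math.log2(f_hz / ref_hz)


# Equal Hertz steps are not equal steps on the logarithmic scale.
for f in (100.0, 150.0, 200.0):
    print(f, round(hertz_to_semitones(f), 2))
```

On this scale, the distance between two pitch values depends on their ratio rather than their difference in Hertz, so the same interval gets the same value at any reference frequency.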

Using CLDFBench and PyLexibank on Windows

David Snee — 2024-12-18

Using tools such as CLDFBench and PyLexibank, datasets can be converted into Cross-Linguistic Data Formats (CLDF), offering a standardized and interoperable representation of linguistic data. While these tools are powerful, lifting datasets to CLDF can present unique challenges for Windows users due to idiosyncrasies in the Windows operating system. Although CLDFBench and PyLexibank are compatible with Windows, certain workarounds may be necessary to address system-specific issues. This guide aims to demonstrate how CLDFBench and PyLexibank can be effectively installed and used on a Windows computer to lift a dataset to CLDF.

Converting an Artificial Proto-Language into Data for Testing Computational Approaches in Historical Linguistics

Johann-Mattis List — 2024-07-17

This small study shows how data for an artificially created language that was supposed to reflect features of "proto-languages", predating modern languages by several thousand years, can be used in testing computational approaches in historical linguistics. In order to do so, a computational workflow is described that retrieves the data automatically, creates a comparative wordlist compatible in format with software tools for historical linguistics, and then uses a baseline method for automatic cognate detection to compare an artificial language against a sample of Indo-European languages. The results show that artificial languages might help to fill a gap in testing that has so far been ignored in the literature.

Adding Standardized Transcriptions to Panoan and Tacanan Languages in the Intercontinental Dictionary Series

John Miller — 2024-09-02

In this study, we illustrate how standardized phonetic transcriptions can be added to the data for Panoan and Tacanan languages provided by the Intercontinental Dictionary Series. The result is presented as a new dataset that keeps reference to the original data and adds phonetic transcriptions for each word form in Panoan languages, Tacanan languages, as well as Spanish and Portuguese.

Past and Future of Computer-Assisted Language Comparison in Practice

Johann-Mattis List — 2024-01-17

Our blog "Computer-Assisted Language Comparison in Practice" goes into its seventh year. We reflect on the role the blog played in the past and present new goals and concrete ideas for the future. The most drastic innovation we initiated is to turn the blog into an open journal, which means that all future and successively also past contributions will be archived in PDF format with digital object identifiers.

How to Visualize Colexification Networks in Cytoscape (How to Do X in Linguistics 14)

Annika Tjuka — 2024-02-19

The ability to visualize data in an intelligible way is an important skill for scientists. In linguistics, especially in lexical semantics, data are often visualized using graphs, i.e., networks. For example, in the web app for the Database of Cross-Linguistic Colexifications (CLICS), we use networks to illustrate that a lexical form refers to two different concepts by connecting the concepts (i.e., nodes) with a line (i.e., edge). When identifying the colexifications between concepts across a large number of languages, the network grows and a tool to visualize multiple data points becomes necessary. Here, I present a tutorial for the first steps to visualize a colexification network with Cytoscape. The tutorial is intended for beginners who want to learn how the tool works and serves as a starting point for further skill development.

A New Python Library for the Manipulation and Annotation of Linguistic Sequences

Robert Forkel — 2024-03-25

The Python package linse (https://pypi.org/project/linse) offers various methods for the manipulation and annotation of sequences. In this short overview, we summarize its major functionalities and provide some information on its background and how we intend to develop it further in the future.

Implementing Fuzzy Spelling Search in Dictionaries of Under-Described Languages Lacking Standard Orthographies

Kellen Parker van Dam — 2024-05-27

Non-standard orthographies are common in the world of under-described language documentation. Whether they are semi-conventionalised community spellings, orthographies partially adopted from missionary works, or hastily transcribed texts representing as-yet uncertain phonologies, there is a need to be able to work through lexical data in a way which can accommodate and respond to such non-standard transcriptions. Here, a few options are considered, and a solution for fuzzy string matching based on attested variations is presented.
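
One way to match spellings on the basis of attested variations can be sketched as follows. This is only an assumed, simplified reading of the idea, not the solution of the study: each attested spelling variant is mapped to a canonical spelling, and two words match if their normalized spellings are identical. The function names, the variant table, and the headwords are all invented for this example.

```python
def normalize(word, variants):
    """Replace attested spelling variants with a canonical spelling.

    Longer variants are replaced first so that shorter ones do not
    break them apart.
    """
    for variant, canonical in sorted(variants.items(), key=lambda kv: -len(kv[0])):
        word = word.replace(variant, canonical)
    return word


def fuzzy_lookup(query, headwords, variants):
    """Return all headwords whose normalized spelling matches the query."""
    key = normalize(query.lower(), variants)
    return [h for h in headwords if normalize(h.lower(), variants) == key]


# Hypothetical variant table: variant spelling -> canonical spelling.
variants = {"ph": "f", "aa": "a"}

print(fuzzy_lookup("phaa", ["fa", "pa"], variants))  # ['fa']
```

Because the variant table is plain data, it can be extended whenever a new spelling variation is attested in the sources.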

Representing the Database of Semantic Shifts by Zalizniak et al. from 2024 in Cross-Linguistic Data Formats

Katja Bocklage — 2024-04-24

In this brief study, we show how the Database of Semantic Shifts, a large resource on semantic change and semantic motivation, can be represented in Cross-Linguistic Data Formats. The representation allows for a convenient quantitative analysis of the numerous annotations on semantic change and semantic motivation and for the integration of the database with additional resources on semantic change and semantic motivation that have been compiled independently in recent years.