INEL Dolgan Speech Corpus
License:
CC-BY-NC-SA-4.0
Steward:
Institute of Finno-Ugric/Uralic Studies, University of Hamburg
Task: ASR
Release Date: 3/24/2026
Format: TSV, MP3
Size: 583.34 MB
Description
This dataset is a machine-learning-ready subset of the INEL Dolgan Corpus (Version 2.0), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training. It translates the highly detailed EXMARaLDA XML annotations into the standard tabular layout used by Mozilla Common Voice. The dataset comprises 13 hours and 5 minutes of sentence-aligned supervised speech data (10,609 individual clips) from recordings made between the 1970s and 2017. It includes demographic metadata where available and prioritizes Cyrillic transcriptions, falling back to the Latin or phonological tiers so that every clip has text for acoustic modeling.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
The source corpus (INEL Dolgan Corpus Version 2.0) must be cited in all derivative works and research using this dataset.
Forbidden Usage
None
Metadata
Overview
This dataset is a machine-learning-ready subset of the INEL Dolgan Corpus (Version 2.0), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training under the Mozilla Data Collective initiative. It translates the highly detailed EXMARaLDA XML annotations into the standard tabular layout utilized by Mozilla Common Voice.
The source material encompasses recordings spanning from the 1970s to 2017, including archival audio from the Taymyr House of Folk Art (TDNT) in Dudinka, as well as multiple linguistic fieldwork expeditions.
About the Dolgan People and Language
The Dolgans are the northernmost Turkic-speaking people in the world, primarily inhabiting the Taymyr Peninsula in the Russian Arctic. Their language, Dolgan, is closely related to Yakut (Sakha) but developed its own distinct characteristics due to geographic isolation and heavy linguistic influence from neighboring Evenki populations. With only a few thousand remaining speakers, Dolgan is considered a highly endangered language.
Beyond cultural preservation, this corpus provides clean, structured data designed to catalyze machine learning and natural language processing (NLP) research for low-resource languages. By making these recordings machine-learning-ready, we hope to enable researchers to train acoustic models, build speech technologies, and develop digital tools that will empower the community to actively learn, use, and revitalize the Dolgan language in the modern digital landscape.
Statistics
Language: Dolgan (dlg)
Total Audio Clips: 10609 sentences
Total Audio Duration: 13:05:27 (HH:MM:SS)
Audio Format: MP3
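As a quick sanity check on these figures, the mean clip length can be derived from the total duration and the clip count. The snippet below is only an illustrative calculation; the helper name hms_to_seconds is hypothetical and not part of the dataset tooling.

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures taken from the statistics listed above.
total_seconds = hms_to_seconds("13:05:27")
clip_count = 10609

mean_clip_seconds = total_seconds / clip_count
print(total_seconds, round(mean_clip_seconds, 2))  # prints: 47127 4.44
```

An average of roughly 4.4 seconds per clip is consistent with sentence-level segmentation.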
Data Format
The dataset provides audio clips aligned at the sentence level alongside a train.tsv metadata file.
The train.tsv contains the following Common Voice conforming fields:
client_id: Unique speaker abbreviation.
path: The filename of the corresponding audio clip.
sentence: The primary transcribed text of the audio.
age: The calculated age of the speaker at the time of recording.
gender: male or female.
accents: Specific dialect information (e.g., Upper/Lower dialect).
locale: Locale code (dlg).
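A minimal sketch of reading these fields with Python's standard csv module is shown below. It assumes the file is UTF-8, tab-separated, has a header row with the column names listed above, and uses no quoting (transcriptions may contain literal double quotes). The function name read_clips and the sample row are invented for illustration.

```python
import csv
import io


def read_clips(tsv_file):
    """Yield one dict per clip from an open, tab-separated text file."""
    # Assumption: fields are not quoted, so quote characters inside
    # transcriptions are treated as ordinary text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# In-memory stand-in for train.tsv; every value here is made up.
sample = io.StringIO(
    "client_id\tpath\tsentence\tage\tgender\taccents\tlocale\n"
    "SPK1\tclip_0001.mp3\texample \"text\"\t54\tfemale\tUpper\tdlg\n"
)
clips = list(read_clips(sample))
print(clips[0]["path"], clips[0]["sentence"])
```

For the real file, the same function could be called on a handle opened with open("train.tsv", encoding="utf-8", newline="").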
Understanding the Text Columns
The original corpus contains multiple transcription tiers. To ensure every audio clip has a primary transcription for STT training, the sentence column is populated using a strict fallback hierarchy; the first available tier in the following order is used:
st: Source transcription (Cyrillic orthography).
stl: Source transcription (Latin script).
ts: Phonological transcription (IPA-like characters).
Because the sentence field can contain Cyrillic, Latin, or Phonological text depending on the original file's availability, the raw st, stl, and ts columns are also preserved in the TSV. To programmatically determine which alphabet/tier was used for the sentence column on any given row, consumers can simply test for equality: sentence == st, sentence == stl, or sentence == ts.
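The fallback and the equality test described above can be sketched as follows. This is an illustration, not the pipeline used to build the dataset: it assumes each row is a dict (for example from csv.DictReader) and that a missing tier is stored as an empty string. The function names pick_sentence and sentence_tier are hypothetical.

```python
TIER_ORDER = ("st", "stl", "ts")  # Cyrillic, then Latin, then phonological


def pick_sentence(row):
    """Return the first non-empty tier value, following the fallback order."""
    for tier in TIER_ORDER:
        value = row.get(tier, "")
        if value:
            return value
    return ""


def sentence_tier(row):
    """Return the name of the tier that matches `sentence`, or None."""
    sentence = row.get("sentence", "")
    for tier in TIER_ORDER:
        value = row.get(tier, "")
        # Skip empty tiers so an empty sentence never "matches" one.
        if value and value == sentence:
            return tier
    return None


# Made-up row in which only the Latin and phonological tiers are present.
row = {"st": "", "stl": "latin text", "ts": "phonological text"}
row["sentence"] = pick_sentence(row)
print(sentence_tier(row))  # prints: stl
```

Checking the tiers in hierarchy order also gives a deterministic answer in the unlikely case that two tiers hold identical text.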
Note: To maintain a streamlined, single-task focus on Speech-to-Text modeling, extensive linguistic annotations such as morphological glossing and secondary translations (English, German, Russian) have been excluded from this specific audio dataset. A separate parallel-text dataset may be released in the future.
Dataset Alphabet
The following are all unique characters encountered in the sentence column of this dataset. Uppercase letters are listed first, then lowercase letters, and finally all other symbols:
A B C D E F G H I J K L M N O P R S T U V X Z Ï Ö Ü Č Ń Š Ž Ɨ А Б В Г Д Е Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ы Ь Э Ю Я Ү Һ Ӈ Ө a b c d e f g h i j k l m n o p r s t u v w x y z ï ö ü č ń ŋ š ž ɨ ʒ γ χ а б в г д е ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я ё ү һ ӈ ө ! " ' ( ) , - . / 3 : ; = ? [ ] _ ʼ ː ͡ – “ ” …
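A character inventory of this kind can be recomputed from the sentence column, which is useful for building a tokenizer vocabulary or spotting stray symbols. The sketch below is one possible approach and not necessarily how the list above was produced: it ignores whitespace, groups characters with Python's str.isupper and str.islower, and orders each group by code point. The function name dataset_alphabet is hypothetical.

```python
def dataset_alphabet(sentences):
    """Return unique non-space characters: uppercase, lowercase, then others."""
    chars = set()
    for sentence in sentences:
        chars.update(ch for ch in sentence if not ch.isspace())

    def group(ch):
        if ch.isupper():
            return 0
        if ch.islower():
            return 1
        return 2  # digits, punctuation, combining marks, and so on

    # Sort by group first, then by Unicode code point within each group.
    return sorted(chars, key=lambda ch: (group(ch), ch))


# Made-up sentences mixing Latin and Cyrillic characters.
alphabet = dataset_alphabet(["Ab, b", "Өc!"])
print(" ".join(alphabet))  # prints: A Ө b c ! ,
```

Because the sentence column mixes Cyrillic, Latin, and phonological text, filtering rows by tier before calling such a function would give a per-script inventory.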
Copyright, License & Required Citation
The original corpus was created within the INEL project ("Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages") at the Universität Hamburg. The project was funded by the Academies' Programme (coordinated by the Union of the German Academies of Sciences and Humanities).
License: The data is provided under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Required Citation: If you use this dataset in your research, training runs, or derivative works, please cite the source corpus as follows:
Däbritz, Chris Lasse, Kudryakova, Nina, & Stapert, Eugénie. (2022). INEL Dolgan Corpus (Version 2.0) [Data set]. http://doi.org/10.25592/uhhfdm.11165