INEL Enets Speech Corpus
License:
CC-BY-NC-SA-4.0
Steward:
Institute of Finno-Ugric/Uralic Studies, University of Hamburg
Task: ASR
Release Date: 3/24/2026
Format: TSV, MP3
Size: 140.56 MB
Description
This dataset is a machine-learning-ready subset of the INEL Enets Corpus (Version 1.1), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training. It converts the highly detailed EXMARaLDA XML annotations into the standard tabular layout used by Mozilla Common Voice. The dataset comprises 3 hours and 41 minutes of aligned, supervised speech data (3,755 individual clips) from 23 speakers. It includes demographic metadata where available and prioritizes Cyrillic transcriptions, falling back to the Latin or phonological tiers to ensure complete text coverage for acoustic modeling.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
The source corpus (Shluinsky, Andrey; Khanina, Olesya; Wagner-Nagy, Beáta. 2025. INEL Enets Corpus. Version 1.1.) must be cited in all derivative works and research using this dataset.
Forbidden Usage
None
Metadata
INEL Enets Speech Corpus
Overview
This dataset is a machine-learning-ready subset of the INEL Enets Corpus (Version 1.1), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training under the Mozilla Data Collective initiative. It converts the highly detailed EXMARaLDA XML annotations into the standard tabular layout used by Mozilla Common Voice.
About the Enets People and Language
Enets belongs to the Samoyedic group of the Uralic language family and is spoken in the western part of the Taymyr Peninsula in Central Siberia. There are two distinct Enets lects: Tundra Enets (ISO 639-3: `enh`) and Forest Enets (ISO 639-3: `enf`), both of which are represented in this corpus (with Forest Enets making up the majority of the data).
Both lects are critically endangered. Intergenerational transmission of the language has been interrupted, and full command of Enets is retained by at most 25-30 people overall, all of whom belong to the elder generation.
Statistics
Language: Enets (`enf` / `enh`)
Total Speakers: 23
Total Audio Clips: 3,755 (one sentence per clip)
Total Audio Duration: 03:41:47 (HH:MM:SS)
Audio Format: MP3
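As a quick sanity check on the figures above, the total duration and clip count imply an average clip length of roughly three and a half seconds. The short sketch below only does that arithmetic; the helper name `hms_to_seconds` is illustrative and not part of the dataset tooling.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures taken from the Statistics list above.
total_seconds = hms_to_seconds("03:41:47")   # 13307 seconds
average_clip_seconds = total_seconds / 3755  # about 3.54 seconds per clip
```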
Data Format
The dataset provides audio clips aligned at the sentence level alongside a `train.tsv` metadata file.
The `train.tsv` file contains the following fields, which follow the Common Voice layout:
`client_id`: Unique speaker abbreviation.
`path`: The filename of the corresponding audio clip.
`sentence`: The primary transcribed text of the audio.
`age`: The calculated age of the speaker at the time of recording.
`gender`: `male` or `female`.
`accents`: Specific dialect information (e.g., Forest Enets or Tundra Enets).
`locale`: Locale code (extracted dynamically per speaker; e.g., `enf` or `enh`).
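The field list above can be read with Python's standard `csv` module. The following is a minimal sketch, assuming the file is UTF-8 encoded and unquoted, as Common Voice TSVs usually are; neither detail is stated in this description, and the function names `load_clips` and `clips_per_speaker` are hypothetical.

```python
import csv
from collections import Counter


def load_clips(tsv_path):
    """Read a Common Voice style TSV into a list of dicts, one per clip."""
    # Assumption: UTF-8 text with tab separators and no CSV quoting.
    # QUOTE_NONE keeps double quotes as literal characters, which matters
    # because quotation marks occur inside the transcriptions.
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


def clips_per_speaker(rows):
    """Count clips for each `client_id` (speaker abbreviation)."""
    return Counter(row["client_id"] for row in rows)
```

Calling `load_clips("train.tsv")` and passing the result to `clips_per_speaker` gives a per-speaker clip count, which is useful for checking how unevenly the 23 speakers are represented before splitting the data.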
Understanding the Text Columns
The original corpus contains multiple transcription tiers. To ensure every audio clip has a primary transcription for STT training, the `sentence` column is filled from the first available tier in the following strict fallback order:
`st`: Source transcription (Cyrillic orthography).
`stl`: Source transcription (Latin script).
`ts`: Phonological transcription (IPA-like characters).
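The fallback order listed above could be expressed roughly as follows. This is an illustrative sketch, not the conversion code actually used to build the dataset; it assumes that a missing tier appears as an empty or whitespace-only string, and the names `TIER_PRIORITY` and `pick_sentence` are hypothetical.

```python
# Tier names in priority order: Cyrillic, then Latin, then phonological.
TIER_PRIORITY = ("st", "stl", "ts")


def pick_sentence(tiers):
    """Return the first non-empty tier text in priority order, or None."""
    for name in TIER_PRIORITY:
        # Assumption: an absent tier is None, empty, or only whitespace.
        text = (tiers.get(name) or "").strip()
        if text:
            return text
    return None
```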
Because the `sentence` field can contain Cyrillic, Latin, or Phonological text depending on the original file's availability, the raw `st`, `stl`, and `ts` columns are also preserved in the TSV. To programmatically determine which alphabet/tier was used for the `sentence` column on any given row, consumers can simply test for equality: `sentence == st`, `sentence == stl`, or `sentence == ts`.
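The equality test described above can be wrapped in a small helper. The sketch below assumes each row is a dict keyed by column name (for example, as produced by `csv.DictReader`); the function names are hypothetical. Tiers are checked in priority order, so if two tiers happen to hold identical text, the higher-priority one is reported.

```python
from collections import Counter


def sentence_tier(row):
    """Return which tier ("st", "stl" or "ts") supplied `sentence`, or None."""
    for name in ("st", "stl", "ts"):
        value = row.get(name)
        # Skip empty tiers so an empty `sentence` never matches by accident.
        if value and row.get("sentence") == value:
            return name
    return None


def tier_counts(rows):
    """Count how many rows drew their `sentence` from each tier."""
    return Counter(sentence_tier(row) for row in rows)
```

Running `tier_counts` over all rows shows how much of the training text is Cyrillic versus Latin or phonological, which helps when deciding whether to train on a single script or to filter out the fallback rows.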
Note: To maintain a streamlined, single-task focus on Speech-to-Text modeling, extensive linguistic annotations such as morphological glossing and secondary translations (English, German, Russian) have been excluded from this specific audio dataset.
Dataset Alphabet
The following are all unique characters found in the `sentence` column of this dataset, listed as uppercase letters first, then lowercase letters, and finally all other symbols:
Ç Ŋ Ɛ А Б В Г Д Е Ж З И Й К Л М Н О П Р С Т У Х Ч Ш Щ Ъ Э Я Ҫ Ӈ Ԑ a e f h i k n o t u x z ç ô ŋ ȯ ɔ ɛ ʲ ˀ ε а б в г д е ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я ё ҫ ӈ ӡ ӣ ԑ ! " ' ( ) , - . / 0 2 6 8 9 : ? ] | « » ʔ ʺ ́ ̄ – — ’ “ ” …
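A listing like the one above can be recomputed from the `sentence` column, for instance to build a tokenizer vocabulary. The sketch below is one way to do it and is not necessarily how the published list was generated; it assumes whitespace is excluded and that characters within each group are ordered by Unicode code point. The function name `dataset_alphabet` is hypothetical.

```python
def dataset_alphabet(sentences):
    """Collect unique non-whitespace characters, grouped by letter case."""
    chars = set()
    for sentence in sentences:
        chars.update(ch for ch in sentence if not ch.isspace())

    def group(ch):
        # 0 = uppercase, 1 = lowercase, 2 = everything else
        # (digits, punctuation, combining marks, caseless letters).
        if ch.isupper():
            return 0
        if ch.islower():
            return 1
        return 2

    # Sort by group first, then by code point within each group.
    return sorted(chars, key=lambda ch: (group(ch), ch))
```

Note that combining marks such as the acute accent and macron are counted as separate characters here, so a tokenizer built from this list sees them as standalone symbols unless the text is normalized first.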
Copyright, License & Required Citation
Created within the long-term INEL project ("Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages").
License: CC BY-NC-SA 4.0
Required Citation: If you use this dataset in your research, training runs, or derivative works, please cite the source corpus as follows:
Shluinsky, Andrey; Khanina, Olesya; Wagner-Nagy, Beáta. 2025. INEL Enets Corpus. Version 1.1. Publication date 2025-12-31. https://hdl.handle.net/11022/0000-0008-005C-1. Archived at Universität Hamburg. In: The INEL corpora of indigenous Northern Eurasian languages. https://hdl.handle.net/11022/0000-0007-F45A-1