INEL Kalmyk Speech Corpus

License:

CC-BY-NC-SA-4.0

Steward:

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

Task: ASR

Release Date: 3/24/2026

Format: TSV, MP3

Size: 138.31 MB


Description

This dataset is a machine-learning-ready subset of the INEL Kalmyk Corpus (Version 1.0), prepared for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training. It converts the detailed EXMARaLDA XML annotations into the tabular layout used by Mozilla Common Voice. The dataset comprises 3 hours, 15 minutes, and 58 seconds of sentence-aligned speech (1,934 clips) from 26 speakers. The primary 'sentence' column is populated exclusively from the 'ts' tier (scientific transcription) so that the text stays phonetically faithful to the source; demographic metadata is included where available.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

The source corpus (Baranova, Vlada. 2025. INEL Kalmyk Corpus. Version 1.0.) must be cited in all derivative works and research using this dataset.

Forbidden Usage

None

Metadata

Overview

This dataset is a machine-learning-ready subset of the INEL Kalmyk Corpus (Version 1.0), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training.

About the Kalmyk People and Language

Kalmyk (ISO 639-3: xal) is a Mongolic language spoken primarily by the Kalmyk people in the Republic of Kalmykia (Russian Federation). The recordings in this corpus were collected between 2007 and 2018 in the Ketchenerovsky District and predominantly feature the Derbet and Torgut dialects.

Statistics

  • Language: Kalmyk (xal)

  • Total Speakers: 26

  • Total Audio Clips: 1,934 (one sentence per clip)

  • Total Audio Duration: 03:15:58 (HH:MM:SS)

  • Audio Format: MP3
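As a quick sanity check on these figures, the arithmetic below derives the total duration in seconds, the mean clip length, and the mean number of clips per speaker. It is a minimal sketch that uses only the numbers listed above; the variable and function names are illustrative and not part of the dataset.

```python
# Figures copied from the Statistics list above.
TOTAL_DURATION = "03:15:58"  # HH:MM:SS
TOTAL_CLIPS = 1934
TOTAL_SPEAKERS = 26


def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


total_seconds = hms_to_seconds(TOTAL_DURATION)
mean_clip_seconds = total_seconds / TOTAL_CLIPS
mean_clips_per_speaker = TOTAL_CLIPS / TOTAL_SPEAKERS

print(total_seconds)                     # 11758
print(round(mean_clip_seconds, 2))       # 6.08
print(round(mean_clips_per_speaker, 1))  # 74.4
```

These are averages only; the actual distribution of clip lengths and of clips per speaker is not stated here and may be uneven.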

Data Format

The dataset provides audio clips aligned at the sentence level alongside a train.tsv metadata file.

The train.tsv file contains the following fields, which follow the Common Voice layout:

  • client_id: Unique speaker abbreviation.

  • path: The filename of the corresponding audio clip.

  • sentence: The primary transcribed text of the audio.

  • age: The calculated age of the speaker at the time of recording.

  • gender: male or female.

  • accents: Specific dialect information (e.g., Derbet or Torgut dialect).

  • locale: Locale code (xal).
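The fields above can be read with Python's standard csv module. The following is a minimal sketch rather than an official loader: the helper names, the column order, and the sample rows are invented for illustration, and it assumes the file is plain tab-separated text without quoting (the sentence text may contain literal double-quote characters, so quote handling is switched off).

```python
import csv
import io
from collections import Counter

# Field names as listed above; their order in the real file is an assumption.
EXPECTED_FIELDS = ["client_id", "path", "sentence", "age", "gender", "accents", "locale"]


def read_rows(handle):
    """Read tab-separated rows into dicts and check the expected columns exist."""
    # QUOTE_NONE keeps literal double quotes inside sentences untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [name for name in EXPECTED_FIELDS if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def clips_per_speaker(rows):
    """Count how many clips each client_id contributes."""
    return Counter(row["client_id"] for row in rows)


# Invented placeholder rows, NOT real corpus content.
SAMPLE = (
    "client_id\tpath\tsentence\tage\tgender\taccents\tlocale\n"
    "SPK_A\tclip_0001.mp3\tplaceholder text\t60\tfemale\tDerbet\txal\n"
    'SPK_A\tclip_0002.mp3\the said "yes"\t60\tfemale\tDerbet\txal\n'
)

rows = read_rows(io.StringIO(SAMPLE))
print(len(rows))                        # 2
print(clips_per_speaker(rows)["SPK_A"])  # 2
print(rows[1]["sentence"])              # he said "yes"
```

For the real file, the same function could be called on a handle opened with `open("train.tsv", encoding="utf-8", newline="")`, assuming the file sits in the working directory.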

Understanding the Text Columns

The original INEL Kalmyk corpus does not provide transcriptions in Kalmyk Cyrillic. To keep the training text as phonetically faithful to the source as possible, this dataset populates the primary sentence column exclusively from the ts tier (scientific transcription) of the original EXMARaLDA files.

Future Release Note: We plan to update the sentence column in future releases to use the official Kalmyk Cyrillic orthography. When this update occurs, the original INEL scientific transcriptions will remain fully preserved in the ts column.

The raw INEL tier is also preserved as a separate column in the TSV for reference:

  • ts: Scientific/phonological transcription (exact source for the current sentence column).
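Because the ts column is described as the exact source of the current sentence column, the two should be identical in this release. A minimal sketch of that check follows; it assumes the rows have already been loaded as dictionaries keyed by column name, and the function name and sample rows are invented for illustration.

```python
def paths_where_sentence_differs(rows):
    """Return the clip paths whose sentence text is not identical to the ts text."""
    return [row["path"] for row in rows if row["sentence"] != row["ts"]]


# Invented placeholder rows, NOT real corpus content.
sample_rows = [
    {"path": "clip_0001.mp3", "sentence": "placeholder one", "ts": "placeholder one"},
    {"path": "clip_0002.mp3", "sentence": "placeholder two", "ts": "placeholder 2"},
]

print(paths_where_sentence_differs(sample_rows))  # ['clip_0002.mp3']
```

An empty list is the expected result for the current release. After the planned switch of the sentence column to Kalmyk Cyrillic, the two columns would be expected to differ.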

Note: To maintain a streamlined, single-task focus on Speech-to-Text modeling, extensive linguistic annotations such as morphological glossing and secondary translations (English, Russian) have been excluded from this specific audio dataset. A separate parallel-text dataset may be released in the future.

Dataset Alphabet

The following lists every unique character that occurs in the sentence column of this dataset. Uppercase letters come first, then lowercase letters, then all other symbols (digits, punctuation, and combining marks):

A B C D E G I J K L M N O P R S T U V X Z Ä Ö Ü Č Š А Б В Г Д Е Ж З И К Л М Н О П Р С Т У Ф Х Ц Ч Ш Э Я a b c d e f g h i j k l m n o p q r s t u v x y z ä ö ü ă č ŋ š ž ǝ ǯ ə ʁ а б в г д е ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я ё ә ! " % ( ) , - . 0 1 2 3 : ; = ? _ ʼ ̈ – — “ ” …
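A character inventory of this kind can be rebuilt from the sentence column with a few lines of Python. The sketch below is one possible way to do it, not necessarily the procedure used to produce the list above: it assumes whitespace is excluded and that characters are ordered by Unicode code point within each group, which seems consistent with the listed order. The function name and the sample sentences are invented.

```python
def dataset_alphabet(sentences):
    """Collect unique non-whitespace characters, grouped as upper, lower, other."""
    chars = {ch for sentence in sentences for ch in sentence if not ch.isspace()}

    def sort_key(ch):
        if ch.isupper():
            group = 0  # uppercase letters first
        elif ch.islower():
            group = 1  # then lowercase letters
        else:
            group = 2  # digits, punctuation, combining marks, and the rest
        return (group, ord(ch))  # code point order inside each group (assumption)

    return sorted(chars, key=sort_key)


# Invented placeholder sentences, NOT real corpus content.
print(" ".join(dataset_alphabet(["Ä b!", "а Б 1 ŋ"])))  # Ä Б b ŋ а ! 1
```

Running such a check against the sentence column is a convenient way to define the output vocabulary for a character-level ASR model, or to spot unexpected symbols before training.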

Copyright, License & Required Citation

Created within the long-term INEL project ("Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages").

License: CC BY-NC-SA 4.0

Required Citation: If you use this dataset in your research, training runs, or derivative works, please cite the source corpus as follows:

Baranova, Vlada. 2025. INEL Kalmyk Corpus. Archived at Universität Hamburg. Version 1.0. Publication date 2025-07-17. https://hdl.handle.net/11022/0000-0007-FFB1-2. In: The INEL Corpora of Indigenous Northern Eurasian Languages. https://hdl.handle.net/11022/0000-0007-F45A-1.