INEL Selkup Speech Corpus


License:

CC-BY-NC-SA-4.0


Steward:

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

Task: ASR

Release Date: 3/24/2026

Format: TSV, MP3

Size: 45.46 MB



Description

This dataset is a machine-learning-ready subset of the INEL Selkup Corpus (Version 2.0), prepared for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training. It converts the detailed EXMARaLDA XML annotations into the tabular layout used by Mozilla Common Voice. The dataset comprises 1 hour and 39 minutes of sentence-aligned, transcribed speech (1,286 clips) from 15 speakers, largely originating from the 1960s and 1970s archive of linguist Angelina Kuzmina. It includes demographic metadata where available. Transcriptions use the Cyrillic tier where it exists and fall back to the Latin or phonological tier otherwise, so that every clip has text for acoustic modeling.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

The source corpus (Brykina, Maria; Orlova, Svetlana; Wagner-Nagy, Beáta. 2021. "INEL Selkup Corpus." Version 2.0.) must be cited in all derivative works and research using this dataset.

Forbidden Usage

None

Metadata


Overview

This dataset is a machine-learning-ready subset of the INEL Selkup Corpus (Version 2.0), prepared for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training under the Mozilla Data Collective initiative. It converts the detailed EXMARaLDA XML annotations into the tabular layout used by Mozilla Common Voice.

The source material largely originates from the archive of the Russian linguist Angelina Kuzmina, who worked extensively with native speakers of different Selkup dialects in the 1960s and 1970s.

About the Selkup People and Language

Selkup (ISO 639-3: `sel`) belongs to the Samoyedic branch of the Uralic language family. It is spoken in Western Siberia, between the Ob and Yenisei rivers, in the Yamalo-Nenets Autonomous Okrug, Krasnoyarsk Krai, and Tomsk Oblast. Its dialects are grouped into Northern, Central, Southern, and Ket.

Selkup is critically endangered. While over 3,600 people identified as Selkups in the 2010 census, the language is spoken or understood by only a few dozen people, mostly native speakers of the Northern dialects, with the other varieties being almost extinct.

Statistics

  • Language: Selkup (`sel`)

  • Total Speakers: 15

  • Total Audio Clips: 1,286 (sentence-level)

  • Total Audio Duration: 01:39:24 (HH:MM:SS)

  • Audio Format: MP3
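As a quick sanity check on the figures above, the total duration and clip count imply an average clip length of a little under five seconds. The sketch below only redoes that arithmetic; the helper name `hms_to_seconds` is illustrative and not part of the dataset.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS string into a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures copied from the Statistics list above.
TOTAL_DURATION = "01:39:24"
TOTAL_CLIPS = 1286

total_seconds = hms_to_seconds(TOTAL_DURATION)
mean_clip_seconds = total_seconds / TOTAL_CLIPS

print(f"{total_seconds} s total, about {mean_clip_seconds:.2f} s per clip")
# prints "5964 s total, about 4.64 s per clip"
```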

Data Format

The dataset provides audio clips aligned at the sentence level alongside a `train.tsv` metadata file.

The `train.tsv` file contains the following fields, which follow the Common Voice schema:

  • `client_id`: Unique speaker abbreviation.

  • `path`: The filename of the corresponding audio clip.

  • `sentence`: The primary transcribed text of the audio.

  • `age`: The calculated age of the speaker at the time of recording.

  • `gender`: `male` or `female`.

  • `accents`: Specific dialect information (e.g., Taz, Upper Tolka, Narym, Ket).

  • `locale`: Locale code (`sel`).
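A minimal sketch of reading the metadata with Python's standard `csv` module is shown here. It assumes the file is UTF-8, tab-separated, has a header row with the field names listed above, and does not use quote characters (as is typical for Common Voice TSVs); these are assumptions, not statements about how the file was produced. The function names `read_clips` and `clips_per_speaker` are illustrative.

```python
import csv
from collections import defaultdict


def read_clips(tsv_file):
    """Parse an open, tab-separated text file into a list of row dicts.

    Column names are taken from the header row, so column order does not matter.
    QUOTE_NONE keeps quote characters inside transcriptions as literal text.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def clips_per_speaker(rows):
    """Count how many clips each `client_id` contributes."""
    counts = defaultdict(int)
    for row in rows:
        counts[row["client_id"]] += 1
    return dict(counts)
```

Typical use would be to open the file with `open("train.tsv", encoding="utf-8", newline="")`, pass the handle to `read_clips`, and then inspect `clips_per_speaker(rows)` to see how unevenly the 15 speakers are represented before planning a train/validation split.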

Understanding the Text Columns

The original corpus contains multiple transcription tiers. To ensure every audio clip has a primary transcription for STT training, the `sentence` column is filled from the first available tier in the following order:

  1. `st`: Source transcription (Cyrillic orthography).

  2. `stl`: Source transcription (Latin script).

  3. `ts`: Phonological transcription (IPA-like characters).
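The fallback order above can be sketched as a small function. This is an illustration of the described hierarchy, not the conversion script actually used to build the dataset; it assumes that a missing tier shows up as an empty or whitespace-only value, and the names `TIER_PRIORITY` and `pick_sentence` are hypothetical.

```python
# Tier names in the priority order described above.
TIER_PRIORITY = ("st", "stl", "ts")


def pick_sentence(tiers):
    """Return (tier_name, text) for the first non-empty tier.

    `tiers` maps tier names to transcription strings (or None).
    Returns (None, "") when every tier is missing or blank.
    """
    for name in TIER_PRIORITY:
        text = (tiers.get(name) or "").strip()
        if text:
            return name, text
    return None, ""
```

For example, a clip with an empty `st` tier but a filled `stl` tier would yield the Latin-script text, and the phonological `ts` tier would only be used when both source transcriptions are absent.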

Because the `sentence` field can contain Cyrillic, Latin, or phonological text depending on which tiers exist in the original file, the raw `st`, `stl`, and `ts` columns are also preserved in the TSV. To determine programmatically which tier supplied the `sentence` value on a given row, compare it against each tier column in priority order: `sentence == st`, then `sentence == stl`, then `sentence == ts`.
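The equality test can be wrapped in a helper that labels each row and tallies the result, which is useful when a model should be trained on a single script. This is a sketch that assumes rows are dicts keyed by the column names above and that an absent tier is an empty string; `sentence_tier` and `tier_counts` are illustrative names.

```python
from collections import Counter


def sentence_tier(row):
    """Return which tier ("st", "stl" or "ts") matches the `sentence` value.

    Tiers are checked in priority order, so if two tiers happen to hold
    identical text the higher-priority one is reported. Empty tiers never match.
    """
    sentence = row.get("sentence")
    for name in ("st", "stl", "ts"):
        value = row.get(name)
        if value and sentence == value:
            return name
    return "unknown"


def tier_counts(rows):
    """Tally how many rows take their sentence from each tier."""
    return Counter(sentence_tier(row) for row in rows)
```

Filtering with `[r for r in rows if sentence_tier(r) == "st"]` would then keep only the clips whose primary transcription is in Cyrillic orthography.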

Note: To maintain a streamlined, single-task focus on Speech-to-Text modeling, extensive linguistic annotations such as morphological glossing and secondary translations (English, German, Russian) have been excluded from this specific audio dataset.

Dataset Alphabet

The list below contains every unique character that occurs in the `sentence` column of this dataset, sorted with uppercase letters first, then lowercase letters, then all other symbols. Combining diacritics appear as separate entries.

А Б В Г Д Е И К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ы Э Я Ѳ Ҷ Ӄ Ӣ Ӧ Ө Ӱ i j k l o w ä ö ō ǝ ə ɛ ɣ а б в г д е ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я ё і ѳ қ ң ҳ ҷ ӄ ӈ ӓ ә ӣ ӧ ө ӯ ӱ ӷ ! ( ) , - . / 0 1 2 3 4 5 6 7 8 9 : ? « » ʼ ́ ̂ ̃ ̄ ̊ ̨ – — … ′ ‵
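A character inventory like the one above can be rebuilt from the `sentence` values with a few lines of Python. The sketch below groups characters as uppercase, lowercase, and other, and sorts by Unicode code point within each group; that ordering appears consistent with the list above, but the method actually used is not stated, so treat it as an assumption. The name `dataset_alphabet` is illustrative.

```python
def dataset_alphabet(sentences):
    """Return the unique non-whitespace characters found in `sentences`.

    Ordering: uppercase letters, then lowercase letters, then everything
    else (digits, punctuation, combining marks), each by code point.
    """
    chars = set()
    for sentence in sentences:
        chars.update(ch for ch in sentence if not ch.isspace())

    def group(ch):
        if ch.isupper():
            return 0
        if ch.islower():
            return 1
        return 2

    return sorted(chars, key=lambda ch: (group(ch), ch))
```

Comparing the result with the published list is one way to check that a local copy of `train.tsv` was read with the right encoding, and the output can serve as a starting vocabulary for a character-level ASR model.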

Copyright, License & Required Citation

Created within the long-term INEL project ("Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages").

License: CC BY-NC-SA 4.0

Required Citation: If you use this dataset in your research, training runs, or derivative works, please cite the source corpus as follows:

Brykina, Maria; Orlova, Svetlana; Wagner-Nagy, Beáta. 2021. "INEL Selkup Corpus." Version 2.0. Publication date 2021-12-31. https://hdl.handle.net/11022/0000-0007-F4D9-1. Archived at Universität Hamburg. In: The INEL corpora of indigenous Northern Eurasian languages. https://hdl.handle.net/11022/0000-0007-F45A-1