INEL Nenets Speech Corpus

License: CC-BY-NC-SA-4.0

Steward: Institute of Finno-Ugric/Uralic Studies, University of Hamburg

Task: ASR

Release Date: 2026-03-24

Format: TSV, MP3

Size: 8.35 MB



Description

This dataset is a machine-learning-ready subset of the INEL Nenets Corpus (Version 1.0), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training. It converts the detailed EXMARaLDA XML annotations into the standard tabular layout used by Mozilla Common Voice. The dataset comprises over 36 minutes of aligned, supervised speech data (447 individual clips). The primary 'sentence' column is populated strictly from the 'st' tier (Cyrillic source transcription) to ensure orthographic accuracy, and demographic metadata is included where available.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

The source corpus (Budzisch, Josefina; Wagner-Nagy, Beáta. 2024. INEL Nenets Corpus. Version 1.0.) must be cited in all derivative works and research using this dataset.

Forbidden Usage

None

Metadata


Overview

This dataset is a machine-learning-ready subset of the INEL Nenets Corpus (Version 1.0), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training.

About the Nenets People and Language

Nenets (ISO 639-3: `yrk`) belongs to the Samoyedic branch of the Uralic language family. It is spoken from the Kanin Peninsula in Europe to the Taimyr Peninsula in Northwestern Siberia. There are two main lects represented: Tundra Nenets and Forest Nenets.

Statistics

  • Language: Nenets (`yrk`)

  • Total Speakers: 1

  • Total Audio Clips: 447 (one sentence per clip)

  • Total Audio Duration: 00:36:53 (HH:MM:SS)

  • Audio Format: MP3
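As a quick sanity check on the figures above, the average clip length can be derived from the total duration and the clip count. The snippet below is only a worked calculation; the helper name `hms_to_seconds` is introduced here for illustration and is not part of the dataset tooling.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the statistics listed above.
total_seconds = hms_to_seconds("00:36:53")
clip_count = 447

mean_clip_seconds = total_seconds / clip_count
print(f"{total_seconds} s total, {mean_clip_seconds:.2f} s per clip on average")
# prints "2213 s total, 4.95 s per clip on average"
```

Clips of roughly five seconds on average are short enough that most ASR training pipelines should be able to use them without further segmentation.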

Data Format

The dataset provides audio clips aligned at the sentence level alongside a `train.tsv` metadata file.

The `train.tsv` file contains the following Common Voice-conforming fields:

  • `client_id`: Unique speaker abbreviation.

  • `path`: The filename of the corresponding audio clip.

  • `sentence`: The primary transcribed text of the audio.

  • `age`: The calculated age of the speaker at the time of recording.

  • `gender`: `male` or `female`.

  • `accents`: Specific dialect information (e.g., Tundra or Forest dialects).

  • `locale`: Locale code (`yrk`).
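The fields listed above can be read with Python's standard `csv` module. The following is a minimal sketch, not an official loader: the function name `read_metadata` is hypothetical, the sample row is a placeholder rather than real data, and the sketch assumes the file is UTF-8 encoded, tab-separated, and has a header row naming these columns.

```python
import csv
import io

# Column names as given in the field list above.
EXPECTED_COLUMNS = ["client_id", "path", "sentence", "age", "gender", "accents", "locale"]


def read_metadata(stream):
    """Parse a Common Voice style TSV stream into a list of dict rows."""
    # QUOTE_NONE keeps any quote characters inside sentences as literal text.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [col for col in EXPECTED_COLUMNS if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Placeholder row for illustration only; it is NOT taken from the dataset.
SAMPLE_TSV = (
    "\t".join(EXPECTED_COLUMNS) + "\n"
    + "\t".join(["SPK", "clip_0001.mp3", "<sentence text>", "", "", "", "yrk"]) + "\n"
)

rows = read_metadata(io.StringIO(SAMPLE_TSV))
print(rows[0]["path"], rows[0]["locale"])
```

To read the real file, the same function could be called as `read_metadata(open("train.tsv", encoding="utf-8", newline=""))`. Demographic fields such as `age` may be empty strings, since metadata is included only where available.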

Understanding the Text Columns

To ensure the highest orthographic accuracy for training, this dataset strictly populates the primary `sentence` column using the `st` tier (Cyrillic source transcription) from the original EXMARaLDA files.

Note: To maintain a streamlined, single-task focus on Speech-to-Text modeling, extensive linguistic annotations such as morphological glossing and secondary translations (English, Russian) have been excluded from this specific audio dataset.
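To illustrate what selecting a single tier involves, here is a rough sketch of pulling time-aligned events out of an EXMARaLDA basic transcription with Python's standard XML parser. It is not the pipeline used to build this dataset. It assumes the usual `.exb` layout, in which `tier` elements carry a `category` attribute and `event` elements refer to timeline items by `start` and `end`; the function name `tier_events`, the second tier's category `other`, and all sample text are placeholders invented for the example.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the assumed .exb layout; not real corpus content.
SAMPLE_EXB = """<basic-transcription>
  <basic-body>
    <common-timeline>
      <tli id="T0" time="0.0"/>
      <tli id="T1" time="2.5"/>
    </common-timeline>
    <tier id="st" category="st" type="a">
      <event start="T0" end="T1">placeholder source text</event>
    </tier>
    <tier id="other" category="other" type="a">
      <event start="T0" end="T1">placeholder annotation</event>
    </tier>
  </basic-body>
</basic-transcription>"""


def tier_events(xml_text, category="st"):
    """Return (start_seconds, end_seconds, text) for each event in the chosen tier."""
    root = ET.fromstring(xml_text)
    # Map timeline ids to absolute times; skip items without a time value.
    times = {
        tli.get("id"): float(tli.get("time"))
        for tli in root.iter("tli")
        if tli.get("time") is not None
    }
    events = []
    for tier in root.iter("tier"):
        if tier.get("category") != category:
            continue  # ignore every tier except the requested one
        for event in tier.iter("event"):
            text = (event.text or "").strip()
            events.append((times.get(event.get("start")), times.get(event.get("end")), text))
    return events


print(tier_events(SAMPLE_EXB))
# prints [(0.0, 2.5, 'placeholder source text')]
```

Each returned tuple gives the span needed to cut one audio clip plus the text that would go into the `sentence` column.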

Dataset Alphabet

The following are all unique characters that occur in the `sentence` column of this dataset. They are grouped as uppercase letters, then lowercase letters, and finally all other symbols:

А В Д И К Л М Н О П С Т Х Ч Ш Ю Я Ӈ x ă а б в г д е ж з и й к л м н о п р с т у ф х ц ч ш щ ы ь э ю я ё ӈ ӗ ӭ ԓ ! ( ) , - . : ? « » ̆ – ’ “ ” …
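A listing like the one above can be recomputed from the `sentence` column. The sketch below shows one way to do it, assuming characters are ordered by Unicode code point within each group (which matches the listing above) and that whitespace is ignored. The function name `dataset_alphabet` is hypothetical, and the sample strings are arbitrary character sequences, not sentences from the dataset.

```python
def dataset_alphabet(sentences):
    """Return unique non-whitespace characters: uppercase, lowercase, then the rest."""
    chars = {ch for sentence in sentences for ch in sentence if not ch.isspace()}

    def group(ch):
        if ch.isupper():
            return 0  # uppercase letters first
        if ch.islower():
            return 1  # then lowercase letters
        return 2  # punctuation, combining marks, and other symbols last

    # Within each group, order by Unicode code point.
    return sorted(chars, key=lambda ch: (group(ch), ch))


# Arbitrary sample input for illustration only.
print(" ".join(dataset_alphabet(["Ӈа, х!", "аб"])))
# prints "Ӈ а б х ! ,"
```

Standalone combining marks, such as the combining breve in the listing, are neither uppercase nor lowercase, so they fall into the final group. Checking a model's output vocabulary against this set is a simple way to catch normalization mismatches before training.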

Copyright, License & Required Citation

Created within the long-term INEL project ("Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages").

License: CC BY-NC-SA 4.0

Required Citation: If you use this dataset in your research, training runs, or derivative works, please cite the source corpus as follows:

Budzisch, Josefina; Wagner-Nagy, Beáta. 2024. INEL Nenets Corpus. Version 1.0. Publication date 2024-12-31. https://hdl.handle.net/11022/0000-0007-FE37-E. Archived at Universität Hamburg. In: The INEL corpora of indigenous Northern Eurasian languages. https://hdl.handle.net/11022/0000-0007-F45A-1