Laari-TTS-Dataset

License:

NOODL-1.0

Steward:

Institute of African Digital Humanities

Task: ASR

Release Date: 12/12/2025

Format: WAV, TRJS, TSV

Size: 568.26 MB

Description

The dataset contains audio and text resources on Laari, a Bantu language spoken in the Congo. The resources, which are suitable for TTS tasks and possibly ASR tasks, consist of the following:

- 6,311 audio clips totalling 241 minutes and 44.97 seconds;
- an audio mapping file with 5,321 lines, each beginning with the name of an audio file, followed by a tab and then the corresponding text excerpt;
- two raw audio files totalling 120 minutes and 54.90 seconds;
- two long audio files with their original, non-split transcription files, for a total duration of 120 minutes and 41.90 seconds.

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

Although efforts were made to ensure consistency in transcribing the audio recordings, the orthography may not be strictly standardised. This is because, although the transcribers are trained linguists, they do not specialise in African linguistics. Users who wish to apply this dataset to TTS tasks may need to review the orthography to ensure it is standardised.

Forbidden Usage

Generative AI, reproduction, duplication, modification, augmentation, copying, distribution, transmission, display, sale, transfer, publication or creation of derivative works without the explicit permission of the legal owner of the dataset.

Processes

Intended Use

The dataset is suitable for speech-related tasks.

- Automatic speech recognition (ASR): Audio–text alignment allows the training or evaluation of speech recognition models for Laari, which is a valuable tool for language technology. Please be aware that the orthography used in the transcription of audio recordings may not be strictly standardized.
- Text-to-speech (TTS): The dataset contains clean sentence–audio pairs read by two female speakers, totalling 6.1 hours. This makes the dataset suitable for evaluating text-to-speech models. Again, be aware that the orthography used in the transcription of audio recordings may not be strictly standardized.

Metadata

Language

Laari (ldi) is a language of the Niger-Congo family spoken by the Lari people, a subgroup of the Kongo, primarily in the Pool Region and in cities such as Brazzaville in the Republic of the Congo. It serves as a major local language and is known for distinctive sound correspondences, such as [DI] often becoming [RI] and [K] becoming [C] or [TS]. It is considered stable but faces potential endangerment because it is not taught in schools, though efforts to produce bilingual dictionaries exist. Laari has variant names such as Ladi, Laadi, Baladi, Balari, Kilari, Hangala, and Ghaangala.

Variants

Alphabet

The following is not a formal Laari alphabet, but rather the set of graphemes used to transcribe audio files in this dataset: a, b, ch, d, dz, e, f, g, h, i, k, l, m, n, o, p, r, s, t, tch, u, v, w, y, z, mb, mp, nd, ng, nk, ns, nt, nz, aa, ee, ii, oo, uu.
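The grapheme set above can be used to check transcriptions for characters that fall outside it. The following is a minimal sketch, not part of the dataset tooling: the function name `segment` and the greedy longest-match strategy are assumptions made for illustration, and a greedy match cannot tell whether a sequence such as "ng" was intended as one grapheme or two.

```python
# Grapheme set copied from the list above.
GRAPHEMES = [
    "a", "b", "ch", "d", "dz", "e", "f", "g", "h", "i", "k", "l", "m",
    "n", "o", "p", "r", "s", "t", "tch", "u", "v", "w", "y", "z",
    "mb", "mp", "nd", "ng", "nk", "ns", "nt", "nz",
    "aa", "ee", "ii", "oo", "uu",
]

# Try longer graphemes first so that "tch" wins over "t" and "nt" over "n".
_BY_LENGTH = sorted(set(GRAPHEMES), key=len, reverse=True)


def segment(word):
    """Split one word into graphemes (hypothetical helper).

    Returns (tokens, unknown), where `unknown` lists every character
    that matched no grapheme in the set.
    """
    text = word.lower()
    tokens, unknown = [], []
    i = 0
    while i < len(text):
        for grapheme in _BY_LENGTH:
            if text.startswith(grapheme, i):
                tokens.append(grapheme)
                i += len(grapheme)
                break
        else:
            # No grapheme matched: keep the character and flag it.
            tokens.append(text[i])
            unknown.append(text[i])
            i += 1
    return tokens, unknown


if __name__ == "__main__":
    for word in ["bantu", "sarichi", "m\u00e9so"]:
        print(word, segment(word))
```

Characters reported in `unknown` are candidates for manual review; for example, accented vowels in a transcription would be flagged because the grapheme set lists only unaccented vowels.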

Source

This dataset was created using self-audio recordings of two female native speakers. The speakers then transcribed the recordings. This task aimed to produce datasets suitable for developing text-to-speech models for the Laari language. The speakers were guided through the process using open questions provided by the research coordinator.

Domain

The questions that prompted the speech recorded by the native speakers of Laari covered a variety of domains relevant to the cultural practices of the Laari community, and mostly pertained to the following genres: procedural, opinion and philosophical.

Size

Total size is 568.26 MB.

Structure

6,311 audio clips totalling 241 minutes and 44.97 seconds;
an audio mapping file with 5,321 lines, each beginning with the name of an audio file, followed by a tab and then the corresponding text excerpt;
two raw audio files totalling 120 minutes and 54.90 seconds;
two long audio files with their original, non-split transcription files, for a total duration of 120 minutes and 41.90 seconds.
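
The mapping file layout described above (audio file name, a tab, then the text excerpt) can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the function names `parse_mapping` and `drop_silence` are hypothetical, the file is assumed to be UTF-8 text, and treating "(silence)" as a marker for clips without speech is inferred from the sample entries rather than documented.

```python
def parse_mapping(lines):
    """Parse mapping lines of the form '<audio file name>\\t<text>'.

    Returns (entries, malformed): `entries` is a list of
    (file_name, text) pairs, `malformed` lists the 1-based numbers
    of non-empty lines that contain no tab.
    """
    entries, malformed = [], []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        name, sep, text = line.partition("\t")
        if not sep:
            malformed.append(number)
            continue
        entries.append((name.strip(), text.strip()))
    return entries, malformed


def drop_silence(entries, marker="(silence)"):
    """Remove entries whose text is only the assumed silence marker."""
    return [(name, text) for name, text in entries if text != marker]


if __name__ == "__main__":
    sample = [
        "Laari-TTS-Dataset_12_T1899_T1900.wav\t(silence)\n",
        "Laari-TTS-Dataset_12_T1874_T1875.wav\tntia na tima tia mpamba mpamba\n",
    ]
    entries, malformed = parse_mapping(sample)
    print(drop_silence(entries), malformed)
```

To read a real file, pass an open file object, for example `parse_mapping(open(path, encoding="utf-8"))`, where `path` points to the mapping file in your copy of the dataset.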

Sample

Laari-TTS-Dataset_12_T1923_T1924.wav bio biabiantsoni bi tu sarichi mu tula wuma
Laari-TTS-Dataset_12_T1931_T1932.wav bantu bo ba lambula moko mu tu sarisa
Laari-TTS-Dataset_12_T1899_T1900.wav (silence)
Laari-TTS-Dataset_12_T1900_T1901.wav ra ba téla wo yenda ku zulu kwangu wa tékéla ko
Laari-TTS-Dataset_12_T1874_T1875.wav ntia na tima tia mpamba mpamba
Laari-TTS-Dataset_12_T1868_T1869.wav buna bu wa ba binkuti pélé, timpéné
Laari-TTS-Dataset_12_T1837_T1838.wav ni bu tu ta wangana kwa ra méso ma bantu
Laari-TTS-Dataset_12_T1814_T1815.wav ku ba na lulendo ra méso ma bantu bana ba ku sarichi ko