Teke-Laali-TTS-Dataset
License:
NOODL-1.0
Steward:
Institute of African Digital Humanities
Task: TTS
Release Date: 12/16/2025
Format: WAV, TSV
Size: 635.61 MB
Description
The dataset contains paired audio and text resources for Teke-Laali, a Bantu language spoken in the Republic of Congo. It consists of seven folders containing a total of 9,069 audio clips from raw audio recordings, with a total duration of 7:01:50.126 (HH:MM:SS.mmm). Additionally, there are seven audio/text mapping files containing a total of 9,069 lines, one per clip. The dataset is suitable for TTS tasks.
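As a quick sanity check on the figures above, the stated duration can be converted to seconds and hours and divided by the clip count. This is a minimal sketch: the helper name `duration_to_seconds` is hypothetical, and the mean clip length is derived here from the stated totals, not reported by the dataset itself.

```python
def duration_to_seconds(stamp):
    """Convert an HH:MM:SS.mmm string into seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Figures taken from the description: total duration and number of clips.
total_seconds = duration_to_seconds("7:01:50.126")
total_hours = total_seconds / 3600
mean_clip_seconds = total_seconds / 9069

print(f"{total_seconds:.3f} s, {total_hours:.2f} h, {mean_clip_seconds:.2f} s per clip")
# prints "25310.126 s, 7.03 h, 2.79 s per clip"
```

An average of under three seconds per clip suggests short utterances, which is worth keeping in mind when choosing minimum and maximum clip lengths for TTS training.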
Specifics
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Considerations
Restrictions/Special Constraints
Although efforts were made to ensure consistency in transcribing the audio recordings, the orthography may not be strictly standardized. This is because the transcriber, although a trained linguist, does not specialize in African linguistics. Users who wish to apply this dataset to TTS tasks may need to review the orthography to ensure it is standardized.
Forbidden Usage
Generative AI, reproduction, duplication, modification, augmentation, copying, distribution, transmission, display, sale, transfer, publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Processes
Intended Use
The dataset is suitable for the following speech-related tasks:
- Automatic speech recognition (ASR): Audio–text alignment allows the training or evaluation of speech recognition models for Teke-Laali, which is valuable for language technology. Please be aware that the orthography used in the transcriptions may not be strictly standardized.
- Text-to-speech (TTS): The dataset contains clean sentence–audio pairs read by one male speaker, totalling about 7 hours (7:01:50.126). This makes the dataset suitable for the training and evaluation of text-to-speech models. Again, be aware that the orthography used in the transcriptions may not be strictly standardized.
Metadata
Language
Teke-Laali (lli), also known as Ilaali, Laali, or West Teke, is a Bantu language spoken by the Teke people in the Republic of Congo. It belongs to the Niger-Congo family and is considered threatened because fewer young people are learning it, though adults in the community still use it as a first language at home.
Variants
Teke-Laali is part of a continuum that includes Tsaayi (tyi), Yaka or Yaa (iyx), and Tyee (tyx).
Alphabet
The set of characters used in the creation of transcriptions from audio recordings is as follows: a, b, bw, d, f, g, gw, h, i, k, kw, l, m, mb, mw, n, nd, ng, nk, nw, nz, o, p, s, t, u, v, w, y, z.
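Because the orthography may not be strictly standardized, it can help to flag characters in a transcription that fall outside the list above. The following is a minimal sketch under stated assumptions: the function name `unexpected_chars` is hypothetical, text is lowercased and NFC-normalized before comparison, and digraphs are treated as sequences of their component letters. Note that the sample transcriptions in this card contain letters such as e, r and c that are not in the list, so the list is best treated as a starting point, not a complete inventory.

```python
import unicodedata

# Inventory copied from the character list above. Digraphs such as "bw" or
# "nz" are built from single letters, so we compare at the character level.
INVENTORY = (
    "a, b, bw, d, f, g, gw, h, i, k, kw, l, m, mb, mw, n, nd, ng, nk, "
    "nw, nz, o, p, s, t, u, v, w, y, z"
)
ALLOWED_CHARS = {ch for unit in INVENTORY.split(", ") for ch in unit}


def unexpected_chars(text):
    """Return the sorted, de-duplicated characters of `text` not in the inventory."""
    normalized = unicodedata.normalize("NFC", text.lower())
    return sorted(
        {ch for ch in normalized if not ch.isspace() and ch not in ALLOWED_CHARS}
    )


print(unexpected_chars("efule ki ma ki"))
# prints ['e']
```

Running such a check over all mapping files gives a rough list of lines to review by hand; it does not by itself decide which spelling is correct.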
Source
This dataset was created from self-recorded audio by one male native speaker. The speaker then transcribed the recordings. This task aimed to produce data suitable for developing text-to-speech models for the Teke-Laali language. The speaker was guided through the process using open questions provided by the research coordinator.
Domain
The speech, recorded by a native speaker of Teke-Laali, covered a variety of topics relevant to the cultural practices of the Teke community. The speech mostly pertained to the following genres: procedural, opinion and philosophical.
Size
Total size is 635.61 MB.
Structure
7 folders containing 9,069 audio clips and totalling 7:01:50.126 (HH:MM:SS.mmm)
7 audio mapping files with 9,069 lines in total; each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt
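The mapping format described above (file name, tab, text) can be read with a few lines of Python. This is a minimal sketch, not an official loader: the function names are hypothetical, UTF-8 encoding is an assumption, and the card does not state the names or locations of the mapping files, so the path is left to the caller.

```python
def parse_mapping_lines(lines):
    """Parse 'name<TAB>text' lines into a list of (audio_name, text) pairs."""
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        name, separator, text = line.partition("\t")
        if not separator:
            # Surface malformed lines instead of silently guessing a split.
            raise ValueError(f"line {line_number}: no tab separator found")
        pairs.append((name.strip(), text.strip()))
    return pairs


def read_mapping_file(path):
    """Read one mapping file from disk (UTF-8 is assumed, not documented)."""
    with open(path, encoding="utf-8") as handle:
        return parse_mapping_lines(handle)


# Usage with two in-memory lines taken from the sample in this card.
example = ["0001.wav\tefule ki ma ki\n", "0188.wav\tmo kisse mo nde wa ba wa mokate\n"]
for audio_name, text in parse_mapping_lines(example):
    print(audio_name, "->", text)
```

Splitting only on the first tab keeps any later tabs inside the text field, and raising on a missing tab makes formatting problems visible early.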
Sample
0001.wav efule ki ma ki
0010.wav plat la ngoule mou mako kou nda ba beme
0021.wav ngoule wa bo la a ba wa mendele pe
0025.wav donc bou ba wole ngoule mo nde
0052.wav ngoule mo nde o boua sa ndine mo ngoua
0061.wav ngoule mo nde o bwoua lame boune bobo
0071.wav faut ko ba na mwoua bierre ha côté
0113.wav yo lia lo yebe tsime bio bussine ku bo la
0188.wav mo kisse mo nde wa ba wa mokate
0214.wav ba bwa yiro mone ye mou tsa nzo ye momo