Teke-Laali-TTS-Dataset

License:

NOODL-1.0

Steward:

Institute of African Digital Humanities

Task: TTS

Release Date: 12/16/2025

Format: WAV, TSV

Size: 635.61 MB

Description

The dataset contains paired audio and text resources for Teke-Laali, a Bantu language spoken in the Republic of Congo. It consists of seven folders containing a total of 9,069 audio clips cut from raw audio recordings, with a total duration of 7:01:50.126 (HH:MM:SS.mmm). Each folder is accompanied by an audio/text mapping file; the seven mapping files contain a total of 9,069 lines, one per clip. The dataset is suitable for TTS tasks.
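As a quick sanity check on the figures above, the total duration can be converted to hours and divided by the clip count to get a mean clip length. This is a minimal sketch using only the numbers stated in this card; the function name `hms_to_seconds` is illustrative and not part of the dataset.

```python
def hms_to_seconds(stamp):
    """Convert an 'H:MM:SS.mmm' string to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

TOTAL_DURATION = "7:01:50.126"  # total duration stated in this card
CLIP_COUNT = 9069               # number of clips stated in this card

total_seconds = hms_to_seconds(TOTAL_DURATION)
total_hours = total_seconds / 3600
mean_clip_seconds = total_seconds / CLIP_COUNT

print(f"total: {total_hours:.2f} h, mean clip: {mean_clip_seconds:.2f} s")
```

On these figures the clips are short, a little under three seconds each on average.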

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

Although efforts were made to transcribe the audio recordings consistently, the orthography may not be strictly standardized. The transcriber is a trained linguist but does not specialize in African linguistics. Users who wish to apply this dataset to TTS tasks may need to review the orthography and standardize it themselves.

Forbidden Usage

Generative AI, reproduction, duplication, modification, augmentation, copying, distribution, transmission, display, sale, transfer, publication or creation of derivative works without the explicit permission of the legal owner of the dataset.

Processes

Intended Use

The dataset is suitable for the following speech-related tasks:

  • Automatic speech recognition (ASR): The audio-text alignment allows speech recognition models for Teke-Laali to be trained or evaluated, which is valuable for language technology. Please be aware that the orthography used in the transcriptions may not be strictly standardized.

  • Text-to-speech (TTS): The dataset contains clean sentence-audio pairs read by one male speaker, totalling about 7 hours (7:01:50.126). This makes the dataset suitable for training and evaluating text-to-speech models. Again, be aware that the orthography used in the transcriptions may not be strictly standardized.

Metadata

Language

Teke-Laali (lli), also known as Ilaali, Laali, or West Teke, is a Bantu language spoken by the Teke people in the Republic of Congo. It belongs to the Niger-Congo family and is considered threatened because fewer young people are learning it, though adults in the community still use it as a first language in home settings.

Variants

Teke-Laali is part of a continuum that includes Tsaayi (tyi), Yaka or Yaa (iyx), and Tyee (tyx).

Alphabet

The set of characters used in the transcriptions of the audio recordings is as follows: a, b, bw, d, f, g, gw, h, i, k, kw, l, m, mb, mw, n, nd, ng, nk, nw, nz, o, p, s, t, u, v, w, y, z.
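Because the orthography may not be strictly standardized, it can help to scan transcriptions for characters that fall outside the inventory listed above. The following is a minimal sketch, not an official validation tool: the names `ALPHABET` and `unexpected_characters` are illustrative, and the check works on single characters only (every digraph in the list is built from letters that also appear on their own). A flagged character is not necessarily an error; the sample transcriptions in this card contain letters such as "e" and accented French letters that the list does not include.

```python
import unicodedata

# Character inventory exactly as listed in this card.
ALPHABET = [
    "a", "b", "bw", "d", "f", "g", "gw", "h", "i", "k", "kw", "l", "m",
    "mb", "mw", "n", "nd", "ng", "nk", "nw", "nz", "o", "p", "s", "t",
    "u", "v", "w", "y", "z",
]
# Single letters that occur anywhere in the inventory.
ALLOWED = set("".join(ALPHABET))

def unexpected_characters(text):
    """Return the sorted non-space characters of text not covered by ALLOWED."""
    normalized = unicodedata.normalize("NFC", text.lower())
    return sorted({ch for ch in normalized if not ch.isspace() and ch not in ALLOWED})

# Two transcriptions taken from the Sample section of this card.
print(unexpected_characters("efule ki ma ki"))
print(unexpected_characters("faut ko ba na mwoua bierre ha côté"))
```

Running such a scan over all mapping files gives a quick picture of which characters a TTS front end would have to handle beyond the listed set.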

Source

This dataset was created from self-recorded audio by one male native speaker. The speaker then transcribed the recordings. The aim was to produce a dataset suitable for developing text-to-speech models for the Teke-Laali language. The speaker was guided through the process by open questions provided by the research coordinator.

Domain

The speech, recorded by a native speaker of Teke-Laali, covered a variety of topics relevant to the cultural practices of the Teke community. The speech mostly pertained to the following genres: procedural, opinion and philosophical.

Size

Total size is 635.61 MB.

Structure

  • 7 folders containing 9,069 audio clips and totalling 7:01:50.126 (HH:MM:SS.mmm)

  • 7 audio mapping files with 9,069 lines, each beginning with the name of an audio file, followed by a tab and then the corresponding text excerpt
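The mapping-file layout described above (file name, tab, text) can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `read_mapping` is illustrative, the files are assumed to be UTF-8 with no header row, and the in-memory sample stands in for a real mapping file, whose actual name and location are not given in this card.

```python
import io

def read_mapping(handle):
    """Yield (audio_filename, text) pairs from a tab-separated mapping file."""
    for line_no, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so any later tab stays in the text.
        name, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"line {line_no}: no tab separator found")
        yield name, text.strip()

# Stand-in for a mapping file, built from two entries in the Sample section.
sample = io.StringIO(
    "0001.wav\tefule ki ma ki\n"
    "0010.wav\tplat la ngoule mou mako kou nda ba beme\n"
)
pairs = list(read_mapping(sample))
print(pairs[0])
```

For a file on disk, the same function can be given a handle opened with `open(path, encoding="utf-8")`, where `path` points to one of the seven mapping files.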

Sample

  1. 0001.wav efule ki ma ki

  2. 0010.wav plat la ngoule mou mako kou nda ba beme

  3. 0021.wav ngoule wa bo la a ba wa mendele pe

  4. 0025.wav donc bou ba wole ngoule mo nde

  5. 0052.wav ngoule mo nde o boua sa ndine mo ngoua

  6. 0061.wav ngoule mo nde o bwoua lame boune bobo

  7. 0071.wav faut ko ba na mwoua bierre ha côté

  8. 0113.wav yo lia lo yebe tsime bio bussine ku bo la

  9. 0188.wav mo kisse mo nde wa ba wa mokate

  10. 0214.wav ba bwa yiro mone ye mou tsa nzo ye momo