Yaka-TTS-Dataset
License:
NOODL-1.0
Steward:
Institute of African Digital Humanities
Task: TTS
Release Date: 12/10/2025
Format: WAV, TSV
Size: 1.26 GB
Description
Paired audio and text data in Yaka (also known as West Teke), a language spoken in the Republic of Congo. The audio corpus consists of 7,648 clips read by one speaker, for a total duration of 344 min 40.48 sec. The dataset also contains a mapping file of audio and text with 7,648 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt. This dataset is suitable for TTS tasks.
Specifics
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Considerations
Restrictions/Special Constraints
Although the transcription of the audio recording was created using a standardised writing system, it may not reflect the standard used by the wider Yaka community. This is often the case in communities using a low-resource language, where different writing standards may be in use. Therefore, this should be taken into account when using this dataset for TTS and ASR tasks. Ideally, the resulting TTS or ASR models should explicitly state which writing system was used.
Forbidden Usage
Generative AI, reproduction, duplication, modification, augmentation, copying, distribution, transmission, display, sale, transfer, publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Processes
Intended Use
The dataset is suitable for speech-related tasks.
- Automatic speech recognition (ASR): Audio–text alignment allows the training or evaluation of speech recognition models for Yaka, which is a valuable tool for language technology. Please be aware that the read sentences are written in a writing standard, which may co-exist with other writing standards within the Yaka community.
- Text-to-speech (TTS): The dataset contains clean sentence–audio pairs read by the same speaker, totalling 5.7 hours. This makes the dataset suitable for training or evaluating text-to-speech models. Again, note that the writing standard may be just one of several in use within the community.
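The 5.7-hour figure can be cross-checked against the totals quoted in the Description (7,648 clips, 344 min 40.48 sec). The short sketch below only does that arithmetic and derives a mean clip length. The variable names are illustrative, and the mean is computed here from the published totals, not reported by the dataset itself.

```python
# Figures quoted in the Description: 7,648 clips, 344 min 40.48 sec in total.
TOTAL_MINUTES = 344
TOTAL_SECONDS = 40.48
NUM_CLIPS = 7648

# Convert the total duration to seconds, then to hours.
total_seconds = TOTAL_MINUTES * 60 + TOTAL_SECONDS
total_hours = total_seconds / 3600

# Derived value: average clip length, assuming the totals above are exact.
mean_clip_seconds = total_seconds / NUM_CLIPS

print(f"total: {total_hours:.2f} h, mean clip: {mean_clip_seconds:.2f} s")
```

The total comes to roughly 5.74 hours, consistent with the rounded 5.7 hours, and the mean clip lasts a little under three seconds.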
Metadata
Language
Yaka is spoken by 36,000 people (Ethnologue 2018) living in the Sibiti district of the Lekoumou department of the Republic of Congo. The language is classified as Niger-Congo, Atlantic-Congo, Volta-Congo, Benue-Congo, Bantoid, Southern, Narrow Bantu, Northwest, B. Teke (B.73). The ISO code is [iyx]. Alternate names include lyaa, lyaka, West Teke, and Yaa.
Variants
Yaka shares lexical similarity with several other languages: 91% with Laali [lli], 74% with Tsaayi [tyi], and 69% with Tyee [tyx] (SIL-Congo 2022).
Alphabet
The alphabet used to create transcriptions of audio files that constitute this dataset is Latin-based. It contains the following characters: a b c d e f g h i j k l m n ŋ o p r s t u v w y z á à é è ó ò
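As a rough illustration of how the listed character set might be used, the sketch below flags letters in a transcription that fall outside it. Several choices here are assumptions, not documented properties of the dataset: text is NFC-normalised and lowercased before comparison, and whitespace, digits and punctuation are ignored. The function name `unexpected_characters` is hypothetical.

```python
import unicodedata

# Characters listed in the Alphabet section (lowercase forms only).
ALPHABET = set("abcdefghijklmnŋoprstuvwyzáàéèóò")


def unexpected_characters(text: str) -> set:
    """Return the letters in `text` that are not in the listed alphabet.

    Assumptions (not stated in the dataset card): comparison happens after
    NFC normalisation and lowercasing; non-letters are ignored.
    """
    normalised = unicodedata.normalize("NFC", text).lower()
    return {ch for ch in normalised if ch.isalpha() and ch not in ALPHABET}


# Sample lines from this card pass the check; "q" and "x" are not listed.
print(unexpected_characters("ŋa diwole béne yo mu yuu,"))
print(sorted(unexpected_characters("qux")))
```

A check like this can help spot transcription lines that use a different writing convention before they reach a TTS or ASR training pipeline.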
Source
This dataset was created from audio that a male native speaker recorded himself. The speaker then transcribed the recordings. This task aimed to produce datasets suitable for developing text-to-speech models for the Yaka language. The speaker was guided through the process using open questions provided by the research coordinator.
Domain
The questions that prompted the speech recorded by the native speaker of Yaka covered a variety of domains relevant to the cultural practices of the Yaka community. The responses fall mostly into the following genres: procedural, opinion and philosophical.
Size
Total size is 1.26 GB.
Structure
The dataset contains 7,648 audio clips and a mapping file of audio and text with 7,648 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt.
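A minimal sketch of reading the mapping file as described: each line is split at the first tab into a filename and its text. The function name `parse_mapping` and the path `mapping.tsv` in the usage comment are hypothetical; the card does not name the mapping file. UTF-8 encoding is also an assumption.

```python
def parse_mapping(lines):
    """Parse mapping lines into (audio_filename, text) pairs.

    Each line is expected to be: <audio filename><TAB><text excerpt>.
    Blank lines are skipped; a line without a tab raises ValueError.
    """
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split at the first tab only, so any later tabs stay in the text.
        filename, tab, text = line.partition("\t")
        if not tab:
            raise ValueError(f"line {line_number}: no tab separator found")
        pairs.append((filename, text.strip()))
    return pairs


# Usage with an in-memory line; for a real file one might write
# (hypothetical path, assumed UTF-8):
#   with open("mapping.tsv", encoding="utf-8") as handle:
#       pairs = parse_mapping(handle)
example = ["Yaka_TTS_01_T0_T1.wav\tBaanabo, mabwé mabéné !\n"]
print(parse_mapping(example))
```

Comparing `len(pairs)` with the documented 7,648 lines is a quick sanity check after loading.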
Sample
Yaka_TTS_01_T0_T1.wav Baanabo, mabwé mabéné !
Yaka_TTS_01_T1_T2.wav Béne bwise bu taangue yisii ki,
Yaka_TTS_01_T2_T3.wav ŋa diwole béne yo mu yuu,
Yaka_TTS_01_T3_T4.wav Diwole ntangue dia béne duse.
Yaka_TTS_01_T4_T5.wav (((Silence)))
Yaka_TTS_01_T5_T6.wav Mu ntsa kète mu
Yaka_TTS_01_T6_T7.wav Mé sa ntsuu béne mandaa malaa,
Yaka_TTS_01_T7_T8.wav Mandaa mavulule
Yaka_TTS_01_T8_T9.wav Muu sa béne ti diyawe mandaa muu yo kwa balèlé.
Yaka_TTS_01_T9_T10.wav Baanabo, dibanu ati.
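The sample includes an entry whose text is the placeholder (((Silence))) instead of a sentence. For TTS training, such entries may need to be set aside. The sketch below separates them from spoken clips. It assumes the placeholder is spelled identically throughout the mapping file, which this card does not confirm; the function name `split_speech_and_silence` is hypothetical.

```python
# Literal copied from the sample; assumed to be the only non-speech marker.
SILENCE_MARKER = "(((Silence)))"


def split_speech_and_silence(pairs):
    """Split (filename, text) pairs into spoken clips and silence placeholders."""
    speech, silence = [], []
    for filename, text in pairs:
        if text.strip() == SILENCE_MARKER:
            silence.append((filename, text))
        else:
            speech.append((filename, text))
    return speech, silence


# Two entries taken from the sample.
sample_pairs = [
    ("Yaka_TTS_01_T3_T4.wav", "Diwole ntangue dia béne duse."),
    ("Yaka_TTS_01_T4_T5.wav", "(((Silence)))"),
]
speech, silence = split_speech_and_silence(sample_pairs)
print(len(speech), len(silence))
```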