Bomitaba-TTS-Dataset
License:
NOODL-1.0
Steward:
Institute of African Digital Humanities
Task: TTS
Release Date: 12/12/2025
Format: WAV, TSV
Size: 1.00 GB
Description
The dataset comprises three components: audio clips, an audio mapping file, and raw audio of Bomitaba, a Bantu language spoken in the Republic of the Congo. Each audio clip is paired with its corresponding transcription. There are 2,613 transcribed audio clips, totalling 182 minutes and 4 seconds. There are two raw audio files totalling 121 minutes and 14.24 seconds. The audio mapping file contains 2,610 lines. Each line begins with the name of an audio file, followed by a tab, then the corresponding text excerpt.
Specifics
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Considerations
Restrictions/Special Constraints
Although efforts were made to transcribe the audio recordings consistently, the orthography may not be strictly standardised: the transcriber is a trained linguist but does not specialise in African linguistics. Users who wish to apply this dataset to TTS tasks may need to re-check the orthography to ensure it is standardised.
Forbidden Usage
Generative AI, reproduction, duplication, modification, augmentation, copying, distribution, transmission, display, sale, transfer, publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Processes
Intended Use
The dataset is suitable for speech-related tasks.
- Automatic speech recognition (ASR): The audio–text alignment allows speech recognition models for Bomitaba to be trained or evaluated, which makes the dataset a valuable resource for language technology. Please be aware that the orthography used in the transcription of audio recordings may not be strictly standardised.
- Text-to-speech (TTS): The dataset contains clean sentence–audio pairs read by the same speaker, totalling 3.0 hours. This makes the dataset suitable for evaluating text-to-speech models. Again, be aware that the orthography used in the transcription of audio recordings may not be strictly standardised.
Metadata
Language
Bomitaba (zmx) is an endangered Bantu language of the Niger-Congo family. It is spoken by the Bomitaba people, primarily in the Likouala region of the Republic of the Congo, with some speakers in the Central African Republic. It is a tonal language used in daily life, but it faces pressure from French and Lingala.
Variants
Northern (Matoki) and Central (Epena)
Alphabet
The alphabet used in the transcription of audio resources in this dataset is as follows:
Vowels: a e i o u (+ á à é è í ì ó ò ú ù)
Consonants: b d f g h k l m n p s t v w y z ch dj dz gb kg mb mf mp mv nd ng nk nt nz ndz ngb tch
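Since the orthography may not be strictly standardised, one way to check a transcription is to segment each word against the inventory above and flag any character that falls outside it. The sketch below is only an illustration: the greedy longest-match strategy, the NFC normalisation step, and the function name `segment` are assumptions, not part of the dataset or of how it was produced.

```python
import unicodedata

# Inventory copied from the alphabet listed above; NFC keeps accented vowels as single characters.
VOWELS = list(unicodedata.normalize("NFC", "aeiouáàéèíìóòúù"))
CONSONANTS = (
    "b d f g h k l m n p s t v w y z ch dj dz gb kg mb mf mp mv "
    "nd ng nk nt nz ndz ngb tch"
).split()

# Longest units first, so that "ndz" is tried before "nd" and "tch" before "t".
UNITS = sorted(set(VOWELS + CONSONANTS), key=len, reverse=True)


def segment(word):
    """Split one word into inventory units (greedy longest match, an assumed strategy).

    Returns (units, unknown), where `unknown` lists characters not found in the inventory.
    """
    word = unicodedata.normalize("NFC", word.lower())
    units, unknown = [], []
    i = 0
    while i < len(word):
        for unit in UNITS:
            if word.startswith(unit, i):
                units.append(unit)
                i += len(unit)
                break
        else:
            # No inventory unit starts here: keep the character and report it.
            units.append(word[i])
            unknown.append(word[i])
            i += 1
    return units, unknown
```

The function expects a single word, so a transcription line would first be split on whitespace, for example `[segment(w) for w in text.split()]`. A non-empty `unknown` list marks a word worth reviewing by hand; it does not by itself prove a transcription error.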
Source
This dataset was created from audio that a male native speaker recorded himself. The speaker was guided through the process using open questions provided by the research coordinator, and then transcribed the recordings. The aim of this task was to produce datasets suitable for developing text-to-speech models for the Bomitaba language.
Domain
The questions that prompted the recorded speech covered a variety of domains relevant to the cultural practices of the Bomitaba community. The responses fall mostly into the following genres: procedural, opinion and philosophical.
Size
Total size is 1.00 GB
Structure
The audio corpus consists of 2,613 transcribed audio clips, totalling 182 minutes and 4 seconds; two raw audio files, totalling 121 minutes and 14.24 seconds; and a mapping file of audio and text with 2,610 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt.
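The stated totals can be checked against a local copy of the clips. The following is a minimal sketch using Python's standard `wave` module. It assumes the clips are uncompressed PCM WAV files sitting in one folder; the folder layout and the function names are hypothetical and not documented by the dataset.

```python
import wave
from pathlib import Path


def wav_seconds(path):
    """Duration of one PCM WAV file in seconds (frames divided by sample rate)."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_seconds(folder):
    """Sum the durations of every .wav file directly inside `folder` (assumed layout)."""
    return sum(wav_seconds(p) for p in sorted(Path(folder).glob("*.wav")))


def as_minutes_seconds(seconds):
    """Convert a duration in seconds to a (minutes, seconds) pair."""
    minutes, rest = divmod(seconds, 60)
    return int(minutes), round(rest, 2)
```

For the clip folder, `as_minutes_seconds(total_seconds(folder))` would be expected to come out close to 182 minutes and 4 seconds if the local copy is complete.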
Sample
Bomitaba-TTS-Dataset_01_T822_T823.wav epaténdina edzi epata yoko yefuta boso
Bomitaba-TTS-Dataset_01_T823_T824.wav (silence)
Bomitaba-TTS-Dataset_01_T825_T826.wav bangwayènè mawo ebelé belé tché lobo badzonga efè na bomo
Bomitaba-TTS-Dataset_01_T826_T827.wav tché hum tsila mawo mana tsila mawo mana madzi mabé ta midzi mibé
Bomitaba-TTS-Dataset_01_T827_T828.wav nabaka bato lewoko abanga tché lè lè lè
Bomitaba-TTS-Dataset_01_T831_T832.wav èyènè babotsi bawana wadzi kwa wana wadzi futé pata dza lanè bota bana
Bomitaba-TTS-Dataset_01_T833_T834.wav bobo yèni tché koté yikoka bona tikésala enga ipaténa lekokani mbé yakwa mwana nga niki
Bomitaba-TTS-Dataset_01_T834_T835.wav ibanda bobote ibanda booo
Bomitaba-TTS-Dataset_01_T836_T837.wav obatelé ikomba dzawa
Bomitaba-TTS-Dataset_01_T838_T839.wav wadzi menwetchi bito wana no ndaku
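A mapping file in this layout can be read with a few lines of Python. The sketch below assumes UTF-8 text and a single tab between the file name and the excerpt, as described under Structure. Treating `(silence)` as a non-speech marker to skip is an assumption based only on the sample above, and the name `parse_mapping` is hypothetical.

```python
def parse_mapping(text, skip_silence=True):
    """Parse mapping-file text into a list of (audio_file_name, transcript) pairs."""
    pairs = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        name, sep, transcript = line.partition("\t")
        if not sep:
            raise ValueError(f"line {line_number}: no tab separator found")
        transcript = transcript.strip()
        # "(silence)" appears in the sample; skipping it is an assumed convention.
        if skip_silence and transcript == "(silence)":
            continue
        pairs.append((name.strip(), transcript))
    return pairs


# Two lines taken from the sample, written here with an explicit tab.
sample = (
    "Bomitaba-TTS-Dataset_01_T822_T823.wav\tepaténdina edzi epata yoko yefuta boso\n"
    "Bomitaba-TTS-Dataset_01_T823_T824.wav\t(silence)\n"
)
pairs = parse_mapping(sample)
```

To read from disk, the file's contents can be passed in with `Path(mapping_path).read_text(encoding="utf-8")`, where `mapping_path` is whatever name the mapping file has in the downloaded archive.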