Common Voice Scripted Speech 23.0 - Tuki
Locale: bag
Size: 218.04 MB
Task: ASR
Format: MP3
License: CC-0
Tukí - Tuki (bag)
This datasheet is for version 23.0 of the Mozilla Common Voice Scripted Speech dataset
for Tuki (bag). The dataset contains 12 hours of recorded
speech (12 hours validated) from 14 speakers.
Language
Tuki is an indigenous language of Cameroon. It belongs to the Niger-Congo language family. According to Ethnologue, the vitality status of Tuki is stable, and the language is used as a first language by everyone in the ethnic community. However, no recent study confirms this. Given the general negative trend in the vitality of indigenous languages in Cameroon and other parts of Africa, driven by factors such as rural exodus, the shift to colonial languages such as French, and language policy, it is more likely that the vitality of Tuki is threatened.
Variants
The Administrative Atlas of Cameroon's Languages (Breton and Bikia Fohtung 1993) lists 6 dialects of Tuki:
Tungoro spoken by the Aki around the Ngoro Subdivision
Tukombe spoken by the Kombe (or Bakombe)
Tonjo spoken by the Nju (or Bunju)
Tutsingo spoken by the Tsingo (or Batsingo)
Tocenka spoken by the Tiki also known as Bacenga
Tumbele spoken by the Mbele (also known as Bambele)
Of the 6 dialects, two are represented in the dataset, namely Tukombe and Tutsingo.
Demographic information
The dataset includes the following distribution of age and gender.
Gender
Self-declared gender information. Percentages refer to the share of clips annotated with each gender.
Gender | Percentage |
---|---|
Undefined | 84.0% |
Female Feminine | 2.0% |
Do Not Wish To Say | 14.0% |
Age
Self-declared age information. Percentages refer to the share of clips annotated with each age band.
Age Band | Percentage |
---|---|
Undefined | 84.0% |
Thirties | 2.0% |
Forties | 14.0% |
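The per-clip percentages in the Gender and Age tables can in principle be recomputed from clip-level metadata. The following is a minimal Python sketch, not the procedure used to produce this datasheet. It assumes one self-declared label per clip, with empty values reported as "Undefined". The function name `label_percentages`, the label strings, and the 50-clip example list are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def label_percentages(labels):
    """Return {label: percentage of clips}, rounded to one decimal place."""
    # Assumption: an empty or missing label is reported as "Undefined".
    counts = Counter(label if label else "Undefined" for label in labels)
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


# Hypothetical per-clip gender labels for 50 clips (illustrative, not real data).
example_gender = [""] * 42 + ["female_feminine"] * 1 + ["do_not_wish_to_say"] * 7

print(label_percentages(example_gender))
```

Because the percentages are computed per clip rather than per speaker, a few prolific speakers can dominate the distribution.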
Writing system
The collection of sentence prompts provided by the language representatives aligns with the General Alphabet of Cameroonian Languages.
Sample
There follows a randomly selected sample of five sentences from the corpus.
Wa arimbana nà mbutu râ ambōh râmê.
Ku ibirichi indjê, ngu ubamú arandissaki amê i tôngô.
Allô ; mbérénô kǔkú ukú uzu, atê i tà tumba ?
Warrôndô wa mà su mitsa tu bia mbîmbà râā nà yêndzichina.
A mà zu tsaka wingāna wussêkêrê.
Automatic random samples
Wa mà sú saa na ibiana.
Umuênê Waringa a mà kutu djï atsôra matuwa.
Wa tirimiya wuwanduwôrrô, wubindiya, wutsakeya na wubari râ mukú mà wurrôndô nà adôngô nà wuchiô wa wussi wa MUTU.
Indi mutu a timbāna nà arandissaki wussi, a yānamú udjï bênêbê nà imbîngô ya mbérénô.
Mbétékénô zê udzanamú itéka manônōh má wenga.
Community links
Contribute
https://commonvoice.mozilla.org/bag
Acknowledgements
The compilation of this dataset occurred during a data camp organized in Yaoundé (Cameroon) in May 2025. Two main contributors were involved in localizing the MCV interface for Tuki, gathering the sentence prompts, reading them, and validating recordings. They are:
Jean-Louis Aimé Mbataka
Marguérite Flore Ndjana
The data camp was organized by a dynamic team whose dedication is hereby acknowledged:
Dr. Florus Landry Dibenge (Project Lead)
Eliette Emilie-Caroline Ngo Tjomb Assembe
Eric Koung
Blaise Mathieu Banum Manguele
Martial Brice Antangana Eloundou
Emmanuel Giovanni Eloundou Eyenga
Datasheet authors
Emmanuel Ngue Um ngueum@gmail.com
Jean-Louis Aimé Mbataka jlam1709@gmail.com
Marguérite Flore Ndjana ndjanamarguerite@gmail.com
Funding
The organization of the data camp that led to the compilation of this dataset was made possible by a grant from the Mozilla Foundation under the Open Multilingual Speech Fund (OMSF).
Licence
This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data, you agree not to determine the identity of speakers in the dataset.