Common Voice Scripted Speech 23.0 - Tuki

Locale: bag

Size: 218.04 MB

Task: ASR

Format: MP3

License: CC-0


Tukí — Tuki (bag)

This datasheet is for version 23.0 of the Mozilla Common Voice Scripted Speech dataset for Tuki (bag). The dataset contains 12 hours of recorded speech (12 hours validated) from 14 speakers.

Language

Tuki is an indigenous language of Cameroon. It belongs to the Niger-Congo language family. According to Ethnologue, the vitality status of Tuki is stable, and the language is used as a first language by everyone in the ethnic community. However, no recent study confirms this. Indigenous languages in Cameroon and other parts of Africa show a general decline in vitality, driven by factors such as rural exodus, the shift to colonial languages such as French, and language policy, so it is more likely that the vitality of Tuki is threatened.

Variants

The Administrative Atlas of Cameroon's Languages (Breton and Bikia Fohtung 1993) lists 6 dialects of Tuki:

  • Tungoro spoken by the Aki around the Ngoro Subdivision

  • Tukombe spoken by the Kombe (or Bakombe)

  • Tonjo spoken by the Nju (or Bunju)

  • Tutsingo spoken by the Tsingo (or Batsingo)

  • Tocenka spoken by the Tiki also known as Bacenga

  • Tumbele spoken by the Mbele (also known as Bambele)

Of the 6 dialects, two are represented in the dataset, namely Tukombe and Tutsingo.

Demographic information

The dataset includes the following distribution of age and gender.

Gender

Self-declared gender information. Each percentage is the share of clips annotated with that gender.

Gender                Percentage
Undefined             84.0%
Female Feminine       2.0%
Do Not Wish To Say    14.0%
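
The percentages above are per-clip shares, not per-speaker shares. As a minimal sketch of that calculation, assuming the clip metadata is available as one row per clip with a `gender` field: the column name, the label strings, and the toy clip counts below are illustrative assumptions, not values taken from this dataset.

```python
from collections import Counter


def label_percentages(rows, column):
    """Return {label: percentage of clips} for one metadata column.

    Rows with a missing or empty value are counted as "Undefined".
    """
    counts = Counter(
        (row.get(column) or "").strip() or "Undefined" for row in rows
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


# Hypothetical toy rows (50 clips), NOT the real clip counts of the dataset.
clips = (
    [{"gender": ""}] * 42
    + [{"gender": "female_feminine"}] * 1
    + [{"gender": "do_not_wish_to_say"}] * 7
)

shares = label_percentages(clips, "gender")
for label, pct in shares.items():
    print(f"{label}: {pct}%")
```

Because the shares are computed over clips, a single prolific speaker who declines to state a gender can account for a large percentage. The same function could be applied to an age column.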

Age

Self-declared age information. Each percentage is the share of clips annotated with that age band.

Age Band     Percentage
Undefined    84.0%
Thirties     2.0%
Forties      14.0%

Writing system

The collection of sentence prompts provided by the language representatives follows the General Alphabet of Cameroonian Languages.

Sample

There follows a randomly selected sample of five sentences from the corpus.

  1. Wa arimbana nà mbutu râ ambōh râmê.

  2. Ku ibirichi indjê, ngu ubamú arandissaki amê i tôngô.

  3. Allô ; mbérénô kǔkú ukú uzu, atê i tà tumba ?

  4. Warrôndô wa mà su mitsa tu bia mbîmbà râā nà yêndzichina.

  5. A mà zu tsaka wingāna wussêkêrê.

Automatic random samples

Wa mà sú saa na ibiana.
Umuênê Waringa a mà kutu djï atsôra matuwa.
Wa tirimiya wuwanduwôrrô, wubindiya, wutsakeya na wubari râ mukú mà wurrôndô nà adôngô nà wuchiô wa wussi wa MUTU.
Indi mutu a timbāna nà arandissaki wussi, a yānamú udjï bênêbê nà imbîngô ya mbérénô.
Mbétékénô zê udzanamú itéka manônōh má wenga.

Community links

Contribute

https://commonvoice.mozilla.org/bag

Acknowledgements

This dataset was compiled during a data camp organized in Yaoundé (Cameroon) in May 2025. Two main contributors were involved in localizing the MCV interface for Tuki, gathering the sentence prompts, reading them, and validating recordings. They are:

  • Jean-Louis Aimé Mbataka

  • Marguérite Flore Ndjana

The data camp was organized by a dynamic team whose dedication is hereby acknowledged:

  • Dr. Florus Landry Dibenge (Project Lead)

  • Eliette Emilie-Caroline Ngo Tjomb Assembe

  • Eric Koung

  • Blaise Mathieu Banum Manguele

  • Martial Brice Antangana Eloundou

  • Emmanuel Giovanni Eloundou Eyenga

Datasheet authors

  • Emmanuel Ngue Um (ngueum@gmail.com)

  • Jean-Louis Aimé Mbataka (jlam1709@gmail.com)

  • Marguérite Flore Ndjana (ndjanamarguerite@gmail.com)

Funding

The organization of the data camp that led to the compilation of this dataset was made possible by a grant from the Mozilla Foundation under the Open Multilingual Speech Fund (OMSF).

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.