Bati-MultiDialectalASR-Dataset

License:

NOODL-1.0

Steward:

Institute of African Digital Humanities

Task: ASR

Release Date: 12/16/2025

Format: WAV, TSV

Size: 3.27 GB

Description

This dataset contains paired audio and text resources for three Bati dialects (Kelleng, Mbougue, and Nyambat), which belong to the Yambasa group of Bantu languages found in Cameroon. It contains 13,344 audio clips totalling 6 hours, 8 minutes and 12.286 seconds and 44 audio/text mapping files totalling 13,346 lines. Due to its cross-dialectal nature, the dataset is suitable for multilingual automatic speech recognition tasks.
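For a rough sense of scale, the figures quoted above can be converted into total seconds and a mean clip length. The short Python sketch below uses only the numbers stated in this description; the variable names are illustrative.

```python
# Figures quoted in the description above.
n_clips = 13_344
hours, minutes, seconds = 6, 8, 12.286

# Convert the stated duration to seconds, then derive hours and mean clip length.
total_seconds = hours * 3600 + minutes * 60 + seconds
total_hours = total_seconds / 3600
mean_clip_seconds = total_seconds / n_clips

print(f"total duration: {total_seconds:.3f} s ({total_hours:.2f} h)")
print(f"mean clip length: {mean_clip_seconds:.2f} s")
```

The clips are therefore short, at well under two seconds on average, which may matter when choosing padding or batching strategies.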

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

This dataset contains multilingual audio clips and text excerpts that correspond to the audio recordings. These resources are derived from audio recordings of spontaneous (Kelleng) and elicited (Mbougue and Nyambat) speech. The texts are original in that they reflect a complex set of language repertoires typical of rural multilingualism in African settings. The recordings in Kelleng, Mbougue and Nyambat are mixed with Basaa speech, reflecting the researcher's analysis language. The author of this dataset is a native Basaa speaker. He used Basaa when interacting with Bati speakers. The Bati speakers responded in either Kelleng, Mbougue, Nyambat or Basaa.

Forbidden Usage

Generative AI, reproduction, duplication, modification, augmentation, copying, distribution, transmission, display, sale, transfer, publication or creation of derivative works without the explicit permission of the legal owner of the dataset.

Processes

Intended Use

The dataset is suitable for speech-related tasks.

- Automatic speech recognition (ASR): Audio–text alignment allows the training or evaluation of speech recognition models for Bati, which is valuable for language technology. Please note that the orthography used in the transcription of audio recordings is largely phonetic, although a few mapping files reflect a fully Latin-based writing system.
- Text-to-speech (TTS): The dataset contains clean text–audio pairs from different Bati speakers and casual Basaa speech, totalling 6 hours and 8 minutes. This makes the dataset suitable for training and evaluating text-to-speech models. However, please note that the orthography used in the transcription of audio recordings is not uniform.

Metadata

Language

Bati (btc) is a language of Cameroon belonging to the Yambasa subgroup of Guthrie’s (1971) Bantu A classification. It is spoken in three villages (Kelleng, Mbougue and Nyambat) with a very limited speaker population. Estimates place the number of speakers at around 800, although the resident population of the three Bati-speaking villages was fewer than 200 individuals as of 2019. The language is under-documented and faces strong pressure from dominant regional languages.

Variants

The name Bati functions as a cover term for three related varieties: Kelleng, Mbougue, and Nyambat. Depending on the criteria adopted, these may be analyzed either as dialects of a single language or as distinct languages. While the Administrative Atlas of Cameroonian Languages groups them together as dialects based largely on lexical similarity, speakers themselves clearly identify Kelleng, Mbougue, and Nyambat as separate languages.

Alphabet

The alphabet is not uniform across the text resources of this dataset. Most text is transcribed using the International Phonetic Alphabet (IPA), with the following set of characters: a, aː, ã, b, d, e, eː, ɛ, ɛː, ɛ̃, ə, f, g, h, i, iː, ĩ, j, k, kʷ, l, m, mb, n, nd, ɲ, ŋ, ŋg, ŋʷ, o, oː, ɔ, ɔː, ɔ̃, p, r, s, t, u, uː, ũ, w, z, β, ɣ, χ, ʔ, ʧ, ʤ

However, a segment of the text is transcribed using a simplified Latin alphabet. This applies to the following mapping files in the folder Kelleng, subfolder HIST_03:

audio_mapping_BTC_MDP0332_Kell_HIST03_6-12_2017-06-14.tsv
audio_mapping_BTC_MDP0332_Kell_HIST03_7-12_2017-06-14.tsv
audio_mapping_BTC_MDP0332_Kell_HIST03_8-12_2017-06-14.tsv
audio_mapping_BTC_MDP0332_Kell_HIST03_9-12_2017-06-14.tsv
audio_mapping_BTC_MDP0332_Kell_HIST03_10-12_2017-06-14
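Because the two scripts are not interchangeable, it may help to tag each mapping file with its script before mixing transcriptions. The Python sketch below is one possible way to do this: it matches a file name against the five names listed above and treats every other mapping file as IPA-based. The function name `script_of` and the labels "latin" and "ipa" are illustrative, and the `.tsv` extension is treated as optional because the last name in the list is given without one.

```python
from pathlib import PurePath

# Stems of the five Latin-script mapping files listed above (parts 6 to 10).
LATIN_STEMS = {
    f"audio_mapping_BTC_MDP0332_Kell_HIST03_{part}-12_2017-06-14"
    for part in range(6, 11)
}


def script_of(mapping_file):
    """Return 'latin' for the listed Latin-script files, 'ipa' otherwise."""
    name = PurePath(mapping_file).name
    # Drop a trailing ".tsv" if present; the extension is treated as optional.
    stem = name[:-4] if name.lower().endswith(".tsv") else name
    return "latin" if stem in LATIN_STEMS else "ipa"
```

A call such as `script_of(path)` can then be used to route each file to the matching character inventory.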

The list of graphemes used for the Latin-based script is as follows: a, aa, b, bw, d, dj, e, é, ee, f, g, h, i, ii, k, kw, l, m, mb, mw, n, nd, ng, nk, nt, nj, ny, o, ô, p, pw, r, s, t, tj, ts, w, y.
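Both inventories contain multi-character units (for example kʷ, mb and ŋg in the phonetic set, and tj, ny and ee in the Latin set), so a per-character split would break them apart. The following Python sketch shows one way to segment a transcription by greedy longest match. It rests on several assumptions that are not stated in this card: the inventories are lower-cased, tone in the phonetic transcriptions is written with combining acute, grave, circumflex or caron marks and can be dropped for segmentation, and any character outside the inventory is kept as its own token. The names `segment`, `prepare` and `TONE_MARKS` are illustrative.

```python
import unicodedata

# Inventories copied from the Alphabet section (lower-cased here).
IPA_INVENTORY = (
    "a aː ã b d e eː ɛ ɛː ɛ̃ ə f g h i iː ĩ j k kʷ l m mb n nd ɲ ŋ ŋg ŋʷ "
    "o oː ɔ ɔː ɔ̃ p r s t u uː ũ w z β ɣ χ ʔ ʧ ʤ"
).split()
LATIN_INVENTORY = (
    "a aa b bw d dj e é ee f g h i ii k kw l m mb mw n nd ng nk nt nj ny "
    "o ô p pw r s t tj ts w y"
).split()

# Assumption: these combining marks encode tone in the IPA transcriptions.
TONE_MARKS = frozenset("\u0300\u0301\u0302\u030c")


def prepare(text, drop=frozenset()):
    """Lower-case, decompose (NFD) and remove the combining marks in `drop`."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if ch not in drop)


def _is_mark(text, k):
    """True if position k holds a combining mark."""
    return k < len(text) and unicodedata.combining(text[k]) != 0


def segment(word, inventory, drop=frozenset()):
    """Split `word` into inventory units, trying longer units first."""
    units = sorted({prepare(u, drop) for u in inventory}, key=len, reverse=True)
    text = prepare(word, drop)
    tokens, i = [], 0
    while i < len(text):
        end = None
        for unit in units:
            j = i + len(unit)
            # Reject a match that would cut a base letter off from its marks.
            if text.startswith(unit, i) and not _is_mark(text, j):
                end = j
                break
        if end is None:
            # Unknown character: keep it, with its combining marks, as one token.
            end = i + 1
            while _is_mark(text, end):
                end += 1
        tokens.append(unicodedata.normalize("NFC", text[i:end]))
        i = end
    return tokens
```

For phonetic text the call would be `segment(word, IPA_INVENTORY, drop=TONE_MARKS)`; for Latin-script text, `segment(word, LATIN_INVENTORY)` keeps é and ô intact because no marks are dropped. Whether tone should really be discarded depends on the downstream task, so `drop` is left as a parameter.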

Source

This dataset re-uses the audio recordings and ELAN files compiled for the Documentation of the Bati Language and Oral Traditions project. The resulting corpus has been archived with the Endangered Languages Archive (ELAR). The author of this dataset received funding from the Major Grants scheme of the Endangered Languages Documentation Programme (ELDP) (project MDP0332); the corpus produced under that grant is the source of the materials extracted for this dataset.

Domain

This dataset contains audio material covering several domains, including the history and cosmogony of the Bati people, their social organisation, narratives and myths, as well as grammar elicitation.

Size

Total size is 3.27 GB.

Structure

This dataset comprises two types of resource: audio clips (.WAV) and text. The audio clips are contained in separate folders, and the text is contained in tab-separated (TSV) audio/text mapping files. The three main folders are named Kelleng, Mbougue and Nyambat. Each of these folders contains subfolders whose names refer to a particular domain of language use. For example, 'HIST' refers to historical life and 'Gramm_elicit' refers to grammar elicitation. The lowest level of the folder structure contains the folders that hold the audio files. These lowest-level folders are named after the original audio and ELAN files from which they were generated. The dataset consists of 13,344 audio clips totalling 6 hours, 8 minutes and 12.286 seconds, as well as 44 audio/text mapping files totalling 13,346 lines.
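The layout described above can be checked after download with a short script. The Python sketch below walks the three dialect folders and counts audio clips, mapping files and non-empty mapping lines. It makes several assumptions: the dataset root is passed in by the caller, mapping files carry a `.tsv` extension and are UTF-8 encoded, and audio files end in `.wav` in either case. The column layout of the mapping files is not documented here, so the sketch only counts lines and does not parse them. The function name `summarize` is illustrative.

```python
from pathlib import Path

# Top-level folder names as given in the Structure section.
DIALECT_FOLDERS = ("Kelleng", "Mbougue", "Nyambat")


def summarize(root):
    """Count clips, mapping files and mapping lines under each dialect folder."""
    root = Path(root)
    summary = {}
    for dialect in DIALECT_FOLDERS:
        base = root / dialect
        # Search recursively, since clips sit at the lowest folder level.
        files = [p for p in base.rglob("*") if p.is_file()] if base.is_dir() else []
        clips = [p for p in files if p.suffix.lower() == ".wav"]
        mappings = [p for p in files if p.suffix.lower() == ".tsv"]
        lines = 0
        for mapping in mappings:
            # Assumption: mapping files are UTF-8 text.
            with mapping.open(encoding="utf-8") as handle:
                lines += sum(1 for line in handle if line.strip())
        summary[dialect] = {
            "clips": len(clips),
            "mapping_files": len(mappings),
            "mapping_lines": lines,
        }
    return summary
```

Summing the per-dialect counts and comparing them with the totals stated above (13,344 clips, 44 mapping files, 13,346 lines) gives a quick integrity check on a local copy.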

Sample

Kelleng Dialect

International Phonetic Alphabet-based text

kìpós ʧɛ́ ʧé ʧâ jálêk
ʧé mú à mɛ́ní ʧîn ʧɛ̀
mì sòmbláχ lɛ́ ↓mì ↑ɓàr màmá nù à ŋ̀gwéénɛ́ ná nú à jè ɓěh βìsú
jɔ́n mɛ̀ ɓàráχ wɛ̂ lɛ́ jàχ wɛ̀ bí↓nɔ́χ íjàm líjé ɲɔ̀ɔ̂ ɛ̀
ŋgò kòònd ŋgójâŋ wà wɔ́ ní
ɓákwís βìkɔ̀l βí ɓóɲá nɔ́ ɓàkáná ɓí nì ŋgìm ŋʷɔ́m jɔ́ɔ́β ɔ̂ɸ
wǎ wì ná kɛ́k ↓wárâj wǎ jínân ŋgǐm ↓mbóɣrín kó ò

Latin-based text

tjé tji na tjel kak nini
maay mis omorom ma nyim waan
ti na ti kol nza ni bisômbin
kitoñ ék tja bôli kitéñbar
kaa a ñgala yis éy tja bôli kiteñbini
ndi ba tubin kiseen ki mbore woom
bakôn ba beend weni ba ta ntobane

Mbougue Dialect

kíŋɛ̂ já móŋsí ɓáànd ɛ́
ŋ̀kòmbà à mbúl kíŋɛ̂ ꜛɓǐs
núꜛmbàn ɓâm ɓáꜛànd
núꜛmbàn ɔ̂p ɓɔ́ ɓáꜛàn
núꜛmbàn má móŋsí ɓáànd ɛ́
káánd ɓá móŋsí ɓáꜛànd
jègén wáàn mbúùm já ní ní kúŋwâ mbún í ʧíìná à tàn

Nyambat Dialect

mbwáŋ ǒ ꜜhá nɛ̂ ɓǎh tì mbìsě
sɛ̀ɛ́n tʒâm ɛ̀nɛ
tǔhꜜná tʃwɛ́ꜜhɛ́ tʃû m̀fèrbǎt
ɓáꜜlɛ́ ùɓâχnà bîχ hâ kí mɛ̀ńtilà
wìn nì kitùùrínɛ̂ à tùhnâ
ǹlɔ́ŋ û jé ↓lɛ́ matóà má ńlòò kìɓàlà