Beembe-TTS-Dataset

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

Although the transcription of the audio recording was created using a standardised writing system, it may not reflect the standard used by the wider Beembe community. This is often the case in communities using a low-resource language, where different writing standards may be in use. Therefore, this should be taken into account when using this dataset for TTS and ASR tasks. Ideally, the resulting TTS or ASR models should explicitly state which writing system was used.

Forbidden Usage

Generative AI, reproduction, duplication, modification, augmentation, copying, distribution, transmission, display, sale, transfer, publication or creation of derivative works without the explicit permission of the legal owner of the dataset.

Processes

Intended Use

The dataset is suitable for speech-related tasks.

- Automatic speech recognition (ASR): Audio–text alignment allows the training or evaluation of speech recognition models for Beembe, which is a valuable tool for language technology. Please be aware that the sentences follow one writing standard, which may co-exist with other writing standards within the Beembe community.
- Text-to-speech (TTS): The dataset contains clean sentence–audio pairs read by the same speaker, totalling about 4.5 hours. This makes the dataset suitable for training or evaluating text-to-speech models. Again, note that the writing standard may be just one of several in use within the community.

Metadata

Language

Beembe, also known as Kibembe or Kibeembe (native speakers call it kiBeembe), is a Bantu language spoken by around 100,000 people in the Republic of the Congo. It is part of the Kongo language cluster, is closely related to Kikongo, and is classified as a stable language by Ethnologue online. Note that Beembe (beq) should not be confused with either the Bemba (bem) language of Zambia/DRC or the Bembe (bmb) language spoken in Tanzania/DRC, though they share roots.

Variants

Dialects include Keenge (kiKeenge) and Yari (kiYari).

Alphabet

The alphabet used to create transcriptions of audio files that constitute this dataset is Latin-based. It contains the following characters: a, e, i, o, u, é, è, ò, b, c, d, f, g, h, k, l, m, n, p, r, s, t, v, w, y, z, mb, mp, mf, mpw, nd, ng, nk, nts, ndz, nj, ngw, kw, gw, sw, bw, fw
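For TTS or ASR front ends it can help to split transcriptions into the units listed above rather than into single letters. The following is a minimal sketch, assuming that the listed multigraphs (for example mb, ng, nts) should be treated as single tokens and that greedy longest-match is an acceptable way to find them. Both points are assumptions made for illustration, not something the dataset prescribes, and the names `GRAPHEMES` and `tokenize` are hypothetical.

```python
import unicodedata

# Characters and multigraphs exactly as listed in the Alphabet section.
GRAPHEMES = [
    "a", "e", "i", "o", "u", "é", "è", "ò",
    "b", "c", "d", "f", "g", "h", "k", "l", "m", "n", "p",
    "r", "s", "t", "v", "w", "y", "z",
    "mb", "mp", "mf", "mpw", "nd", "ng", "nk", "nts", "ndz",
    "nj", "ngw", "kw", "gw", "sw", "bw", "fw",
]

# Normalise to NFC so accented letters compare equal however they are encoded,
# then try the longest candidates first (assumed greedy longest-match strategy).
_BY_LENGTH = sorted(
    (unicodedata.normalize("NFC", g) for g in GRAPHEMES),
    key=len,
    reverse=True,
)


def tokenize(text):
    """Split text into listed graphemes; unlisted characters pass through one by one."""
    text = unicodedata.normalize("NFC", text).lower()
    tokens = []
    i = 0
    while i < len(text):
        for grapheme in _BY_LENGTH:
            if text.startswith(grapheme, i):
                tokens.append(grapheme)
                i += len(grapheme)
                break
        else:
            # Spaces, punctuation and letters outside the list are kept as-is.
            tokens.append(text[i])
            i += 1
    return tokens


if __name__ == "__main__":
    print(tokenize("mbit nia"))
```

Greedy matching will merge any adjacent letters that happen to form a listed multigraph, so the output should be checked against a speaker's judgement before it is used to build a model vocabulary.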

Source

This dataset was created from audio recordings that a male native speaker made of himself. The speaker then transcribed the recordings. The aim was to produce a dataset suitable for developing text-to-speech models for the Beembe language. The speaker was guided through the process using open questions provided by the research coordinator.

Domain

The questions that prompted the speech recorded by the native speaker of Beembe covered a variety of domains relevant to the cultural practices of the Beembe community, and pertained mostly to the following genres: procedural, opinion and philosophical.

Size

Total size is 861.46 MB

Structure

The audio corpus consists of 6,933 clips read by one speaker, totalling 275 min 48.35 sec. The dataset also contains a mapping file that pairs audio files with text and has 4,422 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt.
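The mapping format described above can be read with a few lines of Python. This is a minimal sketch, assuming the file is UTF-8 encoded and that the first tab on each line separates the file name from the text. The function names `parse_mapping_lines` and `read_mapping` are hypothetical, and the path to the mapping file is left to the caller because its name is not stated here.

```python
def parse_mapping_lines(lines):
    """Turn 'filename<TAB>text' lines into a list of (filename, text) pairs."""
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so any later tabs stay in the text.
        name, separator, text = line.partition("\t")
        if not separator:
            raise ValueError(f"line {line_number}: no tab separator found")
        pairs.append((name.strip(), text.strip()))
    return pairs


def read_mapping(path):
    """Read a mapping file from disk (UTF-8 encoding is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return parse_mapping_lines(handle)


if __name__ == "__main__":
    sample = ["TTS-Beembe_01_T55_T56.wav\tdonc ba yèk na ma tramadol\n"]
    for name, text in parse_mapping_lines(sample):
        print(name, "->", text)
```

Raising an error on a line without a tab, rather than skipping it silently, makes it easier to spot a mapping file whose separators were converted to spaces.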

Sample

TTS-Beembe_01_T54_T55.wav bajène ba yèk kunwa binik dia ngolo
TTS-Beembe_01_T55_T56.wav donc ba yèk na ma tramadol
TTS-Beembe_01_T56_T57.wav ba yèk kunwa dyamba di ba kotéla fome dya di laa
TTS-Beembe_01_T57_T58.wav hum ba yèk buzite dio pè
TTS-Beembe_01_T58_T59.wav ba yèk ma brakère ma kuluna
TTS-Beembe_01_T59_T60.wav donc mbit nia bè prizidon
TTS-Beembe_01_T60_T61.wav mu ngonde dya ntete dzo désizion dya mè dya dzole
TTS-Beembe_01_T61_T62.wav dir ku édika bo
TTS-Beembe_01_T62_T63.wav fo ku kwta majène bo
TTS-Beembe_01_T63_T64.wav kusaka ma mwayin bune tusa