Suundi-TTS-Dataset

License:

NOODL-1.0

Steward:

Institute of African Digital Humanities

Task: TTS

Release Date: 12/11/2025

Format: WAV, TSV

Size: 240.50 MB


Description

The dataset consists of paired audio and text data in Suundi (sdj), a language spoken in the Republic of the Congo. The audio corpus consists of 4,187 clips read by one speaker, totaling 188 min 22.68 sec. The dataset also contains a mapping file of audio and text with 4,185 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt. This dataset is suitable for TTS tasks.

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

Although the audio recordings were transcribed using a standardised writing system, that system may not reflect the standard used by the wider Suundi community. This is often the case in communities using a low-resource language, where different writing standards may be in use. Users should take this into account when using this dataset for TTS and ASR tasks. Ideally, the documentation of any resulting TTS or ASR model should state explicitly which writing system was used.

Forbidden Usage

Generative AI, reproduction, duplication, modification, augmentation, copying, distribution, transmission, display, sale, transfer, publication or creation of derivative works without the explicit permission of the legal owner of the dataset.

Processes

Intended Use

The dataset is suitable for speech-related tasks.

- Automatic speech recognition (ASR): Audio–text alignment allows the training or evaluation of speech recognition models for Suundi, which are valuable tools for language technology. Please be aware that the read sentences follow one writing standard, which may co-exist with other writing standards within the Suundi community.

- Text-to-speech (TTS): The dataset contains clean sentence–audio pairs read by the same speaker, totalling 188 min 22.68 sec (about 3.1 hours). This makes the dataset suitable for evaluating text-to-speech models. Again, note that the writing standard may be just one of several in use within the community.

Metadata

Language

Suundi (also spelled Sundi or Nsundi) is a Bantu language of the large Niger–Congo family, spoken by the Sundi people — a subgroup of the Kongo ethnic community — primarily in the Republic of the Congo and extending into parts of Angola (especially Cabinda). As a member of the Kongo (Zone H.10) group of Bantu languages, Suundi shares key structural features typical of this branch, such as noun class systems and agglutinative verbal morphology, which are widespread across Bantu languages. While traditionally unwritten, the language has been documented in wordlists and community materials, reflecting its stable intergenerational transmission even though it lacks institutional support in formal education and media.

Variants

Alphabet

The alphabet used in the transcription of audio resources in this dataset is as follows: a, b, bw, d, e, f, fw, g, gw, h, i, k, kw, l, m, mb, mf, mp, mw, n, nd, ng, nk, ns, nt, nz, nw, o, p, r, s, sw, t, ts, tw, u, v, w, y, z, zv.
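The alphabet above mixes single letters with two-letter units such as bw, mb and ts. A minimal sketch of how a transcript could be split into these units is shown below. It assumes greedy longest-match segmentation, which the dataset documentation does not specify, and the function name tokenize is illustrative only. Characters outside the listed alphabet are collected separately so that they can be inspected.

```python
from typing import List, Tuple

# Alphabet exactly as listed in this dataset card.
ALPHABET = [
    "a", "b", "bw", "d", "e", "f", "fw", "g", "gw", "h", "i", "k", "kw",
    "l", "m", "mb", "mf", "mp", "mw", "n", "nd", "ng", "nk", "ns", "nt",
    "nz", "nw", "o", "p", "r", "s", "sw", "t", "ts", "tw", "u", "v", "w",
    "y", "z", "zv",
]
_UNITS = set(ALPHABET)
_MAX_LEN = max(len(unit) for unit in ALPHABET)


def tokenize(text: str) -> Tuple[List[str], List[str]]:
    """Split text into alphabet units, preferring the longest match.

    Returns (units, unknown), where unknown holds characters that are
    not covered by the alphabet. Greedy matching is an assumption, not
    a rule stated by the dataset authors.
    """
    units: List[str] = []
    unknown: List[str] = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            # Try the longest candidate first, then shorter ones.
            for size in range(min(_MAX_LEN, len(word) - i), 0, -1):
                chunk = word[i:i + size]
                if chunk in _UNITS:
                    units.append(chunk)
                    i += size
                    break
            else:
                # No unit matched: record the character and move on.
                unknown.append(word[i])
                i += 1
    return units, unknown
```

Running tokenize over every transcript and checking that the unknown list stays empty is one way to confirm that a text conforms to the listed alphabet before it is used for training or evaluation.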

Source

This dataset was created from audio recordings that a male native speaker made of himself. The speaker then transcribed the recordings. The aim of this task was to produce datasets suitable for developing text-to-speech models for the Suundi language. The speaker was guided through the process using open questions provided by the research coordinator.

Domain

The questions that prompted the speech recorded by the native speaker of Suundi covered a variety of domains relevant to the cultural practices of the Suundi community, and pertained mostly to the following genres: procedural, opinion and philosophical.

Size

Total size is 240.50 MB.

Structure

The audio corpus consists of 4,187 clips read by one speaker, totaling 188 min 22.68 sec. The dataset also contains a mapping file of audio and text with 4,185 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt.
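The line format described above (audio file name, tab, text) can be read with a few lines of Python. The sketch below is illustrative only: the function names parse_mapping and unmapped_clips are hypothetical, and the handling of blank lines is an assumption. Because the corpus has 4,187 clips but the mapping file has 4,185 lines, the second helper lists clips that have no transcript.

```python
from typing import Iterable, List, Set, Tuple


def parse_mapping(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each non-empty line at the first tab into (audio_name, text)."""
    pairs: List[Tuple[str, str]] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines carry no data
        name, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"line {line_no}: no tab separator found")
        pairs.append((name.strip(), text.strip()))
    return pairs


def unmapped_clips(clip_names: Iterable[str],
                   pairs: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return the clip names that do not appear in the mapping."""
    mapped = {name for name, _ in pairs}
    return set(clip_names) - mapped
```

In practice the lines would come from the mapping file opened in text mode (UTF-8 is assumed here, and the file name is not given in this card), and clip_names from a listing of the WAV files.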

Sample

  1. Suundi_TTS_01_T859_T860.wav ba fumu buala buwe buwe ba na kuwe sadidi

  2. Suundi_TTS_01_T860_T861.wav buwe buwe ba na kuwe vuga moyo

  3. Suundi_TTS_01_T840_T841.wav kassi gue we kuela kuaku muketo ya kissita

  4. Suundi_TTS_01_T828_T829.wav kassi fo pa ba zaba ke yeto ku tsoeto

  5. Suundi_TTS_01_T813_T814.wav fo laba na ludedomo missamu miweno

  6. Suundi_TTS_01_T775_T776.wav mu makuela bosso kuedidi muketo

  7. Suundi_TTS_01_T769_T770.wav makuela yidia kima ya betebete ko

  8. Suundi_TTS_01_T745_T746.wav tsodiani wa na kuwe fuyidi passi

  9. Suundi_TTS_01_T723_T724.wav baneto zi ba boga ba yeni ku banawu

  10. Suundi_TTS_01_T702_T703.wav buwe buwe bakulu ba be kuwe bedi