Lingala-TTS-Dataset

License:

NOODL-1.0

Steward:

Institute of African Digital Humanities

Task: TTS

Release Date: 2/25/2026

Format: WAV, TSV

Size: 962.04 MB


Description

The dataset contains audio and text resources in Lingala, a Bantu language spoken in the Republic of the Congo (also known as 'Congo Brazzaville') and the Democratic Republic of the Congo (DRC). These resources are suitable for TTS and ASR tasks and consist of the following:

  • 8,572 audio clips totalling 4 hours, 25 minutes and 54 seconds;

  • an audio mapping file containing 8,572 lines.
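As a quick sanity check on the figures above, the average clip length follows directly from the stated total duration and clip count. This is a minimal sketch using only the two numbers given in the description; the variable names are illustrative.

```python
from datetime import timedelta

# Figures taken from the description above.
total_duration = timedelta(hours=4, minutes=25, seconds=54)
clip_count = 8572

# Mean clip length in seconds (total seconds divided by number of clips).
mean_seconds = total_duration.total_seconds() / clip_count
print(f"{mean_seconds:.2f} s per clip on average")
```

The short average suggests that the clips are phrase-level segments rather than full answers, which is consistent with the sample transcripts.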

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

  • For research and scientific use only.

  • You agree that you will not re-host or re-share this dataset.

Forbidden Usage

You agree not to use the data for: determining the identity of the speaker in the dataset; attempting to clone the voice or training models that imitate the speaker in this dataset; Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.

Processes

Intended Use

The dataset is suitable for speech-related tasks.

  • Text-to-speech (TTS): The dataset contains non-denoised text–audio pairs of spontaneous speech from one female and one male speaker, totalling 4 hours and 25 minutes. This makes the dataset suitable for evaluating text-to-speech models. Be aware that the orthography used in the transcription of audio recordings may not be strictly standardized.

  • Automatic speech recognition (ASR): Audio–text alignment allows the training or evaluation of speech recognition models for Lingala, which is a valuable tool for language technology. Again, be aware that the orthography used in the transcription of audio recordings may not be strictly standardized.

Metadata

Language

Lingala is a Bantu language spoken primarily in the Democratic Republic of the Congo (DRC) and the Republic of the Congo (Congo-Brazzaville). It functions as a major language of wider communication across the northern and western DRC, including the capital Kinshasa, and across the northern half of Congo-Brazzaville, as well as among Congolese diaspora communities worldwide. Lingala has approximately 15 million native speakers, with an additional 10 million using it regularly as a lingua franca, predominantly in rural areas and the diaspora.

Lingala emerged from Bobangi, a Bantu language historically used as a trade lingua franca along the Congo River in the precolonial period. Through contact with diverse linguistic communities and the influence of colonial and missionary intervention, Bobangi underwent extensive restructuring. Beginning in the early twentieth century, Catholic missionaries of the Congregatio Immaculati Cordis Mariae (the Scheutists) undertook a deliberate program of grammatical and lexical reform, drawing on indigenous languages including Iboko, Mabale, and original Bobangi to produce a normalized written form. This reformed variety was disseminated through mission schools, religious texts, and print media across the western Congo. The name Lingala itself was coined as part of this reform process, replacing the earlier glossonym Bangala.

While the legitimacy of Lingala has been debated, with some characterizing it as an artificially constructed variety and others as a natural contact-induced koine, it is broadly acknowledged that its current geographical spread and sociolinguistic functions are substantially shaped by these missionary and colonial-era interventions.

Variants

Two broad varieties of Lingala are widely recognized.

The first is the reformed or standard variety associated with the Scheutist missionary Egide De Boeck, whose 1904 grammar established a normalized form with restored Bantu nominal prefix systems and expanded vocabulary. This variety, sometimes referred to as book Lingala, school Lingala, or written Lingala, became the medium of formal education, Catholic liturgy, and missionary publications across much of the western Congo and gained acceptance among some Protestant missions as well. It remains in use today in formal registers, religious contexts, and media.

The second is the spoken urban variety, particularly associated with Kinshasa (formerly Leopoldville). This variety retains features of the pre-reform contact language, including an eroded system of nominal prefixes and syntactic concordance, and was heavily influenced by Kikongo and the languages of diverse migrant communities. Residents of Leopoldville adopted the name Lingala for this variety as well, while distinguishing it informally from the missionary standard. This urban spoken Lingala nativized rapidly and became the dominant vernacular of the capital. A Kinshasa Bible translation in this variety was published in 2000, marking a tentative entry of spoken Lingala into domains previously reserved for the standard form.

In the northeastern DRC, the older glossonym Bangala continues to be used alongside or instead of Lingala, reflecting the comparatively weaker penetration of the Scheutist reform in that region.

Alphabet

Status | Letters
Core Lingala alphabet | a, b, d, e, g, i, k, l, m, n, o, p, s, t, u, w, y, z
Integrated diacritic form | é
Productive digraphs | ng, mb, mp, nd, ns, nz, nt
Partially integrated loanword letter | f
Loanword-only letters | c, r, v
French-only / incidental | h, j, q, x, à, ç, è, ê, ë, î, ô, û
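The letter classes listed above can be used as a rough, character-level screen for transcripts that contain French material. The following is a minimal sketch under stated assumptions: the function names are hypothetical, digraphs are not treated separately, and a French word spelled only with letters from the other classes (for example "été") will not be flagged.

```python
import unicodedata

# Letter classes copied from the alphabet table; digraphs are omitted
# because this sketch works on single characters only.
LETTER_CLASSES = [
    ("core Lingala alphabet", "abdegiklmnopstuwyz"),
    ("integrated diacritic form", "é"),
    ("partially integrated loanword letter", "f"),
    ("loanword-only letter", "crv"),
    ("French-only / incidental", "hjqxàçèêëîôû"),
]

STATUS_BY_LETTER = {
    letter: status for status, letters in LETTER_CLASSES for letter in letters
}


def letter_status(ch):
    """Return the table status of one character, or 'not listed'."""
    key = unicodedata.normalize("NFC", ch).lower()
    return STATUS_BY_LETTER.get(key, "not listed")


def french_only_letters(text):
    """Return the sorted French-only / incidental letters found in text."""
    # NFC composes base letter + combining accent into a single character.
    normalized = unicodedata.normalize("NFC", text).lower()
    return sorted(
        {c for c in normalized if letter_status(c) == "French-only / incidental"}
    )


# Sample 10 from this card contains "côté".
print(french_only_letters("Bo ko yaka côté bo moni bilembo, ba niama bazalaki wana,"))
# → ['ô']
```

A non-empty result is only a hint that a line may be code-switched; it is not a language identifier.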

Source

This dataset was created from self-recorded audio by one female and one male native speaker. The recordings were then transcribed. This task aimed to produce datasets suitable for developing text-to-speech models for the Lingala language. The speakers were guided through the process using open questions provided by the research coordinator.

Domain

The questions that prompted the speech recorded by the native speakers of Lingala covered a variety of domains relevant to the cultural practices of the indigenous community and pertained mostly to the following genres: procedural, opinion and philosophical.

Size

Total size is 962.04 MB.

Structure

  • 8,572 audio clips totalling 4 hours, 25 minutes and 54 seconds;

  • Five audio mapping files containing 8,572 lines.

The dataset contains a visible stratum of French utterances and code-switched passages.

One recurring example of code-switching is the labelling of transcription slots that do not contain audible speech as "silence". Users of the dataset may wish to exclude these slots if they are using it for ASR or TTS tasks.
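Excluding the "silence" slots can be sketched as follows. This is a minimal example that assumes each mapping file is a tab-separated file with the audio file name in the first column and the transcript in the second, and that the label appears as the whole transcript; neither detail is confirmed by this card. The function name and the demo file names are hypothetical.

```python
import csv
import io


def drop_silence(rows, label="silence"):
    """Yield (file_name, transcript) pairs whose transcript is not the label.

    Assumption: column 0 is the audio file name, column 1 the transcript.
    Rows with fewer than two columns are skipped.
    """
    for row in rows:
        if len(row) < 2:
            continue
        file_name, transcript = row[0], row[1]
        # Compare case-insensitively and ignore a trailing full stop.
        if transcript.strip().rstrip(".").lower() == label:
            continue
        yield file_name, transcript


# Demo with in-memory data; the file names are invented for illustration.
demo = io.StringIO(
    "clip_a.wav\tKo bunda pona ko zonguela abele na bango.\n"
    "clip_b.wav\tsilence\n"
)
reader = csv.reader(demo, delimiter="\t", quoting=csv.QUOTE_NONE)
kept = list(drop_silence(reader))
print(kept)
```

For a real mapping file, the same reader can be built from `open(path, newline="", encoding="utf-8")` in place of the in-memory buffer.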

Sample

  1. Lingala-TTS-Dataset_05_T709_T710.wav | Ko bunda pona ko zonguela abele na bango.

  2. Lingala-TTS-Dataset_04_T211_T212.wav | Eko zala pondu oyo eko zala kaka bongo,

  3. Lingala-TTS-Dataset_04_T327_T328.wav | Azala, asala na biso kaka nati asalaka na kala,

  4. Lingala-TTS-Dataset_04_T105_T106.wav | Otambola malembe malembe kino oko koma na route nationale.

  5. Lingala-TTS-Dataset_04_T191_T192.wav | Oko sokola ye, oko tika ye na pembeni, oko sala osaka na yo,

  6. Lingala-TTS-Dataset_04_T97_T98.wav | Donc eko zala na bato ba kowuta Ollombo ba ko kende ko teka biloko na bango kuna après bazongi na bango.

  7. Lingala-TTS-Dataset_05_T747_T748.wav | Biso kuna to zalaka boye, to vivre- to- to bikaka boye,

  8. Lingala-TTS-Dataset_03_T1457_T1458.wav | ba todisaka yango na matiti ya ebele ya kitoko

  9. Lingala-TTS-Dataset_04_T282_T283.wav | Eko sala été likama ekoki ko komela na ngonga nionso.

  10. Lingala-TTS-Dataset_04_T610_T611.wav | Bo ko yaka côté bo moni bilembo, ba niama bazalaki wana,
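The sample file names above share a visible pattern: the dataset name, a two-digit field, and two numbers prefixed with "T". The sketch below splits a name into those parts so that clips can be grouped or sorted. The meaning of each field is not documented in this card, so the neutral names used here (group, first_id, second_id) are assumptions, as is the function name.

```python
import re

# Pattern inferred from the ten sample names only; other names may differ.
CLIP_NAME_RE = re.compile(r"^Lingala-TTS-Dataset_(\d+)_T(\d+)_T(\d+)\.wav$")


def parse_clip_name(name):
    """Return (group, first_id, second_id), or None if the name does not match."""
    match = CLIP_NAME_RE.match(name)
    if match is None:
        return None
    group, first_id, second_id = match.groups()
    # Keep the group as text to preserve its leading zero.
    return group, int(first_id), int(second_id)


print(parse_clip_name("Lingala-TTS-Dataset_05_T709_T710.wav"))
# → ('05', 709, 710)
```

Returning None for unmatched names lets a caller list and inspect files that do not follow the inferred pattern rather than failing on them.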