Bamun-French Parallel Corpus 2.0

License:

NOODL-1.0

Steward:

Institute of African Digital Humanities

Task: MT

Release Date: 3/25/2026

Format: TSV

Size: 184.29 KB

Description

This dataset is an extended and updated version of the 'Bamun-French Parallel Corpus 1.1' published on the Mozilla Data Collective platform. It is a parallel corpus of 4,444 lines in Bamun and French, suitable for machine translation tasks. The Bamun text was obtained by transcribing raw audio files, and French translations were added to enrich the original corpus. The Bamun and French texts were aligned while creating this dataset. This version resolves formatting issues flagged in the original and nearly doubles the number of aligned translation units compared to version 1.1 (from 2,301 to 4,444).

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

By downloading this dataset, you agree:
- to use it for research and scientific purposes only
- not to re-host or re-share this dataset

Forbidden Usage

You agree not to use the data for:
- Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.

Processes

Intended Use

This dataset is intended for the training or testing of machine learning models. Its purpose is to support the learning and revitalisation of the Bamun (Shupamem) language.

Metadata

Language

Bamun, also called Shüpamom or Shupamem, is a Grassfields Bantu language spoken in the Noun Division of the West Region of Cameroon.

Variants

The Bamun language is quite homogeneous within its indigenous territory, the Noun Administrative Division. However, the Administrative Atlas of Cameroon's Languages (Breton and Bikia Fohtung, 1991) indicates a few "islands" outside the Noun Division where the Bamun language exhibits minor variations. These include Bapi in the Mifi Division of the West Region, and Bamalang and Bangolan in the Mezam Division of the Northwest Region.

Writing System

1. Vowels

The vowel inventory reflected in the dataset is: i, e, ɛ, a, ɔ, o, u, ʉ, ə. The vowel ə / ә is particularly frequent and functions as a central vowel.
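The two schwa glyphs shown above look alike but appear to be distinct Unicode code points: the Latin small letter schwa (U+0259) and the Cyrillic small letter schwa (U+04D9). The following is a minimal Python sketch for counting each form and, optionally, unifying them before tokenisation. The function names are hypothetical, and mapping the Cyrillic form to the Latin one is an assumption made here for illustration, not a step described by the dataset's authors.

```python
import unicodedata
from collections import Counter

LATIN_SCHWA = "\u0259"     # LATIN SMALL LETTER SCHWA
CYRILLIC_SCHWA = "\u04d9"  # CYRILLIC SMALL LETTER SCHWA


def count_schwas(text: str) -> Counter:
    """Count occurrences of each lowercase schwa code point in text."""
    return Counter(ch for ch in text if ch in (LATIN_SCHWA, CYRILLIC_SCHWA))


def unify_schwas(text: str) -> str:
    """Replace the Cyrillic schwa with the Latin schwa.

    Assumption: a single code point is wanted downstream. Uppercase
    schwas are not handled in this sketch.
    """
    return text.replace(CYRILLIC_SCHWA, LATIN_SCHWA)


if __name__ == "__main__":
    # Illustrative string mixing both forms; not taken from the corpus.
    mixed = "n\u04d9 n\u0259"
    for ch, n in count_schwas(mixed).items():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {n}")
    print(unify_schwas(mixed))
```

Checking the counts before normalising shows whether both forms actually occur in a given copy of the file, which is worth knowing because a model would otherwise treat them as different characters.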

2. Consonants

The consonant system includes the following simple consonants: b, d, f, g, h, j, k, l, m, n, ŋ, p, r, s, t, v, w, y, z
Complex and cluster-like consonants attested include: mb, nd, nk, ng, nt, nj, mf, kp
Digraphs: sh, gh

3. Tone system

The transcription encodes lexical tone using diacritics, corresponding to standard tonal categories:

High tone (H): marked with acute accent (á, é, ɔ́, ʉ́, ŋ́)
Low tone (L): marked with grave accent (à, è, ɔ̀)
Mid tone (M): marked with macron (ā, ē)
Rising tone (LH): marked with caron (ǎ, ě, ɔ̌)
Falling tone (HL): marked with circumflex (â, ê)
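
Because the tones are written as diacritics, they can be read off the text after Unicode canonical decomposition (NFD), which separates each base letter from its combining mark. The sketch below maps the five marks listed above to their tone labels. The function names are hypothetical, and the sketch assumes it is applied to the Bamun column only, since French also uses acute, grave, and circumflex accents with no tonal meaning.

```python
import unicodedata

# Combining marks mapped to the tone labels listed above.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent
    "\u0300": "L",   # combining grave accent
    "\u0304": "M",   # combining macron
    "\u030c": "LH",  # combining caron
    "\u0302": "HL",  # combining circumflex accent
}


def tone_sequence(text: str) -> list[str]:
    """Return the tone labels of all tone-marked characters, in order."""
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]


def strip_tones(text: str) -> str:
    """Remove the five tone marks and recompose the remaining text."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    phrase = "Nd\u01d4 l\u0289\u0301m m\u00fa"  # first words of a sample line
    print(tone_sequence(phrase))
    print(strip_tones(phrase))
```

Syllables written without a diacritic are simply skipped here; the sketch makes no claim about what tone, if any, an unmarked vowel carries.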

Source

This dataset originates from audio recordings documenting personal histories of German colonisation. These recordings were made in the early 1980s as part of a research project led by Prince (Professor) Koum A Ndoumbe III.

Abdou Salam Ntieche Fifen created the transcriptions and French translations in this dataset. The transcriptions were made in 2017 under the coordination of the AfricAvenir Foundation. For the purpose of creating this dataset, Abdou Salam Ntieche Fifen has aligned and quality-checked the transcribed text and its translations.

Domain

This dataset is a transcription of prompted speech in the form of a directed interview. The aim of the interview was to elicit personal stories about the German colonial experience in Cameroon. Similar interviews were conducted in many other languages and locations across Cameroon.

Size

552.60 KB

Structure

This parallel corpus comprises 4,444 lines, each consisting of a translation unit in both the source and target languages. The Bamun source text has 50,161 tokens, while the French target text has 43,646 tokens.
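
As a rough way to check these figures against a downloaded copy, the sketch below counts translation units and tokens in a TSV stream. It rests on several assumptions that the card does not confirm: that the file has no header row, that the first column holds Bamun and the second French, and that tokens are split on whitespace. If the published counts were produced with a different tokeniser, the results will differ. The function name is hypothetical.

```python
import csv
import io


def corpus_stats(handle) -> dict:
    """Count translation units and whitespace tokens in a two-column TSV.

    Assumed layout: column 0 = Bamun, column 1 = French, no header row.
    """
    # QUOTE_NONE keeps apostrophes and quotation marks as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    lines = bamun_tokens = french_tokens = 0
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        lines += 1
        bamun_tokens += len(row[0].split())
        french_tokens += len(row[1].split())
    return {
        "lines": lines,
        "bamun_tokens": bamun_tokens,
        "french_tokens": french_tokens,
    }


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real file.
    demo = io.StringIO("Li sh\u00fa\tTon nom\n\u00cd nzie\tVeut-il parler\n")
    print(corpus_stats(demo))
```

For a real file, the same function can be called on a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the TSV was saved.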

The table below compares the key statistics of version 2.0 with the previously published version 1.1.

Metric	Version 1.1	Version 2.0	Change
File size	99.24 KB	552.60 KB	+453.36 KB
Lines	2,301	4,444	+2,143
Bamun tokens	26,712	50,161	+23,449
French tokens	23,739	43,646	+19,907
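
The Change column for the count rows is the difference between the two version columns. The short sketch below recomputes it and adds a growth ratio, which is not part of the original table and is included only to illustrate the "nearly doubles" statement. The variable and function names are hypothetical; the figures are copied from the table above.

```python
# (version 1.1, version 2.0) counts copied from the table above.
COUNTS = {
    "Lines": (2301, 4444),
    "Bamun tokens": (26712, 50161),
    "French tokens": (23739, 43646),
}


def change(v1: int, v2: int) -> tuple[int, float]:
    """Return the absolute change and the growth ratio v2 / v1."""
    return v2 - v1, v2 / v1


if __name__ == "__main__":
    for metric, (v1, v2) in COUNTS.items():
        delta, ratio = change(v1, v2)
        print(f"{metric}: {delta:+,} (x{ratio:.2f})")
```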

Sample

Euh mí u tóóshә́ ŋwәt ru, nzíé yúá yʉ́ә́ u púá' yírә́ nә́. | Que tu te présentes, tu dis ce que tu es.
Í nzie Li shá? | Veut-il parler de mon nom ?
Li shú, nә nguu yúá u púá' yírә́ nә́, nә́ nguu yúá u yí nә́. | Ton nom, et tout ce que tu es, et tout ce que tu sais.
Ndǔ lʉ́m mú, ká u lá' nә́, tә njʉ́ nә́. | Ton âge, comme tu as vécu, dans le monde.
Mbúá' NJI FIFEN Andre Kʉ́ә́ndap | Je suis NJI FIFEN ANDRE Kouendap …