Bamun-French Parallel Corpus 2.0
License:
NOODL-1.0
Steward:
Institute of African Digital Humanities
Task: MT
Release Date: 3/25/2026
Format: TSV
Size: 184.29 KB
Description
This dataset is an extended and updated version of the 'Bamun-French Parallel Corpus 1.1' that is published on the Mozilla Data Collective platform. It is a parallel corpus of 4,444 lines in Bamun and French suitable for machine translation tasks. The text was obtained by transcribing raw audio files. Translations were added to enrich the original corpus. Bamun and French text alignment was performed in the process of creating this dataset. This version of the dataset resolves formatting issues flagged in the original and nearly doubles the number of aligned translation units compared to version 1.1.
Specifics
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Considerations
Restrictions/Special Constraints
By downloading this dataset, you agree:
- to use it for research and scientific purposes only
- not to re-host or re-share this dataset
Forbidden Usage
You agree not to use the data for: generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Processes
Intended Use
This dataset is intended for the training or testing of machine learning models. Its purpose is to support the learning and revitalisation of the Bamun (Shupamem) language.
Metadata
Language
Bamun, also known as Shüpamom or Shupamem, is a Grassfields Bantu language spoken in the Noun Division of the West Region of Cameroon.
Variants
The Bamun language is quite homogeneous within its indigenous territory, the Noun Administrative Division. However, the Administrative Atlas of Cameroon's Languages (Breton and Bikia Fohtung, 1991) indicates a few "islands" outside the Noun Division where the Bamun language exhibits minor variations. These include Bapi in the Mifi Division in the West Region and Bamalang and Bangolan in the Mezam Division in the Northwest Region.
Writing System
1. Vowels
The vowel inventory reflected in the dataset is: i, e, ɛ, a, ɔ, o, u, ʉ, ə
The vowel ə / ә is particularly frequent and functions as a central vowel.
2. Consonants
The consonant system includes the following simple consonants: b, d, f, g, h, j, k, l, m, n, ŋ, p, r, s, t, v, w, y, z
Complex and cluster-like consonants attested include: mb, nd, nk, ng, nt, nj, mf, kp
Digraphs: sh, gh
3. Tone system
The transcription encodes lexical tone using diacritics, corresponding to standard tonal categories:
High tone (H): marked with acute accent (á, é, ɔ́, ʉ́, ŋ́)
Low tone (L): marked with grave accent (à, è, ɔ̀)
Mid tone (M): marked with macron (ā, ē)
Rising tone (LH): marked with caron (ǎ, ě, ɔ̌)
Falling tone (HL): marked with circumflex (â, ê)
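The mapping between diacritics and tone categories listed above can be checked programmatically. The following is a minimal sketch, not part of the dataset tooling: it assumes that tone marks can be recovered by Unicode canonical decomposition (NFD), and the names `TONE_MARKS` and `tone_counts` are hypothetical.

```python
import unicodedata
from collections import Counter

# Combining diacritics for the five tone categories described above.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent
    "\u0300": "L",   # combining grave accent
    "\u0304": "M",   # combining macron
    "\u030C": "LH",  # combining caron
    "\u0302": "HL",  # combining circumflex accent
}


def tone_counts(text: str) -> Counter:
    """Count tone marks in `text` after canonical decomposition (NFD).

    NFD splits precomposed letters such as U+00E1 into a base letter
    plus a combining mark, so precomposed and decomposed spellings
    are counted the same way.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return Counter(TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS)


# A Bamun line copied from the Sample section of this card.
print(tone_counts("Ndǔ lʉ́m mú, ká u lá' nә́, tә njʉ́ nә́."))
```

This sketch only makes sense on the Bamun side of the corpus: French orthography uses the acute, grave and circumflex accents for non-tonal purposes, so running it on the French column would produce meaningless counts. Vowels without a diacritic are not counted at all.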
Source
This dataset originates from audio recordings documenting personal histories of German colonisation. These recordings were made in the early 1980s as part of a research project led by Prince (Professor) Koum A Ndoumbe III.
Abdou Salam Ntieche Fifen created the transcriptions and French translations in this dataset. The transcriptions were made in 2017 under the coordination of the AfricAvenir Foundation. For the purpose of creating this dataset, Abdou Salam Ntieche Fifen has aligned and quality-checked the transcribed text and its translations.
Domain
This dataset is a transcription of prompted speech in the form of a directed interview. The aim of the interview was to elicit personal stories about the German colonial experience in Cameroon. Similar interviews were conducted in many other languages and locations across Cameroon.
Size
552.60 KB
Structure
This parallel corpus comprises 4,444 lines, each consisting of a translation unit in both the source and target languages. The Bamun source text has 50,161 tokens, while the French target text has 43,646 tokens.
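As an illustration of how the line and token figures above could be reproduced, here is a hedged sketch. It assumes that each line of the TSV holds exactly two tab-separated fields (Bamun first, French second), that there is no header row, and that tokens are whitespace-delimited; the card does not document any of these details, and the function names are hypothetical.

```python
import csv
import io
from typing import Iterable, List, Tuple


def read_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse tab-separated lines into (bamun, french) pairs.

    Assumption: two columns per line, Bamun first and French second.
    QUOTE_NONE keeps apostrophes and quotation marks as literal text.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs


def token_count(texts: Iterable[str]) -> int:
    """Count whitespace-delimited tokens (an assumed tokenisation)."""
    return sum(len(text.split()) for text in texts)


# Two translation units copied from the Sample section, joined with a tab.
demo = io.StringIO(
    "Í nzie Li shá?\tVeut-il parler de mon nom ?\n"
    "Ndǔ lʉ́m mú, ká u lá' nә́, tә njʉ́ nә́.\tTon âge, comme tu as vécu, dans le monde.\n"
)
pairs = read_pairs(demo)
print(len(pairs), "translation units")
print(token_count(b for b, _ in pairs), "Bamun tokens")
print(token_count(f for _, f in pairs), "French tokens")
```

To run this on the downloaded file, pass an open file handle instead of the in-memory demo, for example `open(path, encoding="utf-8", newline="")`, where `path` is wherever the TSV was saved. If the published token counts were produced with a different tokeniser, the totals from this sketch will differ.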
The table below compares the key statistics of version 2.0 with the previously published version 1.1.
| Metric | Version 1.1 | Version 2.0 | Change |
|---|---|---|---|
| File size | 99.24 KB | 552.60 KB | +453.36 KB |
| Lines | 2,301 | 4,444 | +2,143 |
| Bamun tokens | 26,712 | 50,161 | +23,449 |
| French tokens | 23,739 | 43,646 | +19,907 |
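The count-based rows of the table can be re-derived with a few lines of arithmetic. The sketch below uses only the line and token counts printed in the table; the `growth` helper is a hypothetical name introduced for illustration.

```python
from typing import Tuple

# Counts copied from the comparison table: (version 1.1, version 2.0).
stats = {
    "Lines": (2301, 4444),
    "Bamun tokens": (26712, 50161),
    "French tokens": (23739, 43646),
}


def growth(old: int, new: int) -> Tuple[int, float]:
    """Return the absolute change and the new/old ratio (2 decimals)."""
    return new - old, round(new / old, 2)


for name, (v1, v2) in stats.items():
    delta, ratio = growth(v1, v2)
    print(f"{name}: +{delta:,} ({ratio}x)")
```

The ratio for lines comes out just under 2, which is the basis for describing version 2.0 as nearly doubling the number of aligned translation units.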
Sample
Euh mí u tóóshә́ ŋwәt ru, nzíé yúá yʉ́ә́ u púá' yírә́ nә́. | Que tu te présentes, tu dis ce que tu es.
Í nzie Li shá? | Veut-il parler de mon nom ?
Li shú, nә nguu yúá u púá' yírә́ nә́, nә́ nguu yúá u yí nә́. | Ton nom, et tout ce que tu es, et tout ce que tu sais.
Ndǔ lʉ́m mú, ká u lá' nә́, tә njʉ́ nә́. | Ton âge, comme tu as vécu, dans le monde.
Mbúá' NJI FIFEN Andre Kʉ́ә́ndap | Je suis NJI FIFEN ANDRE Kouendap …