Bamun-French Parallel Corpus
License: NOODL-1.0
Steward: Institute of African Digital Humanities
Task: MT
Release Date: 12/24/2025
Format: TSV
Size: 99.24 KB
Description
This dataset is a parallel corpus of Bamun (Shupamem) texts and their French translations. The Bamun texts were obtained by transcribing raw audio files. French translations were added to enrich the original corpus, and the Bamun and French texts were aligned in the process of creating this dataset.
Specifics
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Considerations
Restrictions/Special Constraints
By downloading this dataset, you agree:
- To use it for research and scientific use only
- To contribute to the development of machine translation systems for the Bamun language
- That you will not re-host or re-share this dataset
Forbidden Usage
You agree not to use the data for:
- Determining the identity of the speakers in the dataset
- Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Processes
Intended Use
This dataset is intended for the training and testing of machine translation models for the Bamun language.
Metadata
Language
Bamun, also called Shüpamom or Shupamem, is a Grassfields Bantu language spoken in the Noun Division of the West Region of Cameroon.
Variants
The Bamun language is quite homogeneous within its indigenous territory, the Noun Administrative Division. However, the Administrative Atlas of Cameroon's Languages (Breton and Bikia Fohtung, 1991) indicates a few "islands" outside the Noun Department where the Bamun language exhibits minor variations. These include Bapi in the Mifi Division in the West Region and Bamalang and Bangolan in the Mezam Division in the Northwest Region.
Writing System
1. Vowels
The vowel inventory reflected in the dataset is: i, e, ɛ, a, ɔ, o, u, ʉ, ə
The vowel ə / ә is particularly frequent and functions as a central vowel.
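The two schwa glyphs shown above look identical but appear to be different code points. The following is a minimal sketch for unifying them before training or counting tokens. It assumes the two glyphs are the Latin schwa (U+0259) and the Cyrillic schwa (U+04D9), and that the Latin form is the preferred target; neither point is confirmed by the dataset documentation. The names SCHWA_MAP and normalize_schwa are hypothetical.

```python
# Assumption: Cyrillic schwa (U+04D9 / U+04D8) and Latin schwa (U+0259 / U+018F)
# are used interchangeably in the corpus, and Latin is the form to keep.
SCHWA_MAP = str.maketrans({
    "\u04d9": "\u0259",  # Cyrillic small schwa   -> Latin small schwa
    "\u04d8": "\u018f",  # Cyrillic capital schwa -> Latin capital schwa
})


def normalize_schwa(text):
    """Replace Cyrillic schwa characters with their Latin look-alikes."""
    return text.translate(SCHWA_MAP)


if __name__ == "__main__":
    mixed = "k\u0289\u04d9"  # a word written with a Cyrillic schwa
    print(normalize_schwa(mixed) == "k\u0289\u0259")
```

Without such a step, two visually identical words can end up as distinct vocabulary entries.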
2. Consonants
The consonant system includes the following simple consonants: b, d, f, g, h, j, k, l, m, n, ŋ, p, r, s, t, v, w, y, z
Complex and cluster-like consonants attested include: mb, nd, nk, ng, nt, nj, mf, kp
Digraphs: sh, gh
3. Tone system
The transcription encodes lexical tone using diacritics, corresponding to standard tonal categories:
High tone (H): marked with acute accent (á, é, ɔ́, ʉ́, ŋ́)
Low tone (L): marked with grave accent (à, è, ɔ̀)
Mid tone (M): marked with macron (ā, ē)
Rising tone (LH): marked with caron (ǎ, ě, ɔ̌)
Falling tone (HL): marked with circumflex (â, ê)
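The diacritic-to-tone mapping listed above can be applied programmatically. The sketch below is illustrative only: it decomposes text with Unicode NFD normalization and maps each combining mark to the tone label given in the list. The names TONE_MARKS, tone_sequence, and strip_tones are hypothetical and not part of the dataset.

```python
import unicodedata

# Combining marks mapped to the tone labels listed above.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent
    "\u0300": "L",   # combining grave accent
    "\u0304": "M",   # combining macron
    "\u030c": "LH",  # combining caron
    "\u0302": "HL",  # combining circumflex
}


def tone_sequence(text):
    """Return the tone label of every tone diacritic in text, in order."""
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]


def strip_tones(text):
    """Remove the five tone diacritics and recompose the remaining text."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    word = "w\u0289\u0301n"  # a word carrying one acute accent
    print(tone_sequence(word), strip_tones(word))
```

These functions are meant for the Bamun side only. French uses the same acute, grave, and circumflex accents for non-tonal purposes, so applying them to the French column would give misleading results.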
Source
This dataset originates from audio recordings documenting personal histories of German colonization. These recordings were made in the early 1980s as part of a research project led by Prince (Professor) Koum A Ndoumbe III.
Abdou Salam Ntieche Fifen created the transcriptions and French translations in this dataset. The transcriptions were made in 2017 under the coordination of the AfricAvenir Foundation. For the purpose of creating this dataset, Abdou Salam Ntieche Fifen has aligned and quality-checked the transcribed text and its translations.
Domain
This dataset is a transcription of prompted speech in the form of a directed interview. The aim of the interview was to elicit personal stories about the German colonial experience in Cameroon. Similar interviews were conducted in many other languages and locations across Cameroon.
Size
99.24 KB
Structure
This parallel corpus comprises 2,301 lines, each consisting of a translation unit in both the source and target languages. The Bamun source text has 26,712 tokens, while the French target text has 23,739.
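As a rough illustration of this structure, the sketch below reads a two-column TSV and counts translation units and tokens per side. It assumes the Bamun text is in the first column and the French text in the second, that there is no header row, and that tokens are split on whitespace. None of these details is stated in the dataset description, so the counts may not match the figures above exactly. The function name corpus_stats is hypothetical.

```python
import csv
import io


def corpus_stats(handle):
    """Count translation units and whitespace tokens on each side.

    Assumes column 0 is Bamun and column 1 is French (not confirmed).
    """
    # QUOTE_NONE keeps apostrophes and quotation marks as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    units = src_tokens = tgt_tokens = 0
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        units += 1
        src_tokens += len(row[0].split())
        tgt_tokens += len(row[1].split())
    return units, src_tokens, tgt_tokens


if __name__ == "__main__":
    # In-memory stand-in for the real file; with the actual corpus one would
    # pass open(path, encoding="utf-8", newline="") instead.
    demo = io.StringIO("U ka pi ?\tAvez-vous appris ?\n")
    print(corpus_stats(demo))
```

Splitting on whitespace treats a standalone question mark as a token, which fits the French spacing convention seen in the samples but is only one of several possible tokenization choices.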
Sample
U púá' shí kuot yú « certificat » ú ngɛ́t kʉә ? Qu’as-tu fait lorsque tu as obtenu ton certificat ?
Pә́ ka pí yíé'rә́ wʉ́n tuə́t lerәwǎ ? Avez-vous appris à écrire ?
Pә́ ka pә́ nsáá saangaam nә́ pʉn nә́ ? Nkʉnsaansaa ? On vous donnait les informations ? Les contes ?
Tʉtʉn, pü ka pә́ ngúón má ndalerәwa shi kuan. Beaucoup, on partait à l'école sans manquer
Pә́ ka pә́ mbә́ túә́ ta' wʉ́n pә́ ghɛt kʉ́ә́ nә́ pʉn nә ? Qu’est-ce qu’on vous faisait quand on vous rattrapait ?
