Akoose-ALCAM-MultimodalDataset
License:
NOODL-1.0
Steward:
Institute of African Digital Humanities
Task: NLP
Release Date: 12/10/2025
Format: MP3, TSV
Size: 16.05 MB
Description
This dataset comprises a datasheet of Akoose (bss) lexical entries collected from the 'Western-Bakossi' subgroup. Each entry is accompanied by illustrative sentences, French translations and a word-by-word breakdown of the Akoose sentences, as well as an equivalent breakdown in English. The resource is enriched with aligned audio recordings, making it ideal for linguistic analysis and the development of speech technology.
Specifics
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Considerations
Restrictions/Special Constraints
The creator of this dataset has made efforts to retain the data in its original form, even where potential errors in transcription, glossing, translation or word-for-word breakdown have been identified. Dione Zoun Ornella Kelly, who originates from Tombel, recorded the lexical entries and example sentences. Users who adjust the transcriptions to match the voice recordings do so at their own risk.
Forbidden Usage
Generative AI, reproduction, duplication, modification, augmentation, copying, distribution, transmission, display, sale, transfer, publication or creation of derivative works without the explicit permission of the legal owner of the dataset.
Processes
Intended Use
(a) Speech-related tasks:
- Automatic speech recognition (ASR): Audio–text alignment allows the evaluation of speech recognition models for Akoose. However, it should be noted that the read sentences are transcribed phonetically. There is at least one competing orthographic standard for Akoose; the General Alphabet of Cameroon's Languages is the one that is closest to phonetic transcription.
- Text-to-speech (TTS): As the dataset contains clean sentence–audio pairs, it can also be used to evaluate speech synthesis or text-to-speech models. Here again, it should be noted that the sentences are written in the IPA and not in the General Alphabet of Cameroon's Languages.
- Speech–text alignment/forced alignment benchmarking: Fine-grained, word-level segmentation provides ground truth for evaluating phoneme- or word-level aligners.
(b) Translation and multilingual tasks:
- Machine translation (Akoose ↔ French, Akoose ↔ English): The sentence-level alignment between Akoose, French and English makes the dataset a parallel corpus for evaluating translation models, within the limitations of the phonetic transcription employed.
- Speech translation (speech-to-text):
(c) Linguistic and lexicographic tasks:
- Morphological analysis/glossed corpus studies: The morpheme-level glosses are valuable for computational morphology, interlinear text modelling (ILTs) and grammar induction tasks.
- Lexicon and part-of-speech tagging: These are useful for building linguistic resources such as dictionaries, morphological analysers or taggers for Akoose.
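For the ASR evaluation use case, character error rate (CER) is one metric that suits phonetically transcribed references, since it does not depend on an orthographic word standard. The following is a minimal sketch, not part of the dataset's tooling: the function names are illustrative, and the choice to normalise both strings to NFC before comparison is an assumption made here so that precomposed and decomposed tone-marked vowels compare equal.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    # Assumption: NFC normalisation, so "o" + combining acute equals "ó".
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# One wrong character out of three.
print(cer("abc", "abd"))
```

Because tone is written with diacritics, a tone error counts as an edit under this measure; stripping tone marks before scoring would give a tone-insensitive variant.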
Metadata
Language
Akoose is a Bantu language belonging to the Manenguba group, also known as the Mbo cluster. The language is spoken in Cameroon, in the Southwest region, Kupe-Manenguba Division, Bangem Sub-division.
Variants
The Administrative Atlas of Cameroon's Languages (Breton and Bikia Fohtung, 1991) lists three varieties of Akoose: Mwamboŋ, Nnɛnoŋ, and Nloŋ. However, Solomon Ngome, the researcher who administered the questionnaire from which some of the material in this dataset was taken, refers to the variety on which the questionnaire is based as Western Bakossi.
Writing System
1. Vowels:
i, e, ɛ, a, ɔ, o, u, ə, ɪ, ʊ (plus long vowels i:, e:, a:, etc.)
2. Consonants:
p, b, t, d, k, g, f, v, s, z, h, l, r, m, n, ŋ, ɲ, w, j
Prenasalized: mb, mp, mf, mv, nd, nt, ndz, nk, ŋg, ŋk, ǹd, ǹz, ǹj, etc.
Glottalized: p’, t’, k’, b’, d’, g’
Unreleased stops: p̚, t̚, k̚, b̚, d̚, g̚
Fricatives: β, ɣ
Affricates: ts, dz
3. Tones:
H: á
L: à
M: ā
HL: â
LH: ǎ
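The five tone marks listed above correspond to Unicode combining diacritics, so tone can be read off or removed programmatically. The sketch below assumes the transcriptions use the standard combining acute, grave, macron, circumflex and caron (directly or via precomposed letters); the function names are illustrative and not part of the dataset.

```python
import unicodedata

# Combining marks for the five tones in the writing system.
TONE_MARKS = {
    "\u0301": "H",   # acute, as in á
    "\u0300": "L",   # grave, as in à
    "\u0304": "M",   # macron, as in ā
    "\u0302": "HL",  # circumflex, as in â
    "\u030c": "LH",  # caron, as in ǎ
}


def tone_sequence(text):
    """Return the tone labels in order of appearance."""
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]


def strip_tones(text):
    """Remove tone marks only; other diacritics (e.g. unreleased-stop marks) are kept."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


print(tone_sequence("à wó mwǎ"))
print(strip_tones("à wó mwǎ"))
```

Decomposing to NFD first means the same code handles precomposed letters such as ó and base-plus-mark sequences such as ə̀, which has no precomposed form.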
Source
The dataset originates from a questionnaire designed to gather basic information about the Akoose lexicon and grammar within the framework of the Atlas Linguistique du Cameroun (ALCAM) project.
Domain
The dataset represents a linguistic questionnaire designed to elicit the basic lexicon and grammatical information.
Size
Total size is 17.3 MB
Structure
The dataset comprises: 1) a datasheet with 374 lines and 15 columns; 2) 305 voice clips read by a single female native speaker; 3) sentence-to-audio mapping with 305 lines and two columns.
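A minimal sketch of how the datasheet and the sentence-to-audio mapping might be joined is given below. The mapping's column names (`sentence`, `file`), the clip name `clip_0001.mp3` and the second row `x2` are hypothetical stand-ins, since the actual headers and file names are not stated here; the in-memory strings replace the real TSV files for illustration only.

```python
import csv
import io
import unicodedata


def read_tsv(handle):
    """Read a tab-separated file into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def norm(text):
    """Normalise Unicode form and outer whitespace so equal sentences compare equal."""
    return unicodedata.normalize("NFC", (text or "").strip())


def attach_audio(entries, mapping, sentence_col, map_sentence_col, map_audio_col):
    """Add an 'audio' key to each entry; None when no clip matches the sentence."""
    lookup = {norm(r[map_sentence_col]): r[map_audio_col] for r in mapping}
    return [{**e, "audio": lookup.get(norm(e.get(sentence_col)))} for e in entries]


# Hypothetical in-memory stand-ins for the two TSV files.
datasheet_tsv = "OrigID\tLangEx\nx1\tà wó mwǎ mpén mé ǹsyə̀l\nx2\t_\n"
mapping_tsv = "sentence\tfile\nà wó mwǎ mpén mé ǹsyə̀l\tclip_0001.mp3\n"

entries = read_tsv(io.StringIO(datasheet_tsv))
mapping = read_tsv(io.StringIO(mapping_tsv))
joined = attach_audio(entries, mapping, "LangEx", "sentence", "file")
for row in joined:
    print(row["OrigID"], row["audio"])
```

With the real files, the handles would come from `open(path, encoding="utf-8", newline="")`. Since the datasheet has more lines than there are voice clips, some entries are expected to end up with `audio` set to `None`.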
Description of columns
#OrigID: original number of lexical entry on paper questionnaire
#IDEdit: modification of #OrigID
#FrenchRef: reference entry (originally provided in French)
#FrenchComm: Original comments about reference entry (#FrenchRef)
#French: Lexical entry in French (overlaps with #FrenchRef)
#Note: note of researcher on the lexical entry
#POS: part of speech
#Class: noun class (where applicable)
#Morph: morphological attribute (ex. plural, singular)
#Word: Lexical entry in Akoose, Western Bakossi variety
#CrossRef: Cross-referencing of lexical entry number
#FrenchEx: Example sentence in French
#LangEx: Example sentence in Akoose
#LangPars: word-for-word parsing in Akoose
#EnglishPars: English equivalent of #LangParsEdit
#EnglishParsEdit: editing of #EnglishPars
Sample
| OrigID | EditID | FrenchRef | FrenchComm | French | Note | POS | Class | Morf | Var | Word | CrossRef | FrenchEx | LangEx | LangExEdit | LangPars | LangParsEdit | EnglishAlign | FrenchAligEdit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x1 | _ | bouche | _ | bouche | _ | sg | _ | ǹsyə̀l | 142 | elle a une petite bouche | à wó mwǎ mpén mé ǹsyə̀l | à \| wó \| mwǎ mpén \| mé \| ǹsyə̀l | 3rd per \| has \| small \| of \| mouth | _ | _ | _ |
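The word-for-word breakdowns in the sample row use the pipe character as a token delimiter, with the Akoose and English breakdowns containing the same number of segments. A minimal sketch of pairing them is shown below; the function names are illustrative, and the assumption that every row has equal segment counts may not hold across the whole datasheet, so mismatches raise an error rather than being silently truncated.

```python
def split_breakdown(cell):
    """Split a pipe-delimited breakdown into trimmed, non-empty segments."""
    return [part.strip() for part in cell.split("|") if part.strip()]


def align_gloss(lang_pars, english_pars):
    """Pair each Akoose segment with its English gloss, position by position."""
    words = split_breakdown(lang_pars)
    glosses = split_breakdown(english_pars)
    if len(words) != len(glosses):
        raise ValueError(
            f"segment count mismatch: {len(words)} words vs {len(glosses)} glosses"
        )
    return list(zip(words, glosses))


# Breakdown cells taken from the sample row.
pairs = align_gloss(
    "à| wó | mwǎ mpén| mé| ǹsyə̀l",
    "3rd per| has| small | of | mouth",
)
for word, gloss in pairs:
    print(f"{word}\t{gloss}")
```

Note that a segment can span more than one orthographic word, as with "mwǎ mpén" glossed as "small", so splitting on whitespace instead of pipes would break the alignment.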
