Akoose-ALCAM-MultimodalDataset

License:

NOODL-1.0

Steward:

Institute of African Digital Humanities

Task: NLP

Release Date: 12/10/2025

Format: MP3, TSV

Size: 16.05 MB


Description

This dataset comprises a datasheet of Akoose (bss) lexical entries collected from the 'Western-Bakossi' subgroup. Each entry is accompanied by illustrative sentences, French translations and a word-by-word breakdown of the Akoose sentences, as well as an equivalent breakdown in English. The resource is enriched with aligned audio recordings, making it ideal for linguistic analysis and the development of speech technology.

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

The creator of this dataset has made efforts to retain the data in its original form, even where potential errors in transcription, glossing, translation or word-for-word breakdown have been identified. The voice recordings of the lexical entries and example sentences were made by Dione Zoun Ornella Kelly, who originates from Tombel. Users of this dataset may wish to adjust the transcription to the voice recording at their own risk.

Forbidden Usage

Generative AI, reproduction, duplication, modification, augmentation, copying, distribution, transmission, display, sale, transfer, publication or creation of derivative works without the explicit permission of the legal owner of the dataset.

Processes

Intended Use

(a) Speech-related tasks:

  • Automatic speech recognition (ASR): Audio–text alignment allows the evaluation of speech recognition models for Akoose. Note, however, that the read sentences are transcribed phonetically. There is at least one competing orthographic standard for Akoose; the General Alphabet of Cameroon's Languages is the one closest to phonetic transcription.

  • Text-to-speech (TTS): As the dataset contains clean sentence–audio pairs, it can also be used to evaluate speech synthesis or text-to-speech models. Here again, note that the sentences are written in the IPA alphabet and not in the General Alphabet of Cameroon's Languages.

  • Speech–text alignment/forced alignment benchmarking: Fine-grained, word-level segmentation provides ideal ground truth for evaluating phoneme- or word-level aligners.

(b) Translation and multilingual tasks:

  • Machine translation (Akoose ↔ French, Akoose ↔ English): The sentence-level alignment between Akoose, French and English makes the dataset a parallel corpus for evaluating translation models, subject to the limitations of the phonetic transcription used.

  • Speech translation (speech-to-text)

(c) Linguistic and lexicographic tasks:

  • Morphological analysis/glossed corpus studies: The morpheme-level glosses are valuable for computational morphology, interlinear text modelling (ILTs) and grammar induction tasks.

  • Lexicon and part-of-speech tagging: The lexical entries and part-of-speech labels are useful for building linguistic resources such as dictionaries, morphological analysers or taggers for Akoose.
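For the ASR evaluation use case, one common metric is the character error rate (CER). The sketch below is an illustration and not part of the dataset tooling: the function names are hypothetical, and it assumes that both strings are compared after Unicode NFD normalisation, so that a tone diacritic counts as its own character whether or not the transcript stores it as a precomposed letter.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate over NFD code points (assumption: diacritics count separately)."""
    ref = unicodedata.normalize("NFD", reference)
    hyp = unicodedata.normalize("NFD", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this convention a hypothesis that gets every segment right but drops the tone marks is still penalised, which may or may not be desirable depending on the evaluation goal.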

Metadata

Language

Akoose is a Bantu language belonging to the Manenguba group, also known as the Mbo Cluster. It is spoken in Cameroon, in the Southwest Region, Kupe-Manenguba Division, Bangem Sub-division.

Variants

The Administrative Atlas of Cameroon's Languages (Breton and Bikia Fohtung, 1991) lists three varieties of Akoose: Mwamboŋ, Nnɛnoŋ, and Nloŋ. However, Solomon Ngome, the researcher who administered the questionnaire from which some of the material in this dataset was taken, refers to the variety on which the questionnaire is based as Western Bakossi.

Writing System

1. Vowels:

i, e, ɛ, a, ɔ, o, u, ə, ɪ, ʊ (plus long vowels i:, e:, a:, etc.)

2. Consonants:

p, b, t, d, k, g, f, v, s, z, h, l, r, m, n, ŋ, ɲ, w, j

  • Prenasalized: mb, mp, mf, mv, nd, nt, ndz, nk, ŋg, ŋk, ǹd, ǹz, ǹj, etc.

  • Glottalized: p’, t’, k’, b’, d’, g’

  • Unreleased stops: p̚, t̚, k̚, b̚, d̚, g̚

  • Fricatives: β, ɣ

  • Affricates: ts, dz

3. Tones:
  • H: á

  • L: à

  • M: ā

  • HL: â

  • LH: ǎ
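As a minimal sketch of how the tone marks listed above could be read off a transcription, the snippet below decomposes the text with Unicode NFD and maps each combining diacritic to its tone label. The function name `tone_sequence` is hypothetical, and the mapping assumes the transcriptions use the standard combining acute, grave, macron, circumflex and caron.

```python
import unicodedata

# Combining diacritics for the five tone marks listed in the Writing System section.
TONE_MARKS = {
    "\u0301": "H",   # acute
    "\u0300": "L",   # grave
    "\u0304": "M",   # macron
    "\u0302": "HL",  # circumflex
    "\u030c": "LH",  # caron
}


def tone_sequence(text):
    """Return the tone labels found in `text`, in order of appearance."""
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]


# Example with the sample sentence from this datasheet.
print(tone_sequence("à wó mwǎ mpén mé ǹsyə̀l"))
```

Normalising first means the function works the same way for precomposed letters and for base letters followed by a combining mark.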

Source

The dataset originates from a questionnaire designed to gather basic information about the Akoose lexicon and grammar within the framework of the Atlas Linguistique du Cameroun (ALCAM) project.

Domain

The dataset represents a linguistic questionnaire designed to elicit the basic lexicon and grammatical information.

Size

Total size is 17.3 MB.

Structure

The dataset comprises: 1) a datasheet with 374 lines and 15 columns; 2) 305 voice clips read by a single female native speaker; 3) sentence-to-audio mapping with 305 lines and two columns.
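The three parts described above could be combined roughly as follows. This is a sketch under stated assumptions: the column names of the sentence-to-audio mapping are not documented here, so they are passed in as parameters, and the join assumes that the sentence text in the mapping matches the example sentence in the datasheet exactly. The rows in the demo are made-up placeholders, not dataset content.

```python
import csv
import io


def read_tsv(handle):
    """Return the rows of a tab-separated file as a list of dicts keyed by header."""
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def attach_audio(rows, mapping, sentence_key, map_sentence_key, map_clip_key):
    """Pair each datasheet row with its clip name, or None when no clip is mapped."""
    clip_by_sentence = {m[map_sentence_key]: m[map_clip_key] for m in mapping}
    return [(row, clip_by_sentence.get(row[sentence_key])) for row in rows]


# Placeholder data; real files would be opened with open(path, encoding="utf-8", newline="").
datasheet = read_tsv(io.StringIO("Word\tLangEx\nw1\tsentence one\nw2\tsentence two\n"))
mapping = read_tsv(io.StringIO("sentence\tclip\nsentence one\tclip_001.mp3\n"))
paired = attach_audio(datasheet, mapping, "LangEx", "sentence", "clip")
```

Since the datasheet has more lines (374) than there are voice clips (305), some rows are expected to come back without a clip.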

Description of columns
  • #OrigID: original number of lexical entry on paper questionnaire

  • #IDEdit: modification of #OrigID

  • #FrenchRef: reference entry (originally provided in French)

  • #FrenchComm: Original comments about reference entry (#FrenchRef)

  • #French: Lexical entry in French (overlaps with #FrenchRef)

  • #Note: note of researcher on the lexical entry

  • #POS: part of speech

  • #Class: noun class (where applicable)

  • #Morph: morphological attribute (e.g. plural, singular)

  • #Word: Lexical entry in Akoose, Western Bakossi variety

  • #CrossRef: Cross-referencing of lexical entry number

  • #FrenchEx: Example sentence in French

  • #LangEx: Example sentence in Akoose

  • #LangPars: word-for-word parsing in Akoose

  • #EnglishPars: English equivalent of #LangParsEdit

  • #EnglishParsEdit: editing of #EnglishPars

Sample

OrigIDEditIDFrenchRefFrenchCommFrenchNotePOSClassMorfVarWordCrossRefFrenchExLangExLangExEditLangParsLangParsEditEnglishAlignFrenchAligEdit
x1_bouche_bouche_sg_ǹsyə̀l142elle a une petite boucheà wó mwǎ mpén mé ǹsyə̀là| wó | mwǎ mpén| mé| ǹsyə̀l3rd per| has| small | of | mouth___
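The word-for-word breakdown in the sample uses "|" to separate units, in both the Akoose parse and its English equivalent. The sketch below, with hypothetical function names, splits both strings on "|" and pairs the units, assuming that the two fields always contain the same number of units.

```python
def split_cells(parsed):
    """Split a '|'-delimited breakdown into trimmed units."""
    return [cell.strip() for cell in parsed.split("|")]


def align_gloss(lang_pars, english_pars):
    """Pair each Akoose unit with its English gloss; fail loudly on a count mismatch."""
    words = split_cells(lang_pars)
    glosses = split_cells(english_pars)
    if len(words) != len(glosses):
        raise ValueError(f"{len(words)} units but {len(glosses)} glosses")
    return list(zip(words, glosses))


# Values taken from the sample row above.
pairs = align_gloss("à| wó | mwǎ mpén| mé| ǹsyə̀l",
                    "3rd per| has| small | of | mouth")
for word, gloss in pairs:
    print(f"{word}\t{gloss}")
```

Raising an error on a mismatch, rather than silently truncating, makes it easier to spot rows where the parse and the gloss have drifted apart.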