Ewondo-Yanda-ALCAM-MultimodalDataset

License:

NOODL-1.0

Steward:

Institute of African Digital Humanities

Task: NLP

Release Date: 12/7/2025

Format: MP3, TSV

Size: 18.09 MB


Description

This dataset comprises a datasheet of Ewondo (Ewo) lexical entries collected in the Yanda subgroup. Each entry is accompanied by illustrative sentences, word-by-word glosses and French translations. The resource is enriched with aligned audio recordings, making it suitable for linguistic analysis and speech technology development.

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

The creator of this dataset has made efforts to retain the data in its original form, even where potential errors in transcription, glossing, translation or parsing have been identified. The lexical entries and example sentences were recorded by Christelle Manga Bakong of the Moog-Ebanda subgroup. Users who adjust the transcriptions to match the voice recordings do so at their own risk.

Forbidden Usage

Generative AI, reproduction, duplication, modification, augmentation, copying, distribution, transmission, display, sale, transfer, publication or creation of derivative works without the explicit permission of the point of contact acting on behalf of the legal owner of the dataset.

Processes

Intended Use

(a) Speech-related tasks

  • Automatic speech recognition (ASR): Audio–text alignment allows the evaluation of speech recognition models for Ewondo. Note, however, that the read sentences are transcribed phonetically. At least two orthographic standards exist for Ewondo: the General Alphabet of Cameroon's Languages, which is the closest to phonetic transcription, and the Catholic missionaries' orthography, inspired by the model laid out by François Pichon in 1950.

  • Text-to-speech (TTS): Because the dataset contains clean sentence–audio pairs, it can also be used to evaluate speech synthesis models. Here again, note that the sentences are written in the IPA and not in the Catholic alphabet.

  • Speech–text alignment/forced alignment benchmarking: Fine-grained, word-level segmentation provides ground truth for evaluating phoneme- or word-level aligners.

(b) Translation and multilingual tasks

  • Machine translation (Ewondo ↔ French): The sentence-level alignment between Ewondo and French makes the dataset a parallel corpus for evaluating translation models, subject to the limitations of the phonetic transcription used.

  • Speech translation (speech-to-text)

(c) Linguistic and lexicographic tasks

  • Morphological analysis/glossed corpus studies: The morpheme-level glosses are valuable for computational morphology, interlinear text modelling (ILTs) and grammar induction tasks.

  • Lexicon and part-of-speech tagging: The lexical entries and part-of-speech labels are useful for building linguistic resources such as dictionaries, morphological analysers or taggers for Ewondo.

Metadata

Language

Ewondo is a Narrow Bantu language indigenous to a population located mainly in the Centre Region of Cameroon, with pockets of settlement in the South and East Regions. It serves as a vehicular language for populations in the South and East Regions of Cameroon, and has also developed into a creole known as Mongo Ewondo.

Variants

The term 'Ewondo' is used to describe a set of linguistic varieties whose speakers may or may not identify with the term. This is partly due to the structures of linguistic governance. In Cameroon, a nationwide linguistic survey was undertaken in the second half of the 1970s and the first half of the 1980s as part of the Atlas Linguistique du Cameroun project. The survey resulted in the publication of the Atlas of Cameroonian Languages, also referred to as the Administrative Atlas of Cameroonian Languages. In this work, a macro-language called Beti-Fang is identified, with Ewondo being one of the major micro-languages alongside Fang, Bulu, Ntumu and Eton. Other subgroups speaking varieties that differ to a greater or lesser extent have often been subsumed under one of the more prominent Beti-Fang micro-languages. Consequently, it is very difficult to determine with confidence which variables justify categorising a particular linguistic variety as Ewondo without distorting reality. For this reason, the author of this dataset has deemed it worthwhile to refer to the specific geographical locations or particular subgroups in which the data presented here was collected. This dataset was collected in the Yanda subgroup, found in Yaoundé and in other settlements outside Yaoundé.

Writing System

The writing system used for the transcription of Ewondo in this dataset is the International Phonetic Alphabet (IPA).

1. Vowels

The vowel system is as follows: i, e, ɛ, a, ɔ, o, u, ə

2. Consonants

Simple consonants: b, d, dz, f, g, ɣ, h, k, l, m, mb, mf, mv, n, ɲ, ŋ, nd, ŋg, ŋk, ndz, r, p, s, t, ts, v, w, y, z
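As an illustration of how the inventory above could be used, the following Python sketch splits a transcribed word into the listed symbols by greedy longest match, after removing combining diacritics and the step arrows. The function names `strip_marks` and `segment` are hypothetical, and greedy matching is an assumption of this sketch: it will always prefer `ndz` over `nd` + `z`, which may not reflect the intended analysis in every word.

```python
import unicodedata

# Inventory copied from the lists above.
VOWELS = ["i", "e", "ɛ", "a", "ɔ", "o", "u", "ə"]
CONSONANTS = [
    "b", "d", "dz", "f", "g", "ɣ", "h", "k", "l", "m", "mb", "mf", "mv",
    "n", "ɲ", "ŋ", "nd", "ŋg", "ŋk", "ndz", "r", "p", "s", "t", "ts",
    "v", "w", "y", "z",
]
# Longest symbols first so that "ndz" is tried before "nd" and "n".
SYMBOLS = sorted(VOWELS + CONSONANTS, key=len, reverse=True)


def strip_marks(word):
    """Remove combining diacritics and the step arrows from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn" and ch not in "↓↑"
    )


def segment(word):
    """Return (segments, unknown): inventory symbols and unmatched characters."""
    bare = strip_marks(word)
    segments, unknown = [], []
    i = 0
    while i < len(bare):
        for sym in SYMBOLS:
            if bare.startswith(sym, i):
                segments.append(sym)
                i += len(sym)
                break
        else:
            # Character is not in the inventory: record it and move on.
            unknown.append(bare[i])
            i += 1
    return segments, unknown


print(segment("àɲù"))  # (['a', 'ɲ', 'u'], [])
```

A non-empty `unknown` list flags characters outside the documented inventory, which can help spot transcription inconsistencies without altering the data.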

3. Tone system

The datasheet shows a tone system with several marked tone categories:

  • High tone (H): á, é, í, ó, ú, ɛ́, ɔ́, ḭ́, etc.

  • Low tone (L): à, è, ì, ò, ù, ɛ̀, ɔ̀, ǹ, etc.

  • Falling contour tone (HL): â, ê, î, ô, û, ɛ̂, ɔ̂, ɲ̂, etc.

  • Rising contour tone (LH): ǎ, ě, ǒ, ǔ, ɔ̌, ɛ̌, etc.

  • Mid / level tone (from downstep/upstep): ā, ē, ī, ō, ū, ɛ̄, ɔ̄

  • ↓ marks downstep (lowered high)

  • ↑ marks upstep (raised high)
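The tone categories listed above can be read off a transcription programmatically. The sketch below is a minimal Python illustration, assuming that tones are encoded with the standard Unicode combining acute, grave, circumflex, caron and macron, and that downstep and upstep use the arrows U+2193 and U+2191. The function name `tone_sequence` and the label strings are hypothetical.

```python
import unicodedata

# Combining diacritic -> tone label, following the categories listed above.
TONE_MARKS = {
    "\u0301": "H",   # acute: high
    "\u0300": "L",   # grave: low
    "\u0302": "HL",  # circumflex: falling
    "\u030C": "LH",  # caron: rising
    "\u0304": "M",   # macron: mid / level
}
STEP_MARKS = {"\u2193": "downstep", "\u2191": "upstep"}


def tone_sequence(text):
    """Return the tone and step labels found in text, in order of appearance."""
    labels = []
    # NFD splits precomposed letters such as "á" into base letter + diacritic.
    for ch in unicodedata.normalize("NFD", text):
        if ch in TONE_MARKS:
            labels.append(TONE_MARKS[ch])
        elif ch in STEP_MARKS:
            labels.append(STEP_MARKS[ch])
    return labels


print(tone_sequence("à bə̀lə́ mǎ↓n áɲù"))
# ['L', 'L', 'H', 'LH', 'downstep', 'H', 'L']
```

Normalising to NFD first means the function gives the same result whether a tone-marked vowel is stored as one precomposed code point or as a base letter followed by a combining mark.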

Source

The dataset originates from a questionnaire designed to gather basic information about the Ewondo lexicon and grammar within the framework of the Atlas Linguistique du Cameroun (ALCAM) project.

Domain

The dataset represents a linguistic questionnaire designed to elicit the basic lexicon and grammatical information.

Size

Total size is 19.5 MB.

Structure

The dataset comprises: 1) a datasheet with 542 lines and 19 columns; 2) 505 voice clips read by a single female native speaker; 3) sentence-to-audio mapping with 504 lines and two columns.
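The three components described above could be combined along the following lines. This is a hedged Python sketch, not the dataset's official loader: it assumes the datasheet is a TSV with a header row containing a `LangEx` column, and that the mapping file is a headerless TSV whose first column is the sentence and whose second column is the clip file name. The function names, the column order of the mapping file and the in-memory sample contents are all assumptions made for illustration.

```python
import csv
import io


def read_datasheet(stream):
    """Read a tab-separated datasheet with a header row into a list of dicts."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(row) for row in reader]


def read_mapping(stream):
    """Read a two-column, headerless TSV into {sentence: clip_name} (assumed layout)."""
    mapping = {}
    for row in csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 2:
            mapping[row[0].strip()] = row[1].strip()
    return mapping


def attach_audio(entries, mapping, sentence_column="LangEx"):
    """Return copies of the entries with an 'audio' key (None when no clip matches)."""
    linked = []
    for entry in entries:
        sentence = (entry.get(sentence_column) or "").strip()
        linked.append({**entry, "audio": mapping.get(sentence)})
    return linked


# Tiny in-memory stand-ins for the real files; the contents are made up.
sheet = io.StringIO("OrigID\tWord\tLangEx\n1\tfoo\tfoo bar\n2\tbaz\t\n")
clips = io.StringIO("foo bar\tclip_0001.mp3\n")
linked = attach_audio(read_datasheet(sheet), read_mapping(clips))
print([(e["OrigID"], e["audio"]) for e in linked])
```

With real files, the `io.StringIO` objects would be replaced by files opened with `encoding="utf-8"` and `newline=""`. Entries without an example sentence simply receive `None` as their audio value.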

Description of columns
  • #OrigID: original number of lexical entry on paper questionnaire

  • #EditID: modification of #OrigID

  • #FrenchRef: reference entry (originally provided in French)

  • #FrenchComm: Original comments about reference entry (#FrenchRef)

  • #French: Lexical entry in French (overlaps with #FrenchRef)

  • #Note: note of researcher on the lexical entry

  • #POS: part of speech

  • #Class: noun class (where applicable)

  • #Morph: morphological attribute (e.g. plural, singular)

  • #Var: (na)

  • #Word: Lexical entry in Ewondo, Yanda variety

  • #CrossRef: Cross-referencing of lexical entry number

  • #FrenchEx: Example sentence in French

  • #LangEx: Example sentence in Ewondo

  • #LangExEdit: manual editing of #LangEx

  • #LangPars: word for word parsing in Ewondo

  • #LangParsEdit: editing of #LangPars

  • #FrenchPars: French equivalent of #LangParsEdit

  • #FrenchParsEdit: editing of #FrenchPars

Sample

OrigID, EditID, FrenchRef, FrenchComm, French, Note, POS, Class, Morf, Var, Word, CrossRef, FrenchEx, LangEx, LangExEdit, LangPars, LangParsEdit, FrenchPars, FrenchParsEdit
x1_bouche_bouche_nsg_àɲù_elle a une petite boucheà bə̀lə́ mǎ↓n áɲù| à| bə̀lə́ | mǎ↓n | áɲù |_| Elle | a | une | petite | bouche |_
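In the sample row, the parsing cells appear to separate words with the `|` character. Assuming that convention holds throughout the datasheet, the Python sketch below pairs each Ewondo word with its French gloss and reports whether the two sides have the same number of tokens. The function names `split_tokens` and `align` are hypothetical.

```python
from itertools import zip_longest


def split_tokens(cell):
    """Split a pipe-delimited parsing cell into non-empty, trimmed tokens."""
    return [tok.strip() for tok in cell.split("|") if tok.strip()]


def align(lang_pars, french_pars):
    """Pair Ewondo tokens with French glosses.

    Returns (pairs, balanced). Missing items on the shorter side are None,
    and balanced is False when the token counts differ.
    """
    words = split_tokens(lang_pars)
    glosses = split_tokens(french_pars)
    pairs = list(zip_longest(words, glosses))
    return pairs, len(words) == len(glosses)


pairs, balanced = align(
    "| à| bə̀lə́ | mǎ↓n | áɲù |",
    "| Elle | a | une | petite | bouche |",
)
print(balanced)   # False: four Ewondo tokens against five French tokens
print(pairs[-1])  # (None, 'bouche')
```

The sample row itself is unbalanced, so a positional pairing is not guaranteed to be a true word-for-word gloss. Checking the `balanced` flag before relying on the pairs avoids silently misaligned glosses.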