Mvele_ALCAM-MultimodalDataset

License:

NOODL-1.0

Steward:

Institute of African Digital Humanities

Task: NLP

Release Date: 3/20/2026

Format: MP3, TSV

Size: 14.13 MB

Description

Mvele_ALCAM-MultimodalDataset is a richly curated, multimodal linguistic dataset dedicated to the documentation and technological enhancement of the Mvele variety of the widely designated 'Ewondo language'. Mvele is a localised and socially embedded speech form that is rarely represented in standard grammatical descriptions or lexicographical resources. The dataset comprises three closely aligned components: (i) a structured datasheet containing carefully selected example sentences reflecting casual, albeit non-authentic, usage in the Mvele variety; (ii) high-quality audio recordings of these sentences, produced by a native speaker; and (iii) an explicit audio–sentence mapping file enabling precise alignment between the textual and acoustic data.

The dataset's primary added value lies in its explicit focus on the Mvele variety of Ewondo, which typically remains invisible in reference grammars, dictionaries and educational materials that often privilege standardised or prestigious varieties. The dataset captures micro-variation in phonetics, phonology, morphosyntax and lexical choice, which are essential for understanding socially situated linguistic practices rather than a homogeneous, abstract system. In this sense, the dataset contributes to a more inclusive representation of linguistic diversity.

From a methodological perspective, the dataset is designed to bridge the gap between language documentation and language technology. The parallel availability of text in the Mvele variety and in French, alongside aligned speech, makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), machine translation (MT), forced alignment, pronunciation modelling and multimodal language learning tools. At the same time, the structured datasheet supports linguistic analysis, contrastive studies with other language varieties and pedagogical uses in teacher training and language revitalisation contexts.

More broadly, the Mvele_ALCAM-MultimodalDataset exemplifies an approach to African language resources that highlights fluidity, longitudinal variation, orality and community-based practice.

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

By downloading this dataset, you agree:
- to use it for research and scientific purposes only;
- not to re-host or re-share this dataset.

Forbidden Usage

You agree not to use the data for: determining the identity of the speaker in the dataset; attempting to clone the voice of, or to train models that imitate, the speaker in this dataset; Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.

Processes

Intended Use

(a) Speech-related tasks:
- Automatic speech recognition (ASR): Audio–text alignment allows the evaluation of speech recognition models for Mvele. Note, however, that the read sentences are transcribed phonetically. There are competing orthographic standards for Ewondo, the designated language of which Mvele is considered to be a dialect in standard classification. The General Alphabet of Cameroon's Languages is the one closest to phonetic transcription; another is the Catholic missionaries' orthography, inspired by the model laid out by François Pichon in 1950.
- Text-to-speech (TTS): As the dataset contains clean sentence–audio pairs, it can also be used to evaluate speech synthesis or text-to-speech models. Here again, note that the sentences are written in the IPA and not in the General Alphabet of Cameroon's Languages, the Protestant alphabet or the Catholic alphabet.
- Speech–text alignment/forced alignment benchmarking: Fine-grained, word-level segmentation provides ideal ground truth for evaluating phoneme- or word-level aligners.

(b) Translation and multilingual tasks:
- Machine translation (Mvele, considered a "dialect" of Ewondo, ↔ French): The sentence-level alignment between Mvele/Ewondo and French makes the dataset a parallel corpus for evaluating translation models, within the limitations of the phonetic transcription used.
- Speech translation (speech-to-text):

(c) Linguistic and lexicographic tasks:
- Morphological analysis/glossed corpus studies: The morpheme-level glosses are valuable for computational morphology, interlinear text modelling (ILTs) and grammar induction tasks.
- Lexicon and part-of-speech tagging: These are useful for building linguistic resources such as dictionaries, morphological analysers or taggers for Mvele/Ewondo.
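Because the sentences are transcribed in IPA rather than in a practical orthography, recognition output for this dataset could reasonably be scored at the character level. The sketch below computes a character error rate (CER) with a plain Levenshtein distance. The choice of metric and the names `edit_distance` and `cer` are assumptions made for illustration; the dataset itself does not prescribe an evaluation protocol.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate over Unicode code points after NFC normalisation."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Normalising both strings first means that a precomposed tone-marked vowel and its base-plus-diacritic spelling compare as equal. Note that this counts code points, not graphemes: a tone-marked letter with no precomposed form (such as ə̀) still counts as two units.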

Metadata

Other information

Language

Mvele (also written Mvelé) is a variety of the Beti macro-linguistic area belonging to the Narrow Bantu family. Its speakers are located primarily in the Centre Region of Cameroon, in the Mefou-et-Afamba and Nyong-et-Mfoumou Divisions, as well as in the East Region, in the Lom-et-Djérem Division. The language is also referred to as bebele in some classifications.

Variants

It is difficult to determine the extent to which Mvele varies as a linguistic group. Not only is the group not recognised as a distinct language, it also shares close similarities with neighbouring groups such as Mbida-Mbani, Bene, Yengo, Moog-Nyenge and Tsinga.

Writing System

The writing system used for the transcription of Mvele in this dataset is the International Phonetic Alphabet (IPA), as reflected in lexical entries (Word) and sentence-level examples (LangEx) in the datasheet. The phonological inventory described below is derived directly from the attested forms in the LangEx and Word columns of the datasheet.

1. Vowels

The vowel system attested in the dataset is as follows:

i, e, ɛ, a, ɔ, o, u, ə

These vowels occur both with and without tone marking in lexical items and running text (e.g. mə̀ndíp 'water', àɲù 'mouth', ǹló 'head', ngə́m 'tail').

2. Consonants

The consonant inventory reflected in the dataset includes the following simple, prenasalised and affricate consonants:

b, d, dz, dʒ, f, g, h, k, l, m, mb, mv, n, nd, ng, ŋ, ŋg, ŋk, p, r, s, t, ts, v, w, y, z, ɲ, ʒ

These consonants appear consistently across noun stems, verbal forms, derivational patterns, and noun-class alternations (e.g. ǹló 'head', ngə́m 'tail', dʒóé 'nose', àɲù 'mouth', mə̀ndíp 'water').

3. Tone system

The datasheet shows lexically and grammatically contrastive tones, marked directly on vowels and on the sonorants m and n. The following tonal categories are attested in the LangEx column:

High tone (H): á, é, ɛ́, í, ó, ɔ́, ú, ə́, ń, ḿ
Low tone (L): à, è, ɛ̀, ì, ò, ɔ̀, ù, ə̀, ǹ, m̀
Falling contour tone (HL): â, ê, î, ô, ɔ̂, û, ə̂, ɛ̂
Rising contour tone (LH): ǎ, ě, ǐ, ǒ, ɔ̌, ǔ, ə̌, ɛ̌
Falling-low contour tone: attested on consonant-final syllables in a small set of lexical items (e.g. àsup᷆, ì dɔ́k᷆, sò bók᷆)

Unmarked vowels represent tonally neutral or contextually determined syllables.
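The tone categories listed above can be read off the text by decomposing each character into its base letter and combining diacritics. The following is a minimal sketch, assuming the four main tones are encoded with the standard combining acute, grave, circumflex and caron marks. The mark used for the falling-low contour on consonant-final syllables is not handled here and is simply passed through. The names `TONE_MARKS`, `tone_sequence` and `strip_tones` are illustrative.

```python
import unicodedata

# Assumed mapping from combining diacritics to the tone labels used above.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent
    "\u0300": "L",   # combining grave accent
    "\u0302": "HL",  # combining circumflex accent
    "\u030C": "LH",  # combining caron
}


def tone_sequence(text):
    """Return the tone labels in order of appearance; unmarked syllables are skipped."""
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]


def strip_tones(text):
    """Remove the four tone diacritics and keep the segmental IPA symbols."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)
```

For example, applying `tone_sequence` to mə̀ndíp 'water' yields a low tone followed by a high tone, and `strip_tones` reduces ǹló 'head' to its toneless segments. A toneless form of this kind may be useful for lexicon lookup, although it discards contrasts that the transcription treats as meaningful.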

Source

The dataset was collected through a questionnaire designed to gather basic information about the Mvele lexicon and grammar. This was done as part of the Atlas Linguistique du Cameroun (ALCAM) project.

Domain

The dataset represents a linguistic questionnaire designed to elicit the basic lexicon and grammatical information.

Size

Total size is approximately 14.48 MB (uncompressed).

Structure

The dataset comprises: 1) a datasheet (ALCAM_mvele_datasheet.tsv) with 990 lines and 20 columns; 2) 366 voice clips read by a single female native speaker; 3) a sentence-to-audio mapping file (audio_mapping.tsv) with 336 lines and 2 columns.
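Since the number of voice clips (366) and the number of lines in the mapping file (336) stated above differ, it may be worth checking which clips are covered by the mapping before relying on the alignment. The helper below is a sketch only; the function name and the returned keys are hypothetical.

```python
def compare_clips_to_mapping(clip_names, mapped_names):
    """Compare audio file names on disk with the names listed in the mapping file."""
    clips = set(clip_names)
    mapped = set(mapped_names)
    return {
        # clips present on disk but absent from the mapping file
        "unmapped_clips": sorted(clips - mapped),
        # names listed in the mapping file with no matching clip on disk
        "missing_clips": sorted(mapped - clips),
    }
```

In practice, `clip_names` could be gathered with `pathlib.Path(...).glob("*.mp3")` over whichever directory holds the audio, and `mapped_names` taken from the first column of audio_mapping.tsv. The directory layout is not described in this document, so that part is left to the reader.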

Description of columns

#OrigID: original number of lexical entry on paper questionnaire
#EditID: modification of #OrigID
#FrenchRef: reference entry (originally provided in French)
#FrenchComm: original comments about reference entry (#FrenchRef)
#French: lexical entry in French (overlaps with #FrenchRef)
#Note: note of researcher on the lexical entry
#POS: part of speech
#Class: noun class (where applicable)
#Morf: morphological attribute (e.g. plural, singular)
#Var: (na)
#Word: lexical entry in Mvele
#CrossRef: cross-referencing of lexical entry number
#FrenchEx: example sentence in French
#LangEx: example sentence in Mvele
#LangExEdit: manual editing of #LangEx
#FrenchExEdit: edited French equivalent of #FrenchEx
#LangPars: word-for-word parsing in Mvele
#LangParsEdit: editing of #LangPars
#FrenchPars: French equivalent of #LangParsEdit
#FrenchParsEdit: editing of #FrenchPars
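Several columns come in original/edited pairs (for example #LangEx and #LangExEdit). A reader of the datasheet might therefore prefer the edited value where one exists and fall back to the original otherwise. The sketch below assumes that the header row uses the column names exactly as listed above (including the leading #), that the file is UTF-8 encoded, and that an empty edited cell means "no edit". None of these points is confirmed by this document. The value EDITED in the sample is a placeholder, not real data.

```python
import csv
import io

# Tiny stand-in for ALCAM_mvele_datasheet.tsv with three of the 20 columns.
SAMPLE = (
    "#Word\t#LangEx\t#LangExEdit\n"
    "w1\toriginal sentence 1\t\n"
    "w2\toriginal sentence 2\tEDITED\n"
)


def read_datasheet(handle):
    """Read a tab-separated datasheet into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def preferred(row, edited, original):
    """Return the edited cell if it is non-empty, otherwise the original cell."""
    value = (row.get(edited) or "").strip()
    return value if value else (row.get(original) or "").strip()


rows = read_datasheet(io.StringIO(SAMPLE))
sentences = [preferred(r, "#LangExEdit", "#LangEx") for r in rows]
```

For the real file, `io.StringIO(SAMPLE)` would be replaced by `open("ALCAM_mvele_datasheet.tsv", encoding="utf-8", newline="")`. Quoting is disabled so that quotation marks inside example sentences are read literally.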

Sample

audio files	words & sentences
ee1ad49eceb6508ef34cfe082cc0d2c9.mp3	ǹló; à mbə́bə̀ lə́ mùt ǹló é à ìáp kíŋ
d748ad689de834cd4d2f6d26c3df9bcc.mp3	m̀vòát; mì mvòát míé míá vín
b2394ad6d9575253fd22609a7df201b9.mp3	àsòŋ; ínə̀ mə̀sòŋ mə́ mvú
b349803a3f0048e27048ac0a0f5ffabc.mp3	dʒóé; à ìà dzóé
77621084e63b964cd6531f4220fa0aeb.mp3	kíŋ; kíŋ yé ì nə̀ ànɛ́n
424faf0a2b287c00fc95910da753e08e.mp3	àbɛ́; à sòp mə̀bɛ́
7b7783c89cd70133f5ec3b3026e9501a.mp3	àkàn; à sòp mə̀ kàn mə́ móán ù í
fd56b04277aa31ce568b7d83f7a3829d.mp3	àbùm; ò bə̀lə́ ábùm ǹdzálán à mə̀ndíp
4afc7ff50c518fb196c0c3c738648c7e.mp3	fól; à bə̀lə́ fól á wá
94bd4153535ea963bdc8b61d6bee0d7a.mp3	támbá; támbá îɲì ɛ́nə̀ àyàp
206a5fa92bdc4675f320d1d06e353465.mp3	tóŋ; tíd nì ùmbə́ bə̀lə́ hə̀ tóŋó díà
e1fe4a5142c672c78467f876a3ec921a.mp3	ngə́m; à líndì ngə́m mvú dʒàáŋ
03a37bb7cb782526b4e6c09be76a060e.mp3	àfàp; mə́ fàp mə́ə́ nóàn yɛ́ɛ́lí má híè
aa320d8e9f6bb0b58e167328214948c4.mp3	mìmbíɛ́n; àngə́də́k à ìà à ngə́də́k
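
In the sample above, each row pairs an audio file name with a headword and an example sentence separated by a semicolon. A minimal parsing sketch follows, assuming that the mapping file is tab-separated with a single header row and that the first semicolon always separates the headword from the sentence. The function name `parse_mapping` and the dictionary keys are illustrative.

```python
import csv
import io

# Two rows copied from the sample, preceded by the header row.
SAMPLE = (
    "audio files\twords & sentences\n"
    "ee1ad49eceb6508ef34cfe082cc0d2c9.mp3\tǹló; à mbə́bə̀ lə́ mùt ǹló é à ìáp kíŋ\n"
    "77621084e63b964cd6531f4220fa0aeb.mp3\tkíŋ; kíŋ yé ì nə̀ ànɛ́n\n"
)


def parse_mapping(handle):
    """Parse a mapping file into dicts with audio, word and sentence fields."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row
    entries = []
    for row in reader:
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        audio, text = row
        # Split on the first semicolon only; later semicolons stay in the sentence.
        word, sep, sentence = text.partition(";")
        entries.append({
            "audio": audio.strip(),
            "word": word.strip(),
            "sentence": sentence.strip() if sep else "",
        })
    return entries


entries = parse_mapping(io.StringIO(SAMPLE))
```

Rows without a semicolon keep the whole text in `word` and leave `sentence` empty, which makes such cases easy to spot afterwards.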