Bulu_ALCAM-MultimodalDataset

License:

NOODL-1.0

Steward:

Institute of African Digital Humanities

Task: NLP

Release Date: 4/7/2026

Format: MP3, TSV

Size: 31.28 MB

Description

ALCAM-Bulu-MultimodalDataset is a richly curated, multimodal linguistic dataset dedicated to the documentation and technological enhancement of the Bulu language, a Bantu language spoken in southern Cameroon. The dataset comprises three closely aligned components: (i) a structured datasheet containing carefully selected example sentences and lexical entries reflecting usage in the Bulu language; (ii) high-quality audio recordings of these sentences and lexical items, produced by a native speaker; and (iii) an explicit audio–sentence mapping file enabling precise alignment between the textual and acoustic data.

The dataset's primary added value lies in its explicit focus on Bulu, a language which, despite having a significant speaker community and a rich literary tradition, remains severely under-represented in digital language resources and speech technology infrastructures. The dataset captures the phonological, morphological and lexical properties of Bulu through a structured elicitation methodology, and is designed to serve as a foundational resource for both linguistic analysis and the development of speech technology applications.

From a methodological perspective, the dataset is designed to bridge the gap between language documentation and language technology. The parallel availability of text in Bulu alongside aligned speech makes the dataset suitable for a wide range of applications, including automatic speech recognition (ASR), text-to-speech (TTS), forced alignment, pronunciation modelling, and multimodal language learning tools. The structured datasheet further supports linguistic analysis, morphological study, contrastive studies with related varieties of the Beti-Fang group, and pedagogical uses in teacher training and language revitalisation contexts.

More broadly, the ALCAM-Bulu-MultimodalDataset exemplifies an approach to African language resources that highlights socially embedded linguistic practice, phonological precision, and community-based documentation.

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

By downloading this dataset, you agree:
- To use it for research and scientific use only
- That you will not re-host or re-share this dataset

Forbidden Usage

You agree not to use the data for: determining the identity of the speaker in the dataset; attempting to clone the voice or train models that imitate the speaker in this dataset; Generative AI; reproduction; duplication; modification; augmentation; copying; distribution; transmission; display; sale; transfer; publication or creation of derivative works without the explicit permission of the legal owner of the dataset.

Processes

Intended Use

(a) Speech-related tasks:
- Automatic speech recognition (ASR): The audio–text alignment enables evaluation of speech recognition models for Bulu. Note that the sentences are transcribed in IPA rather than in one of the existing orthographic standards for Bulu (the Protestant orthography, the Catholic orthography inspired by François Pichon's 1950 model, or the General Alphabet of Cameroon's Languages). Users wishing to use the dataset for ASR with a standard orthographic representation will need to apply an IPA-to-orthography mapping.
- Text-to-speech (TTS): The dataset contains clean word- and sentence-level audio pairs and can be used for training or evaluating speech synthesis models for Bulu. The orthographic caveat given for ASR applies here as well.
- Speech–text alignment / forced alignment benchmarking: The structured pairing of audio and IPA transcriptions provides a useful ground truth for evaluating phoneme- or word-level aligners, particularly those targeting Bantu languages with complex tonal and morphological systems.

(b) Linguistic and lexicographic tasks:
- Morphological analysis: The systematic encoding of singular–plural noun pairs makes the dataset particularly valuable for computational morphology, noun class modelling, and Bantu morphological analysis.
- Lexicon building: The dataset contributes to the development of lexical resources (dictionaries, morphological analysers, part-of-speech taggers) for Bulu.
- Tonal analysis: The explicit IPA tone marking across all entries supports phonological research on the Bulu tone system, including studies of lexical tone, grammatical tone, and contour tones.
- Contrastive and typological studies: Because Bulu is classified within the Beti-Fang language cluster, the dataset can be used for comparative studies with related varieties such as Ewondo, Fang, Beti, and Yezoum.

(c) Language revitalisation and education:
- The dataset supports language revitalisation efforts for Bulu and can contribute to the development of community-oriented language learning tools, pedagogical resources for teacher trainers, and multilingual educational applications in Cameroon and the broader Beti-Fang speech community.

Metadata

Language

Bulu (also written Boulou) is a Bantu language spoken in southern Cameroon. The Administrative Atlas of Cameroon's Languages (Breton and Bikia Fohtung, 1991) classifies Bulu as part of the Beti-Fang dialect cluster. However, the Bulu-speaking community generally contests this classification, asserting that Bulu constitutes a distinct language rather than a mere dialect of an overarching Beti-Fang macro-linguistic entity. This position is reinforced by the existence of a well-established, autonomous literary tradition in Bulu dating back to at least the early 20th century, most notably represented by Jean Louis Njemba Medu's Nnanga Kon (1939), a landmark of Bulu prose literature.

Bulu is spoken primarily in four administrative divisions of Cameroon's South Region: the Ntem Division, the Mvila Division, the Dja-and-Lobo Division, and the Océan Division. Speakers are found in the major towns of these divisions, including Ebolowa (headquarters of the Mvila Division), Sangmelima (headquarters of the Dja-and-Lobo Division), Ambam, and Kribi.

Variants

The Bulu-speaking community generally identifies with two broad speech areas, sometimes described as dialects:

The Ebolowa speech area, centred on Ebolowa, the administrative and commercial hub of the Mvila Division. This variety is widely regarded as a reference form of Bulu and has been the basis for much of the written production and missionary linguistic work on the language.
The Sangmelima speech area, centred on Sangmelima, headquarters of the Dja-and-Lobo Division. This variety shares the core phonological and grammatical system of the Ebolowa speech area while exhibiting some lexical and phonetic differences.

The two varieties are mutually intelligible and share the same broad linguistic structure. The variety represented in this dataset was elicited through a structured questionnaire targeting core lexical and grammatical categories, and is consistent with the standard Bulu used in existing written resources.

Writing System

The writing system used for the transcription of Bulu in this dataset is the International Phonetic Alphabet (IPA), as reflected in the lexical entries and sentence-level examples in the datasheet. The phonological inventory described below is derived directly from the attested forms in the dataset.

1. Vowels

The vowel system attested in the dataset is as follows:

i, e, ɛ, a, ɔ, o, u, ə

These vowels occur in both short and long forms, and carry distinct tone markings in lexical items and running text (e.g. mə́ndíp 'water', àyàp 'bird', ènə̀ 'it is', ə́lí 'child').

A nasalised vowel is also attested in the dataset, indicated by a subscript tilde: ə̰ (e.g. ə̰̀nkɔ́t, ə̰̀dzálán).

Vowel length is indicated in the dataset by a colon following a vowel symbol (e.g. é:nə̀, ɲɔ̌:ŋ).

The diphthong bwì is also attested (e.g. àbwì).

2. Consonants

The consonant inventory reflected in the dataset includes the following simple, prenasalised, palatal, and affricate consonants:

b, d, dz, f, g, k, kp, l, m, mb, mf, mv, n, nd, ng, ngb, nj, ŋ, p, r, s, t, tʃ, v, w, y, z, ɲ

Palatalisation is marked with a superscript ʸ (e.g. bìtʃʸè).

Within this inventory, the following groups of complex segments can be distinguished:

Prenasalised stops and fricatives: mb, nd, ng, ngb, nj, mf, mv
Doubly articulated stop: kp
Affricate: tʃ (and its palatalised variant tʃʸ)
Palatal nasal: ɲ
Velar nasal: ŋ

An apostrophe is used in some orthographic representations of Bulu (e.g. in the Protestant and Catholic orthographies) to mark glottal stops or syllable boundaries, as seen in the existing Bulu literary tradition (e.g. Njemba Medu's Nnanga Kon). However, the IPA-based transcription used in this dataset does not use the apostrophe for this purpose.

3. Tone system

The dataset encodes lexical and grammatical contrastive tones, marked directly on vowels and on sonorant consonants. The following tonal categories are attested in the dataset:

High tone (H): á, é, ɛ́, í, ó, ɔ́, ú, ə́
Low tone (L): à, è, ɛ̀, ì, ò, ɔ̀, ù, ə̀
Rising contour tone (LH): ǎ, ě, ǐ, ǒ, ɔ̌, ǔ, ə̌
Falling contour tone (HL): â, ê, î, ô, ɔ̂, û, ə̂

These tonal contrasts are phonemically significant and grammatically functional in Bulu, distinguishing both lexical items and grammatical categories including noun class agreement.
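
The four tone categories listed above can be read off a transcription programmatically. The following is a minimal sketch, assuming the tone marks are encoded as the standard Unicode combining acute, grave, caron and circumflex (precomposed letters are decomposed first). The function name `tone_pattern` is illustrative and not part of the dataset.

```python
import unicodedata

# Combining diacritics assumed to encode the four tone categories listed above.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent: high tone
    "\u0300": "L",   # combining grave accent: low tone
    "\u030C": "LH",  # combining caron: rising contour
    "\u0302": "HL",  # combining circumflex: falling contour
}

def tone_pattern(form: str) -> list[str]:
    """Return the tone labels found in an IPA form, in left-to-right order."""
    # NFD splits precomposed letters (e.g. U+00E0) into base + combining mark.
    decomposed = unicodedata.normalize("NFD", form)
    # Any diacritic outside TONE_MARKS (nasalisation, other marks) is skipped.
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]

print(tone_pattern("àyàp"))    # ['L', 'L']
print(tone_pattern("mə́ndíp"))  # ['H', 'H']
print(tone_pattern("zə̌n"))     # ['LH']
```

Because the function works on the decomposed string, it gives the same result whether a file stores tone-marked vowels in composed or decomposed form.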

A note on orthographic conventions: the existing standard orthographies for Bulu — the Protestant orthography and the Catholic orthography (inspired by the model laid out by François Pichon in 1950), as well as the General Alphabet of Cameroon's Languages — do not systematically mark tone. The IPA-based transcription employed in this dataset represents a more phonologically explicit encoding intended for linguistic and speech technology applications.
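
Since the standard orthographies do not systematically mark tone, a toneless form of each transcription can be a useful intermediate key, for example when comparing entries with existing written resources. The sketch below only removes the four tone diacritics; it is not an IPA-to-orthography converter, and symbols such as ɲ, ŋ or tʃ are left unchanged. The name `strip_tone` is hypothetical.

```python
import unicodedata

# Assumed tone diacritics: acute, grave, caron, circumflex (combining forms).
TONE_DIACRITICS = {"\u0301", "\u0300", "\u030C", "\u0302"}

def strip_tone(form: str) -> str:
    """Remove tone diacritics while keeping every other symbol and mark."""
    decomposed = unicodedata.normalize("NFD", form)
    kept = "".join(ch for ch in decomposed if ch not in TONE_DIACRITICS)
    # Recompose so the result is in a single, predictable normal form.
    return unicodedata.normalize("NFC", kept)

print(strip_tone("mə́ndíp"))  # məndip
print(strip_tone("àyàp"))    # ayap
```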

4. Noun class system

The dataset reflects Bulu's Bantu noun class system, visible in the morphological alternations between singular and plural forms of nouns. These are clearly encoded in the paired entries of the datasheet, for example:

àyàp (singular) / mə̀yàp (plural) — 'bird(s)'
ènə̀ (singular prefix class) / mìnɛ́n (plural prefix class) — cf. ànɛ́n / mə̀nɛ́n
èdɔ́k (singular) / bìdɔ́k (plural) — e.g. 'bone(s)'
ètùn (singular) / bìtùn (plural)

This morphological patterning is central to the Bantu structure of Bulu and makes the dataset particularly valuable for morphological modelling and noun class analysis.
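
As a rough illustration of how the singular/plural pairs above could feed noun class modelling, the sketch below separates the alternating prefixes from the shared stem by taking the longest common suffix of the two forms. This is a surface string heuristic under the assumption that the stem is identical in both forms (including its tones); it is not a morphological analysis, and the function names are illustrative.

```python
import unicodedata

def clusters(form: str) -> list[str]:
    """Split a form into base characters, each with its combining marks attached."""
    out: list[str] = []
    for ch in unicodedata.normalize("NFD", form):
        if unicodedata.combining(ch) and out:
            out[-1] += ch  # attach the diacritic to the preceding base character
        else:
            out.append(ch)
    return out

def split_prefixes(singular: str, plural: str) -> tuple[str, str, str]:
    """Return (singular prefix, plural prefix, shared stem)."""
    sg, pl = clusters(singular), clusters(plural)
    n = 0  # number of clusters shared at the right edge
    while n < min(len(sg), len(pl)) and sg[-1 - n] == pl[-1 - n]:
        n += 1

    def join(parts: list[str]) -> str:
        return unicodedata.normalize("NFC", "".join(parts))

    return join(sg[:len(sg) - n]), join(pl[:len(pl) - n]), join(sg[len(sg) - n:])

print(split_prefixes("àyàp", "mə̀yàp"))   # ('à', 'mə̀', 'yàp')
print(split_prefixes("èdɔ́k", "bìdɔ́k"))   # ('è', 'bì', 'dɔ́k')
```

Comparing whole clusters rather than single code points keeps a vowel and its tone mark together, so a prefix boundary never falls between a letter and its diacritic.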

Source

The dataset was collected through a structured questionnaire designed to elicit the core lexicon and grammatical properties of the Bulu language. This was carried out as part of the Atlas Linguistique du Cameroun (ALCAM) project. The questionnaire targets basic vocabulary domains (body parts, animals, natural elements, actions, spatial and temporal notions) and elicits both singular and plural forms of nouns, as well as example sentences illustrating usage in context. The audio recordings were made by a native Bulu speaker.

Domain

The dataset represents a linguistic questionnaire designed to elicit the basic lexicon and grammatical information of the Bulu language. The content reflects elicited speech rather than spontaneous or naturalistic discourse, and covers core vocabulary domains including body parts, animals, natural elements, objects, actions, and basic grammatical constructions. The dataset also includes paired singular–plural noun forms, which are central to the morphological structure of Bulu as a Bantu language.

Size

The dataset comprises three components:

1. Datasheet (ALCAM_dataset_Bulu.tsv): 389 rows across 19 columns, covering 221 unique lexical entries. Of the 389 rows, 271 contain example sentences in Bulu with French equivalents, and 260 include morpheme-level glosses. Parts of speech represented include nouns (n), verbs (v), adjectives (adj), numerals (num), pronouns (prn), prepositions (prep), and adverbs (adv).

2. Audio recordings: 384 audio clips in MP3 format, distributed across 2 folders:

Folder 1 (Bulu_tts_dataset_384clips_2379s_20260406-1522_part1of2): 200 audio files, 11 minutes 18 seconds
Folder 2 (Bulu_tts_dataset_384clips_2379s_20260406-1522_part2of2): 184 audio files, 12 minutes 29 seconds

Total: 384 audio files, 23 minutes 47 seconds

Individual audio clips are short, typically ranging from 1 to 9 seconds in duration, consistent with the elicited word- and sentence-level prompts of the questionnaire.

3. Audio–text mapping file: a tab-separated file with 183 lines and 4 columns, linking each audio file to its Bulu transcription.

A detailed breakdown of durations per file is provided in the accompanying duration report.
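
The per-folder figures given in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts and durations stated above and also derives the mean clip length.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds duration to seconds."""
    return minutes * 60 + seconds

folder_1 = to_seconds(11, 18)  # 200 audio files
folder_2 = to_seconds(12, 29)  # 184 audio files

total_seconds = folder_1 + folder_2
total_clips = 200 + 184

print(total_clips)                            # 384
print(divmod(total_seconds, 60))              # (23, 47)
print(round(total_seconds / total_clips, 2))  # 3.72
```

The mean of roughly 3.7 seconds per clip is in line with the stated range of 1 to 9 seconds.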

Structure

The dataset comprises three components:

1. Datasheet (ALCAM_dataset_Bulu.tsv): a tab-separated file with 389 rows and 19 columns. Each row corresponds to a lexical entry (singular or plural form) or an associated example sentence. The columns are:

OrigID: original number of the lexical entry in the paper questionnaire
EditID: any modification of OrigID
FrenchRef: the reference entry as originally provided in French
FrenchComm: original researcher comments about the reference entry
French: the lexical entry in French
Note: researcher notes on the lexical entry
POS: part of speech (n = noun, v = verb, adj = adjective, num = numeral, prn = pronoun, prep = preposition, adv = adverb)
Class: noun class (where applicable)
Morf: morphological attribute (sg = singular, pl = plural)
Var: variant information (where applicable)
Word: the lexical entry in Bulu (IPA transcription)
CrossRef: cross-reference to related lexical entry numbers
FrenchEx: example sentence in French
LangEx: example sentence in Bulu (IPA transcription)
LangExEdit: manually edited version of LangEx
LangPars: word-for-word morpheme parsing in Bulu
LangParsEdit: edited version of LangPars
FrenchPars: French equivalent of LangParsEdit (word-for-word gloss)
FrenchParsEdit: edited version of FrenchPars

2. Audio recordings: MP3 files, distributed across 2 folders (see Size section above).

3. Audio–text mapping file: a tab-separated file with 183 lines and 4 columns, following the format:

audio_filename.mp3 key sentence attempts

where:

audio_filename: the name of the MP3 audio file
key: a unique identifier (matching the filename without extension)
sentence: the Bulu transcription in IPA, which may contain a lexical entry alone (e.g. àyàp), or a lexical entry followed by a separator and an example sentence (e.g. àyàp ; àyàp élé ánə̀ ə́lí)
attempts: the number of recording attempts made for that entry

The dataset is designed for linguistic analysis and NLP/speech technology pipelines requiring paired audio-text data.
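
A record of the mapping file can be parsed along the lines described above. This is a minimal sketch under several assumptions: fields are separated by tabs, the file has no header row, the separator between lexical entry and example sentence is a semicolon, and `attempts` is an integer. The `MappingRow` class and the sample line (including its `attempts` value of 1) are illustrative and not taken from the dataset.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MappingRow:
    audio_filename: str
    key: str
    headword: str            # the lexical entry
    example: Optional[str]   # the example sentence, if the field contains one
    attempts: int

def parse_mapping_line(line: str) -> MappingRow:
    """Parse one tab-separated mapping record into its four fields."""
    audio_filename, key, sentence, attempts = line.rstrip("\n").split("\t")
    # partition() leaves `sep` empty when the field holds a lexical entry alone.
    headword, sep, example = sentence.partition(";")
    return MappingRow(
        audio_filename=audio_filename,
        key=key,
        headword=headword.strip(),
        example=example.strip() if sep else None,
        attempts=int(attempts),
    )

# Hypothetical record: the attempts value is assumed for illustration.
line = "2dedd6ff377c84f36ea7df0615a14b2a.mp3\t2dedd6ff377c84f36ea7df0615a14b2a\tàyàp ; àyàp élé ánə̀ ə́lí\t1\n"
row = parse_mapping_line(line)
print(row.headword)
print(row.example)
```

Keeping the lexical entry and the example sentence in separate fields makes it easy to build word-level and sentence-level subsets from the same file.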

Sample

Datasheet sample (selected rows from ALCAM_dataset_Bulu.tsv):

OrigID	French	POS	Morf	Word	FrenchEx	LangEx
x1	bouche	n	sg	àɲù	elle a une petite bouche	à bìlí mɔ̄nə́ áɲù
x1	bouches	n	pl	mə̀ɲù	—	à bìlí mɔ̄ná áɲù
x2	oeil	n	sg	dís	les yeux voient	bə́mbə́ bə́ tùk mîs
x3	tête	n	sg	ə̰̀lō	la tête de l'oiseau est grande	àmbə́ ábìlí bə̀tà ǹló à àyàp tʃíŋ
x5	oiseau	n	sg	àyàp	l'enfant a un oiseau	àyàp élé ánə̀ ə́lí
x5	oiseaux	n	pl	mə̀yàp	—	—
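
Rows such as these can be loaded with Python's standard `csv` module and grouped into singular/plural pairs. The sketch below uses a reduced copy of the sample rows above and assumes, as the sample suggests, that the singular and plural forms of an entry share the same OrigID. The function name `pair_by_number` is illustrative.

```python
import csv
import io

# Reduced copy of the sample rows above (a subset of the 19 columns).
SAMPLE = (
    "OrigID\tFrench\tPOS\tMorf\tWord\n"
    "x1\tbouche\tn\tsg\tàɲù\n"
    "x1\tbouches\tn\tpl\tmə̀ɲù\n"
    "x5\toiseau\tn\tsg\tàyàp\n"
    "x5\toiseaux\tn\tpl\tmə̀yàp\n"
)

def pair_by_number(tsv_text: str) -> dict[str, dict[str, str]]:
    """Group noun rows by OrigID, mapping 'sg' and 'pl' to the Bulu form."""
    pairs: dict[str, dict[str, str]] = {}
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        if row["POS"] == "n" and row["Morf"] in ("sg", "pl"):
            # A later row with the same OrigID and Morf overwrites an earlier one.
            pairs.setdefault(row["OrigID"], {})[row["Morf"]] = row["Word"]
    return pairs

pairs = pair_by_number(SAMPLE)
for orig_id, forms in pairs.items():
    print(orig_id, forms.get("sg"), forms.get("pl"))
```

For the full file, the same function could be given the contents of ALCAM_dataset_Bulu.tsv read with UTF-8 encoding; `quoting=csv.QUOTE_NONE` keeps any quotation marks in the transcriptions as literal characters.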

Audio–mapping sample (selected rows from the mapping file):

audio_filename	sentence
4513ffb01a9f370de66aed60d0bb1cb8.mp3	èdɔ́k ; èbé é:nə̀ èdɔ́k
cba3f314894a2c8f64f5bf0f6008b02d.mp3	bìdɔ́k
5f9b1e2b0f3c6df2bc3df50069777973.mp3	ànɛ́n ; à_nɛ́n ákɔ́k ánə̀ ə́lí
5b86c54a9224ac2322531f6f0e31a9cf.mp3	mə̀nɛ́n
2dedd6ff377c84f36ea7df0615a14b2a.mp3	àyàp ; àyàp élé ánə̀ ə́lí
988e0c733e86febd61831aac13423a5e.mp3	mə̀yàp
02e477dffeaaed885cecd60a44945028.mp3	tʃótʃóáé ; mà ɲɔ̀ŋ mɔ̀nə́ ákɔ́k
de54f97eacbbbce9c93145f2d374dcc0.mp3	ndàm ; zə̌n ènə̀ ndàm
25ff7ef05b209e23cda58ee88c6fc048.mp3	ə̰̀nkɔ́t ; ə́ zə̌n ə́ɲū ènə̀ ə̰̀nkɔ́t
fa9c810fafa99d57f6292aea7c048c80.mp3	ə̰̀dzálán ; èsɔ́ énə̀ ə̰̀dzálán à mə́ndíp
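
To connect the mapping file with the datasheet, the lexical entry at the start of each `sentence` value can serve as a lookup key for the `Word` column. The sketch below builds such an index from three of the sample rows above. It assumes a semicolon separator and normalises keys to NFC, since the two files are not guaranteed to store diacritics in the same Unicode form. The names `headword_key` and `audio_index` are illustrative.

```python
import unicodedata

# Three rows copied from the mapping sample above: (audio_filename, sentence).
MAPPING_SAMPLE = [
    ("2dedd6ff377c84f36ea7df0615a14b2a.mp3", "àyàp ; àyàp élé ánə̀ ə́lí"),
    ("988e0c733e86febd61831aac13423a5e.mp3", "mə̀yàp"),
    ("cba3f314894a2c8f64f5bf0f6008b02d.mp3", "bìdɔ́k"),
]

def headword_key(sentence: str) -> str:
    """Return the lexical entry before the separator, in NFC form."""
    head = sentence.split(";", 1)[0].strip()
    return unicodedata.normalize("NFC", head)

def audio_index(rows: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each lexical entry to the audio files whose text starts with it."""
    index: dict[str, list[str]] = {}
    for filename, sentence in rows:
        index.setdefault(headword_key(sentence), []).append(filename)
    return index

index = audio_index(MAPPING_SAMPLE)
print(index[headword_key("mə̀yàp")])  # ['988e0c733e86febd61831aac13423a5e.mp3']
```

Values from the datasheet's `Word` column should be passed through the same `headword_key` function before lookup, so that both sides of the join use identical normalisation.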