Basaa-ALCAM-MultimodalDataset

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

The dataset was created to promote NLP research and voice technology development in Basaa. All users are welcome to download it, but they are requested to contact the dataset owner by email before use, to discuss how their work can support the development of the Basaa language.

Forbidden Usage

Generative AI, reproduction, duplication, modification, augmentation, copying, distribution, transmission, display, sale, transfer, publication or creation of derivative works without the explicit permission of the point of contact acting on behalf of the legal owner of the dataset.

Processes

Intended Use

(a) Speech-related tasks:

- Automatic speech recognition (ASR): Audio–text alignment allows the evaluation of speech recognition models for Basaa. However, it should be noted that the read sentences are transcribed phonetically. There are competing orthographic standards for Basaa; the General Alphabet of Cameroon's Languages is the one closest to phonetic transcription. The other two most widely used orthographic standards are those of the Protestant and Catholic missionaries.
- Text-to-speech (TTS): As the dataset contains clean sentence–audio pairs, it can also be used to evaluate speech synthesis or text-to-speech models. Here again, it should be noted that the sentences are written in the IPA and not in the General Alphabet of Cameroon's Languages, the Protestant alphabet or the Catholic alphabet.
- Speech–text alignment/forced alignment benchmarking: Fine-grained, word-level segmentation provides ideal ground truth for evaluating phoneme- or word-level aligners.

(b) Translation and multilingual tasks:

- Machine translation (Basaa ↔ French): The sentence-level alignment between Basaa and French makes the dataset a parallel corpus for evaluating translation models, subject to the limitations of the phonetic transcription used.
- Speech translation (speech-to-text):

(c) Linguistic and lexicographic tasks:

- Morphological analysis/glossed corpus studies: The morpheme-level glosses are valuable for computational morphology, interlinear text modelling (ILTs) and grammar induction tasks.
- Lexicon and part-of-speech tagging: These are useful for building linguistic resources such as dictionaries, morphological analysers or taggers for Basaa.

Metadata

Language

Basaa is a narrow Bantu language spoken across a geographical area spanning three administrative regions in Cameroon: the Centre, Littoral and South regions. It is estimated that there are currently around 600,000–700,000 speakers. This figure includes different varieties, as well as diasporic populations who identify as Basaa speakers.

The vitality of the Basaa language is stable (Ethnologue online). However, intergenerational transmission of Basaa is increasingly threatened among parents aged 50 and under, particularly in urban areas.

Although Basaa is taught in schools, this does not significantly impact the vitality of the language, mainly due to the current pedagogical approach, which relies on rule-based and descriptivist teaching methods.

Variants

The glossonym 'Basaa' is a generic term that encompasses a range of varieties, the speakers of which may identify with the 'Basaa' label to varying degrees, depending on a complex set of geographical, social, political, situational and pragmatic factors. Whether a language variant is considered Basaa depends greatly on the perspective of the person 'telling the story'. Some of the most commonly acknowledged varieties of Basaa include:

Mbene
Bikok
Babimbi
Basaa ba Omeng
Basaa ba Yabasi
Basaa ba Duala
Ndog-Bikim

Other varieties, such as Ndonga, Mbaa (also known as Mbay-Bati) and Hijuk, may also be classified as Basaa. However, as previously mentioned, not everyone agrees on this classification.

Writing System

The writing system used to transcribe Basaa in this dataset is the International Phonetic Alphabet (IPA).

1. Vowels

The vowel system is as follows: i, e, ɛ, a, ɔ, o, u

2. Consonants

Simple consonants: p, b, ɓ, c, d, g, h, j, k, l, m, mb, n, nd, ŋg, ŋgw, ny, ŋ, s, t, y, w.

3. Tone system

High tone (H): á, é, í, ó, ú, ɛ́, ɔ́
Low tone (L): à, è, ì, ò, ù, ɛ̀, ɔ̀
Falling tone (HL): â, ê, î, ô, û, ɛ̂, ɔ̂
Mid tone (mostly the result of downstep and upstep): ā, ē, ī, ō, ū, ɛ̄, ɔ̄
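
Because tone is written with diacritics, the tone pattern of a transcribed sentence can be read off the text directly. The following is a minimal sketch, assuming the transcriptions use the standard Unicode combining accents (acute, grave, circumflex, macron) for the four tones listed above; the function name tone_sequence is illustrative and not part of the dataset.

```python
import unicodedata

# Combining diacritics assumed to encode the tones listed above.
TONE_MARKS = {
    "\u0301": "H",   # combining acute accent: high tone
    "\u0300": "L",   # combining grave accent: low tone
    "\u0302": "HL",  # combining circumflex accent: falling tone
    "\u0304": "M",   # combining macron: mid tone
}


def tone_sequence(text):
    """Return the tone labels found in `text`, in order of appearance."""
    # NFD splits precomposed letters (e.g. "é") into base letter + combining
    # mark, so precomposed and decomposed input are handled the same way.
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]


# Example sentence taken from the sample row of the datasheet.
print(tone_sequence("à gwèé nyɔ̀ ǹtítígí"))
```

The sketch counts a tone mark on any base character, so a tone-bearing nasal such as ǹ is included alongside the vowels. Unmarked vowels produce no label.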

Source

The dataset originates from a questionnaire designed to elicit general information about the Basaa (BAS) lexicon and grammar, focusing on how a particular subgroup known as the Mbene uses the language. The questionnaire formed part of a nationwide research project known as the Atlas Linguistique du Cameroun (ALCAM), which was part of a larger programme called the Atlas Linguistique d'Afrique Centrale (ALAC), funded by the Agence de Coopération Scientifique et Technique (ACCT) of the French government. The project was carried out by the Centre de Recherche et de Documentation sur les Traditions et les Langues Africaines (CERDOTOLA) in partnership with the Direction Générale de la Recherche Scientifique et Technique (DGRST) of Cameroon's Ministry of Scientific and Technical Research in the late 1970s and early to mid-1980s.

The original paper questionnaire, from which the information in this dataset was extracted, was created by Henri Marcel Bôt Ba Njock, who was a professor of linguistics, the director of the Centre de Recherches Africanistes and the head of the department of African languages and cultures at the time. The questionnaire was developed without the use of an informant, as the researcher was a native speaker of Basaa, specifically the Mbene dialect spoken in the Makak sub-division of the Nyong-and-Kelle division in Cameroon's Centre Province (now the Centre Region).

Domain

The dataset represents a linguistic questionnaire designed to elicit the basic lexicon and grammatical information.

Size

The total size is 15.5 MB.

Structure

The dataset comprises three items: a datasheet, a set of audio recordings and a sentence-to-audio mapping file. The datasheet contains 350 lines and 16 columns (A to P). The audio recordings consist of 336 voice clips read by a single male native speaker of Basaa, aged around 50. The sentence-to-audio mapping file contains 336 lines and two columns.
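
Since the mapping file and the audio recordings should describe the same 336 clips, a quick consistency check can be useful before training or evaluation. The sketch below is illustrative only: it assumes the two mapping columns are (sentence, clip name) in that order, and the clip names in the example are hypothetical, as the actual file names and column order are not stated here.

```python
from collections import Counter


def check_mapping(pairs, clip_names):
    """Compare sentence-to-audio pairs against the available clip names.

    pairs: list of (sentence, clip_name) tuples (assumed column order).
    clip_names: iterable of audio file names actually present.
    """
    mapped = [clip for _, clip in pairs]
    available = set(clip_names)
    counts = Counter(mapped)
    return {
        # Clips referenced by the mapping but absent from the recordings.
        "missing_audio": sorted(set(mapped) - available),
        # Recordings that no sentence points to.
        "unmapped_audio": sorted(available - set(mapped)),
        # Clips referenced by more than one mapping line.
        "duplicate_clips": sorted(c for c, n in counts.items() if n > 1),
    }


# Hypothetical example data; real names come from the dataset files.
example_pairs = [("sentence 1", "clip_001.wav"), ("sentence 2", "clip_002.wav")]
example_clips = ["clip_001.wav", "clip_002.wav"]
print(check_mapping(example_pairs, example_clips))
```

If all three lists come back empty and the mapping has 336 entries, the mapping and the recordings agree with the counts given above.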

Sample

OrigID	EditID	FrenchRef	FrenchComm	French	Note	POS	Class	Morf	Var	Word	CrossRef	FrenchEx	LangEx	LangPars	FrenchPars
x1	_	bouche	_	bouche		n	3	sg	_	nyɔ̀	142	elle a une petite bouche	à gwèé nyɔ̀ ǹtítígí	a \| gweé \| nyɔ \| ǹtítígí	il/elle \| avoir-présent \| bouche-singulier \| petite
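
The sample row can be read as tab-separated text, and the LangPars and FrenchPars columns can be paired segment by segment to recover the interlinear glosses. The following is a minimal sketch that embeds the sample above as a string. It assumes the datasheet is available as tab-separated text and that segments are separated by "|" (optionally preceded by a backslash, as in the sample); the helper name split_gloss is illustrative.

```python
import csv
import io

HEADER = ["OrigID", "EditID", "FrenchRef", "FrenchComm", "French", "Note",
          "POS", "Class", "Morf", "Var", "Word", "CrossRef", "FrenchEx",
          "LangEx", "LangPars", "FrenchPars"]
ROW = ["x1", "_", "bouche", "_", "bouche", "", "n", "3", "sg", "_", "nyɔ̀",
       "142", "elle a une petite bouche", "à gwèé nyɔ̀ ǹtítígí",
       "a \\| gweé \\| nyɔ \\| ǹtítígí",
       "il/elle \\| avoir-présent \\| bouche-singulier \\| petite"]
SAMPLE = "\t".join(HEADER) + "\n" + "\t".join(ROW) + "\n"


def split_gloss(cell):
    """Split a pipe-separated cell into segments, dropping any backslash
    left in front of the pipe and surrounding whitespace."""
    return [part.strip().rstrip("\\").strip() for part in cell.split("|")]


# QUOTE_NONE keeps quotation marks and backslashes in the cells untouched.
reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
row = next(reader)

# Pair each Basaa segment with its French gloss.
pairs = list(zip(split_gloss(row["LangPars"]), split_gloss(row["FrenchPars"])))
for basaa, gloss in pairs:
    print(f"{basaa}\t{gloss}")
```

Pairing with zip silently truncates if the two columns have different numbers of segments, so a length comparison is advisable before relying on the alignment for a full datasheet.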