DhoNam: Dholuo Speech dataset | Mozilla Data Collective

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

You agree that you will not re-host or re-share this dataset.

Forbidden Usage

You agree not to attempt to determine the identity of speakers in this dataset. Any attempt to clone the voices of speakers in this dataset, or to train models that imitate them, is forbidden.

Processes

Ethical Review

How informed consent was obtained: All contributors were given information about the data collection process, the intended use of the data, and the license under which the data was to be released. Participants had to confirm that they understood this information and consented to being contributors.

Intended Use

This dataset is intended for use in creating automatic speech recognition systems.

Metadata

DhoNam Dholuo Speech dataset – Technical Datasheet

Overview

The DhoNam: Dholuo Speech dataset is a speech corpus designed to advance Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of Kenya’s major indigenous languages. Spoken by 4.2–5 million people in Kenya and parts of Tanzania, Dholuo is one of the most widely spoken Nilotic languages.

This dataset contains native-speaker audio recordings collected through a platform where users read aloud a displayed sentence.

This dataset was created with support from the Mozilla Foundation and GIZ FAIR Forward.

Language Description

Name: Dholuo (also Luo)
Family: Nilo-Saharan → Nilotic → Western Nilotic → Luo languages
ISO 639-2 Code: luo
Dialects: Milambo and Nyanduat

Where Dholuo is Spoken

Predominantly in western Kenya:

Kisumu
Siaya
Homa Bay
Migori
Parts of Kericho, Nandi, and Nakuru
Smaller communities in northern Tanzania

About the Language

Tonal, but tone is not marked in writing
Uses a Latin-based writing system
Rich oral tradition: storytelling, folklore, and songs

Source Type

Read speech (user reads a sentence displayed on the platform)

⚠️ Both audio recordings and texts are stored.

Contributors

Native Dholuo speakers from Dholuo-speaking counties
Adults aged 19+ with informed consent
Balanced gender distribution where possible

🕒 Collection Timeframe

Collected between October and November 2025, as part of the Dholuo Voice Data Collection Project

🎧 Recording Conditions

Indoor environments with no background noise
Recorded via smartphones through a web interface

Domains Represented

General
Agriculture
Technology and robotics
Healthcare
News and current affairs

These domains reflect real spoken Dholuo usage.

📊 Dataset Size

Total duration: 184,838.28 sec (3080.64 min, 51.34 hr)
Number of Speakers: 59
Number of Reviewers: 7
Total audio files: 26,091
Average clip length: 7.08 sec
Minimum clip length: 1.72 sec
Maximum clip length: 62.64 sec
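
The derived figures above (minutes, hours, and average clip length) follow directly from the total duration and the file count. A minimal sketch that recomputes them, assuming only the two raw totals listed in this section; the function name `summarize` is illustrative and not part of the dataset tooling:

```python
# Raw totals copied from the Dataset Size list in this datasheet.
TOTAL_SECONDS = 184_838.28
TOTAL_CLIPS = 26_091


def summarize(total_seconds: float, total_clips: int) -> dict:
    """Derive minutes, hours, and mean clip length from the raw totals."""
    return {
        "minutes": round(total_seconds / 60, 2),
        "hours": round(total_seconds / 3600, 2),
        "mean_clip_seconds": round(total_seconds / total_clips, 2),
    }


summary = summarize(TOTAL_SECONDS, TOTAL_CLIPS)
print(summary)
```

The same check can be rerun against a downloaded copy to confirm that no clips are missing.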

Gender Distribution

Female: 34 speakers (57.6%), 15,279 clips (58.6%)
Male: 25 speakers (42.4%), 10,812 clips (41.4%)

Age Distribution

19–29: 28 speakers (47.5%), 13,332 clips (51.1%)
30–39: 17 speakers (28.8%), 6,469 clips (24.8%)
40–49: 10 speakers (16.9%), 4,781 clips (18.3%)
50–59: 4 speakers (6.8%), 1,509 clips (5.8%)
60+: 0 speakers (0.0%), 0 clips (0.0%)

Dialect Distribution

Milambo: 36 speakers (61.0%), 16,697 clips (64.0%)
Nyanduat: 23 speakers (39.0%), 9,394 clips (36.0%)
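
Each distribution pairs a speaker count with a clip count, and the percentages are shares of the 59 speakers and 26,091 audio files respectively. A small sketch of that arithmetic for the dialect split, using only the counts listed above; the helper name `shares` is illustrative:

```python
def shares(counts: dict) -> dict:
    """Return each category's percentage of the total, to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Counts copied from the Dialect Distribution list in this datasheet.
speakers_by_dialect = {"Milambo": 36, "Nyanduat": 23}
clips_by_dialect = {"Milambo": 16_697, "Nyanduat": 9_394}

speaker_shares = shares(speakers_by_dialect)
clip_shares = shares(clips_by_dialect)
print(speaker_shares)
print(clip_shares)
```

The same helper applies to the gender and age tables. Comparing speaker shares with clip shares shows where a group contributed more or fewer clips per speaker than average, which may matter when building evaluation splits.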

Contributor Metadata

Field	Description
contributor_id	Speaker code
Contributor name	Contributor’s full name
ID_Number	Unique contributor ID
gender	M / F
age_group	18–29, 30–39, etc
location	Kisumu / Siaya etc
Constituency	Kisumu West etc
Education level	Tertiary / Graduate
Employment status	Student / Self-Employed / Employed
dialect	Milambo / Nyanduat
consent	Confirmed informed consent
License	NOODL Description

Data Structure

Field	Description
id	Unique file ID
speaker_id	Speaker code
gender	M / F
age_group	18–29, 30–39, etc
county	Kisumu, etc
duration	Audio length in sec
prompt_id	ID of the sentence shown
domain	General / Agriculture / Technology and robotics / Healthcare / News and current affairs
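
The per-clip fields above can be modelled as a typed record. This is a minimal sketch under stated assumptions: the datasheet does not say how the metadata is stored, so the tab-separated layout, the header names matching the field names, and every sample value below are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipRecord:
    """One row of per-clip metadata, mirroring the Data Structure table."""
    id: str
    speaker_id: str
    gender: str
    age_group: str
    county: str
    duration: float  # audio length in seconds
    prompt_id: str
    domain: str


def read_clips(handle) -> list:
    """Parse tab-separated rows (assumed format) into ClipRecord objects."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        ClipRecord(
            id=row["id"],
            speaker_id=row["speaker_id"],
            gender=row["gender"],
            age_group=row["age_group"],
            county=row["county"],
            duration=float(row["duration"]),
            prompt_id=row["prompt_id"],
            domain=row["domain"],
        )
        for row in reader
    ]


# Hypothetical sample rows; IDs and values are made up for illustration.
SAMPLE = (
    "id\tspeaker_id\tgender\tage_group\tcounty\tduration\tprompt_id\tdomain\n"
    "clip_0001\tspk_01\tF\t30-39\tKisumu\t7.5\tp_0001\tHealthcare\n"
    "clip_0002\tspk_02\tM\t40-49\tSiaya\t4.25\tp_0002\tAgriculture\n"
)

clips = read_clips(io.StringIO(SAMPLE))
total_seconds = sum(clip.duration for clip in clips)
print(len(clips), total_seconds)
```

Converting `duration` to a float at load time keeps later aggregation, such as total hours per speaker or per domain, free of string handling.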

Licensing

This dataset is released under the Nwulite Obodo Open Data License (NOODL).

Info Sheet Purpose & Intended Audience

Preamble:
Much of African culture and heritage is rooted in orality. Thus the preservation of culture is predicated on the preservation, use, and mainstreaming of local languages. This is why licensing African language datasets is vital to ensure that data creation, sharing, and use uphold the rights and interests of local communities.

Well-defined licenses promote transparency and fairness by specifying how data can be accessed, reused, or commercialized. They safeguard cultural knowledge, linguistic heritage, and collective ownership, prevent exploitation by external actors, and ensure that benefits are shared equitably.

The Dholuo speech dataset was created by participants from the Dholuo language community in Kenya, supported by researchers from Maseno Centre for Applied Artificial Intelligence (MCAAI), Maseno University, and researchers from the Center for Intellectual Property and Information Technology Law (CIPIT), Strathmore University.

Through a consultative process and with the support of CIPIT and MCAAI, community representatives helped determine the most suitable licensing framework and deepened their understanding of intellectual property principles and the differences between licensing options. The representatives were also involved in creating this dataset, which was collected between October 2025 and November 2025 with support from Mozilla Foundation and GIZ FAIR Forward.

We extend our sincere appreciation to the dedicated Dholuo community, the meticulous data collectors, and the visionary license creators who made this pathway possible.

License Terms

The Dholuo Speech dataset is licensed under the Nwulite Obodo Open Data License (NOODL). View the license at https://licensingafricandatasets.com/nwulite-obodo-license. It applies to the DhoNam: Dholuo speech dataset, covering adult speakers from the Kenyan Dholuo community.

Honoring the Dholuo Heritage Through Responsible Data Use:
When you use this dataset, you become part of a meaningful partnership with the Dholuo community members who contributed to creating this valuable resource. Acknowledge these dedicated contributors in your work, helping to celebrate and strengthen their rich cultural identity while ensuring that any applications of this data honor and preserve the Dholuo language and traditions for future generations.

Geographical Precincts:

Users from developing countries are invited to share materials created with this dataset under a license similar to NOODL.
Users from developed countries are encouraged to establish partnerships that bring tangible benefits to Dholuo community contributors:
- Language Vitality & Learning: Support digital tools, apps, and resources to learn, practice, and use Dholuo.
- Technology & Translation Tools: Share translation software, AI applications, and language processing tools.
- Economic Empowerment: Support community-led projects that create jobs and business opportunities.
- Educational Access & Growth: Provide educational materials, scholarships, and digital resources for literacy and development.

These reciprocal partnerships ensure that technological advancement serves innovation while giving back to the contributors who made this dataset possible.

Intended Use

Training ASR models for Dholuo
Speech-to-text (STT) research
Text-to-speech (TTS) training
Linguistic documentation and analysis

⚠️ Limitations

No child speech included

📬 Contact

Project: Dholuo Voice Data Collection Project
Year: 2025
Organization: Maseno Centre for Applied Artificial Intelligence (MCAAI)