DhoNam: Dholuo Speech dataset
License: NOODL-1.0
Steward: Maseno Centre for Applied Artificial Intelligence (MCAAI)
Task: ASR
Release Date: 12/20/2025
Format: WEBM
Size: 2.49 GB
Description
DhoNam: Dholuo Speech dataset is a speech corpus designed to support Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of Kenya’s major indigenous languages. It contains native-speaker audio recordings collected through a platform where users read a displayed sentence aloud, together with the prompt sentence that was read for each recording.
Specifics
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Considerations
Restrictions/Special Constraints
You agree that you will not re-host or re-share this dataset.
Forbidden Usage
You agree not to attempt to determine the identity of speakers in this dataset. Any attempt to clone the voices of the speakers in this dataset, or to train models that imitate them, is forbidden.
Processes
Ethical Review
How informed consent was obtained: All contributors were given information about the data collection process, the intended use of the data, and the license under which the data would be released. Participants had to confirm that they understood this information and consented to contribute.
Intended Use
This dataset is intended for use in creating automatic speech recognition systems.
Metadata
DhoNam Dholuo Speech dataset – Technical Datasheet
Overview
The DhoNam: Dholuo Speech dataset is a speech corpus designed to support Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of Kenya’s major indigenous languages. Spoken by 4.2–5 million people in Kenya and parts of Tanzania, Dholuo is one of the most widely spoken Nilotic languages.
This dataset contains native-speaker audio recordings collected through a platform where users read aloud a displayed sentence.
This dataset was created with support from the Mozilla Foundation and GIZ FAIR Forward.
Language Description
Name: Dholuo (also Luo)
Family: Nilo-Saharan → Nilotic → Western Nilotic → Luo languages
ISO 639-2 Code: luo
Dialects: Milambo and Nyanduat
Where Dholuo is Spoken
Predominantly in western Kenya:
Kisumu
Siaya
Homa Bay
Migori
Parts of Kericho, Nandi, and Nakuru
Smaller communities in northern Tanzania
About the Language
Tonal, but tone is not marked in writing
Uses a Latin-based writing system
Rich oral tradition: storytelling, folklore, and songs
Source Type
Read speech (user reads a sentence displayed on the platform)
⚠️ Both audio recordings and texts are stored.
Contributors
Native Dholuo speakers from Dholuo-speaking counties
Adults aged 19+ with informed consent
Balanced gender distribution where possible
🕒 Collection Timeframe
Collected between October and November 2025 as part of the Dholuo Voice Data Collection Project
🎧 Recording Conditions
Indoor environments with no background noise
Recorded via smartphones through a web interface
Domains Represented
General
Agriculture
Technology and robotics
Healthcare
News and current affairs
These domains reflect real spoken Dholuo usage.
📊 Dataset Size
Total duration: 184,838.28 sec (3080.64 min, 51.34 hr)
Number of Speakers: 59
Number of Reviewers: 7
Total audio files: 26,091
Average clip length: 7.08 sec
Minimum clip length: 1.72 sec
Maximum clip length: 62.64 sec
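The reported totals can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the figures listed above; the variable names are illustrative and are not part of the dataset.

```python
# Figures copied from the Dataset Size list; names are illustrative only.
TOTAL_SECONDS = 184_838.28
TOTAL_FILES = 26_091

# Convert the total duration to minutes and hours, rounded to two decimals.
total_minutes = round(TOTAL_SECONDS / 60, 2)
total_hours = round(TOTAL_SECONDS / 3600, 2)

# Mean clip length is total duration divided by the number of audio files.
average_clip_seconds = round(TOTAL_SECONDS / TOTAL_FILES, 2)

print(total_minutes, total_hours, average_clip_seconds)
```

The computed values agree with the reported 3080.64 min, 51.34 hr, and 7.08 sec average clip length.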
Gender Distribution
Female: 34 speakers (57.6%) - 15,279 clips (58.6%)
Male: 25 speakers (42.4%) - 10,812 clips (41.4%)
Age Distribution
19–29: 28 speakers (47.5%) - 13,332 clips (51.1%)
30–39: 17 speakers (28.8%) - 6,469 clips (24.8%)
40–49: 10 speakers (16.9%) - 4,781 clips (18.3%)
50–59: 4 speakers (6.8%) - 1,509 clips (5.8%)
60+: 0 speakers (0.0%) - 0 clips (0.0%)
Dialect Distribution
Milambo: 36 speakers (61.0%) - 16,697 clips (64.0%)
Nyanduat: 23 speakers (39.0%) - 9,394 clips (36.0%)
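The percentages in the distribution lists can be recomputed from the raw counts. The sketch below treats the first number in each row as a speaker count and the second as a clip count, an interpretation consistent with the totals of 59 speakers and 26,091 audio files. The helper name `shares` and the variable names are hypothetical.

```python
def shares(counts):
    """Return each group's percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}

# Counts copied from the lists above.
# Assumption: first figure per row = speakers, second figure = clips.
speakers_by_gender = {"Female": 34, "Male": 25}
clips_by_gender = {"Female": 15_279, "Male": 10_812}
speakers_by_dialect = {"Milambo": 36, "Nyanduat": 23}
clips_by_dialect = {"Milambo": 16_697, "Nyanduat": 9_394}

gender_speaker_shares = shares(speakers_by_gender)
gender_clip_shares = shares(clips_by_gender)
dialect_speaker_shares = shares(speakers_by_dialect)
dialect_clip_shares = shares(clips_by_dialect)

print(gender_speaker_shares, gender_clip_shares)
print(dialect_speaker_shares, dialect_clip_shares)
```

The same helper can be applied to the age-group counts. Comparing the speaker share with the clip share for a group shows whether its speakers recorded more or fewer clips than average.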
Contributor Metadata
| Field | Description |
|---|---|
| contributor_id | Speaker code |
| Contributor name | Contributor’s full name |
| ID_Number | Unique contributor ID |
| gender | M / F |
| age_group | 18–29, 30–39, etc |
| location | Kisumu / Siaya etc |
| Constituency | Kisumu West etc |
| Education level | Tertiary / Graduate |
| Employment status | Student / Self-Employed / Employed |
| dialect | Milambo / Nyanduat |
| consent | Confirmed informed consent |
| License | NOODL Description |
Data Structure
| Field | Description |
|---|---|
| id | Unique file ID |
| speaker_id | Speaker code |
| gender | M / F |
| age_group | 18–29, 30–39, etc |
| county | Kisumu, etc |
| duration | Audio length in sec |
| prompt_id | ID of the sentence shown |
| domain | General / Agriculture / Technology and robotics / Healthcare / News and current affairs |
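As an illustration of how these fields might be used, the following Python sketch parses per-clip metadata and totals the audio duration per domain. The source does not say how the metadata is distributed, so the CSV layout, the sample rows, and the helper name `duration_by_domain` are all assumptions made for this example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows using the field names from the table above.
# The values are invented for illustration and are not taken from the dataset.
SAMPLE_CSV = """id,speaker_id,gender,age_group,county,duration,prompt_id,domain
clip_0001,spk_01,F,30-39,Kisumu,6.50,p_0101,Healthcare
clip_0002,spk_02,M,30-39,Siaya,8.25,p_0102,Agriculture
clip_0003,spk_01,F,30-39,Kisumu,5.00,p_0103,Healthcare
"""

def duration_by_domain(csv_text):
    """Sum clip durations (in seconds) per domain from CSV metadata text."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["domain"]] += float(row["duration"])
    return dict(totals)

totals = duration_by_domain(SAMPLE_CSV)
print(totals)
```

The same pattern extends to grouping by `speaker_id`, which is useful for building training and evaluation splits that do not share speakers.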
Licensing
This dataset is released under the Nwulite Obodo Open Data License (NOODL).
Info Sheet Purpose & Intended Audience
Preamble:
Much of African culture and heritage is rooted in orality. Thus the preservation of culture is predicated on the preservation, use, and mainstreaming of local languages. This is why licensing African language datasets is vital to ensure that data creation, sharing, and use uphold the rights and interests of local communities.
Well-defined licenses promote transparency and fairness by specifying how data can be accessed, reused, or commercialized. They safeguard cultural knowledge, linguistic heritage, and collective ownership, prevent exploitation by external actors, and ensure that benefits are shared equitably.
The Dholuo speech dataset was created by participants from the Dholuo language community in Kenya, supported by researchers from Maseno Centre for Applied Artificial Intelligence (MCAAI), Maseno University, and researchers from the Center for Intellectual Property and Information Technology Law (CIPIT), Strathmore University.
Through a consultative process supported by CIPIT and MCAAI, community representatives helped determine the most suitable licensing framework and deepened their understanding of intellectual property principles and the differences between licensing options. The representatives were also involved in creating this dataset, which was collected between October 2025 and November 2025 with support from the Mozilla Foundation and GIZ FAIR Forward.
We extend our profound and sincere appreciation to the dedicated Dholuo community, the meticulous data collectors, and the visionary license creators who made this pathway possible.
License Terms
The DhoNam: Dholuo Speech dataset, which covers adult speakers from the Kenyan Dholuo community, is licensed under the Nwulite Obodo Open Data License (NOODL). The license text is available at https://licensingafricandatasets.com/nwulite-obodo-license
Honoring the Dholuo Heritage Through Responsible Data Use:
When you use this dataset, you become part of a meaningful partnership with the Dholuo community members who contributed to creating this valuable resource. Acknowledge these dedicated contributors in your work, helping to celebrate and strengthen their rich cultural identity while ensuring that any applications of this data honor and preserve the Dholuo language and traditions for future generations.
Geographical Precincts:
Users from developing countries are invited to share materials created with this dataset under a license similar to NOODL.
Users from developed countries are encouraged to establish partnerships that bring tangible benefits to Dholuo community contributors:
Language Vitality & Learning: Support digital tools, apps, and resources to learn, practice, and use Dholuo.
Technology & Translation Tools: Share translation software, AI applications, and language processing tools.
Economic Empowerment: Support community-led projects that create jobs and business opportunities.
Educational Access & Growth: Provide educational materials, scholarships, and digital resources for literacy and development.
These reciprocal partnerships ensure that technological advancement serves innovation while giving back to the contributors who made this dataset possible.
Intended Use
Training ASR models for Dholuo
Speech-to-text (STT) research
Text-to-speech (TTS) training
Linguistic documentation and analysis
⚠️ Limitations
No child speech included
📬 Contact
Project: Dholuo Voice Data Collection Project
Year: 2025
Organization: Maseno Centre for Applied Artificial Intelligence (MCAAI)