DhoNam: Dholuo Speech dataset
License: NOODL-1.0
Steward: Maseno Centre for Applied Artificial Intelligence (MCAAI)
Task: ASR
Release Date: 12/20/2025
Format: WEBM
Size: 2.49 GB
Description
DhoNam: Dholuo Speech dataset is a speech corpus designed to supercharge Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of Kenya’s major indigenous languages. This dataset contains native-speaker audio recordings collected through a platform where users read aloud a displayed sentence. The dataset includes the audio recordings and the corresponding prompt/sentence that was read.
Specifics
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Considerations
Restrictions/Special Constraints
You agree that you will not re-host or re-share this dataset.
Forbidden Usage
You agree not to attempt to determine the identity of speakers in this dataset.
Any attempt to clone the voice of, or train models that imitate, the speakers in this dataset is forbidden.
Processes
Ethical Review
How informed consent was obtained: All contributors were given information about the data collection process, the intended use of the data, and the license under which the data was to be released. Participants had to confirm that they understood this information and consented to being contributors.
Intended Use
This dataset is intended for use in creating automatic speech recognition systems.
Metadata
🎤 DhoNam Dholuo Speech dataset – Technical Datasheet
🌍 Overview
The DhoNam: Dholuo Speech dataset is a speech corpus designed to supercharge Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of Kenya’s major indigenous languages. Spoken by 4.2–5 million people in Kenya and parts of Tanzania, Dholuo is one of the most widely spoken Nilotic languages.
This dataset contains native-speaker audio recordings collected through a platform where users read aloud a displayed sentence.
This dataset was created with support from the Mozilla Foundation and GIZ FAIR Forward.
🗣 Language Description
Name: Dholuo (also Luo)
Family: Nilo-Saharan → Nilotic → Western Nilotic → Luo languages
ISO 639-2 Code: luo
Dialects: Milambo and Nyanduat
Where Dholuo is Spoken
Predominantly in western Kenya:
- Kisumu
- Siaya
- Homa Bay
- Migori
- Parts of Kericho, Nandi, and Nakuru
Smaller communities in northern Tanzania
About the Language
- Tonal, but tone is not marked in writing
- Uses a Latin-based writing system
- Rich oral tradition: storytelling, folklore, and songs
📂 Source Type
Read speech (user reads a sentence displayed on the platform)
⚠️ Both audio recordings and texts are stored.
🙋 Contributors
- Native Dholuo speakers from Dholuo-speaking counties
- Adults aged 19+ with informed consent
- Balanced gender distribution where possible
🕒 Collection Timeframe
Collected between October and November 2025 as part of the Dholuo Voice Data Collection Project.
🎧 Recording Conditions
- Indoor environments with no background noise
- Recorded via smartphones through a web interface
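The clips are distributed as WEBM (see the Format field above), while many ASR toolkits expect mono WAV input. The sketch below only builds an `ffmpeg` command for converting one clip; it does not run it. The file path, the output directory, and the 16 kHz target rate are assumptions for illustration and are not specified by this datasheet.

```python
from pathlib import Path


def ffmpeg_to_wav_cmd(src: str, dst_dir: str = "wav", sample_rate: int = 16000) -> list[str]:
    """Build (but do not run) an ffmpeg command that converts one clip to mono WAV."""
    dst = Path(dst_dir) / (Path(src).stem + ".wav")
    return [
        "ffmpeg",
        "-i", src,                 # input WEBM clip
        "-ac", "1",                # downmix to a single (mono) channel
        "-ar", str(sample_rate),   # resample to the target rate (assumed 16 kHz)
        str(dst),
    ]


# Hypothetical file name; real clip names in the dataset may differ.
cmd = ffmpeg_to_wav_cmd("clips/example_0001.webm")
print(" ".join(cmd))
```

To perform the conversion, the resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where `ffmpeg` is installed and the output directory exists.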
Domains Represented
- General
- Agriculture
- Technology and robotics
- Healthcare
- News and current affairs
These domains reflect real spoken Dholuo usage.
📊 Dataset Size
Total duration: 184,838.28 sec (3080.64 min, 51.34 hr)
Number of Speakers: 59
Number of Reviewers: 7
Total audio files: 26,091
Average clip length: 7.08 sec
Minimum clip length: 1.72 sec
Maximum clip length: 62.64 sec
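As a quick consistency check, the minute, hour, and average-clip figures can be re-derived from the total duration and file count reported above. This is a minimal sketch using only the published numbers; the variable names are illustrative.

```python
# Figures taken from the Dataset Size section of this datasheet.
total_seconds = 184_838.28
total_files = 26_091

minutes = total_seconds / 60
hours = total_seconds / 3600
mean_clip = total_seconds / total_files

print(f"{minutes:.2f} min")             # 3080.64 min
print(f"{hours:.2f} hr")                # 51.34 hr
print(f"{mean_clip:.2f} sec per clip")  # 7.08 sec per clip
```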
Gender Distribution
Female: 34 speakers (57.6%), 15,279 clips (58.6%)
Male: 25 speakers (42.4%), 10,812 clips (41.4%)
Age Distribution
19–29: 28 speakers (47.5%), 13,332 clips (51.1%)
30–39: 17 speakers (28.8%), 6,469 clips (24.8%)
40–49: 10 speakers (16.9%), 4,781 clips (18.3%)
50–59: 4 speakers (6.8%), 1,509 clips (5.8%)
60+: 0 speakers (0.0%), 0 clips (0.0%)
Dialect Distribution
Milambo: 36 speakers (61.0%), 16,697 clips (64.0%)
Nyanduat: 23 speakers (39.0%), 9,394 clips (36.0%)
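Each group in the gender, age, and dialect distributions is listed with two counts. In every distribution the first counts sum to 59 and the second to 26,091, which matches the reported number of speakers and audio files. The sketch below recomputes the percentage shares from those counts; the dictionary layout and function name are illustrative, not part of the dataset.

```python
# (speakers, audio files) per group, copied from the distributions above.
distributions = {
    "gender": {"Female": (34, 15_279), "Male": (25, 10_812)},
    "age": {
        "19-29": (28, 13_332),
        "30-39": (17, 6_469),
        "40-49": (10, 4_781),
        "50-59": (4, 1_509),
        "60+": (0, 0),
    },
    "dialect": {"Milambo": (36, 16_697), "Nyanduat": (23, 9_394)},
}


def shares(groups):
    """Return {group: (speaker share %, clip share %)}, rounded to one decimal."""
    total_speakers = sum(s for s, _ in groups.values())
    total_clips = sum(c for _, c in groups.values())
    return {
        name: (round(100 * s / total_speakers, 1), round(100 * c / total_clips, 1))
        for name, (s, c) in groups.items()
    }


for field, groups in distributions.items():
    print(field, shares(groups))
```

Comparing the two shares shows where a group contributes more clips than its head count suggests, for example Milambo speakers (61.0% of speakers, 64.0% of clips).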
Contributor Metadata
Field: Description
contributor_id: Speaker code
Contributor name: Contributor's full names
ID_Number: Unique contributor ID
gender: M / F
age_group: 18–29, 30–39, etc.
location: Kisumu / Siaya, etc.
Constituency: Kisumu West, etc.
Education level: Tertiary / Graduate
Employment status: Student / Self-Employed / Employed
dialect: Milambo / Nyanduat
consent: Confirmed informed consent
License: NOODL
Data Structure
Field: Description
id: Unique file ID
speaker_id: Speaker code
gender: M / F
age_group: 18–29, 30–39, etc.
county: Kisumu, etc.
duration: Audio length in sec
prompt_id: ID of the sentence shown
domain: General / Agriculture / Technology and robotics / Healthcare / News and current affairs
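The per-clip fields above lend themselves to simple bookkeeping, such as totalling duration per domain or holding out whole speakers for evaluation. The sketch below assumes the metadata is available as CSV with exactly these column names. That layout is an assumption, since the datasheet does not state the metadata file format, and the rows shown are made-up placeholders rather than real dataset entries.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows for illustration only; real IDs and values will differ.
SAMPLE = """id,speaker_id,gender,age_group,county,duration,prompt_id,domain
clip_0001,spk_01,F,30-39,Kisumu,6.40,p_0001,Healthcare
clip_0002,spk_01,F,30-39,Kisumu,8.10,p_0002,Agriculture
clip_0003,spk_02,M,30-39,Siaya,5.25,p_0003,Healthcare
clip_0004,spk_03,F,40-49,Migori,7.75,p_0004,General
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Total seconds of audio per domain.
seconds_by_domain = defaultdict(float)
for row in rows:
    seconds_by_domain[row["domain"]] += float(row["duration"])

# Speaker-disjoint split: hold out whole speakers so that no voice
# appears in both the training and the evaluation set.
held_out_speakers = {"spk_03"}
train_ids = [r["id"] for r in rows if r["speaker_id"] not in held_out_speakers]
test_ids = [r["id"] for r in rows if r["speaker_id"] in held_out_speakers]

for domain, seconds in sorted(seconds_by_domain.items()):
    print(f"{domain}: {seconds:.2f} sec")
print("train:", train_ids)
print("test:", test_ids)
```

With only 59 speakers in the corpus, splitting by `speaker_id` rather than by clip is one way to keep evaluation results from being inflated by speakers the model has already heard.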
📜 Licensing
This dataset is released under the Nwulite Obodo Open Data License (NOODL).
Info Sheet
Purpose & Intended Audience
Preamble:
Much of African culture and heritage is rooted in orality. Thus the preservation of culture is predicated on the preservation, use, and mainstreaming of local languages. This is why licensing African language datasets is vital to ensure that data creation, sharing, and use uphold the rights and interests of local communities.
Well-defined licenses promote transparency and fairness by specifying how data can be accessed, reused, or commercialized. They safeguard cultural knowledge, linguistic heritage, and collective ownership, prevent exploitation by external actors, and ensure that benefits are shared equitably.
The Dholuo speech dataset was created by participants from the Dholuo language community in Kenya, supported by researchers from Maseno Centre for Applied Artificial Intelligence (MCAAI), Maseno University, and researchers from the Center for Intellectual Property and Information Technology Law (CIPIT), Strathmore University.
Through a consultative process supported by CIPIT and MCAAI, community representatives helped determine the most suitable licensing framework and deepened their understanding of intellectual property principles and the differences between licensing options. The representatives were also involved in creating this dataset, which was collected between October and November 2025 with support from the Mozilla Foundation and GIZ FAIR Forward.
We extend our sincere appreciation to the dedicated Dholuo community, the meticulous data collectors, and the visionary license creators who made this pathway possible.
License Terms
The DhoNam: Dholuo Speech dataset is licensed under the Nwulite Obodo Open Data License (NOODL). The license text is available at https://licensingafricandatasets.com/nwulite-obodo-license. The license applies to the full dataset, which covers adult speakers from the Kenyan Dholuo community.
Honoring the Dholuo Heritage Through Responsible Data Use: When you use this dataset, you become part of a meaningful partnership with the Dholuo community members who contributed to creating this valuable resource. Acknowledge these dedicated contributors in your work, helping to celebrate and strengthen their rich cultural identity while ensuring that any applications of this data honor and preserve the Dholuo language and traditions for future generations.
Geographical Precincts:
Users from developing countries are invited to share materials created with this dataset under a license similar to NOODL.
Users from developed countries are encouraged to establish partnerships that bring tangible benefits to Dholuo community contributors:
- Language Vitality & Learning: Support digital tools, apps, and resources to learn, practice, and use Dholuo.
- Technology & Translation Tools: Share translation software, AI applications, and language processing tools.
- Economic Empowerment: Support community-led projects that create jobs and business opportunities.
- Educational Access & Growth: Provide educational materials, scholarships, and digital resources for literacy and development.
These reciprocal partnerships ensure that technological advancement serves innovation while giving back to the contributors who made this dataset possible.
🎯 Intended Use
- Training ASR models for Dholuo
- STT research
- TTS training
- Linguistic documentation and analysis
⚠️ Limitations
- No child speech included
📬 Contact
Project: Dholuo Voice Data Collection Project
Year: 2025
Organization: Maseno Centre for Applied Artificial Intelligence (MCAAI)
Principal Investigator: Lilian D.A Wanzare
Email: mcaailab@gmail.com