DhoNam: Dholuo Speech dataset
License:
NOODL-1.0
Steward:
Maseno Centre for Applied Artificial Intelligence (MCAAI)
Task: ASR
Release Date: 12/20/2025
Format: WEBM
Size: 2.49 GB
Description
DhoNam: Dholuo Speech dataset is a speech corpus designed to supercharge Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of Kenya’s major indigenous languages. This dataset contains native-speaker audio recordings collected through a platform where users read aloud a displayed sentence. The dataset includes the audio recordings and the corresponding prompt/sentence that was read.
Specifics
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Considerations
Restrictions/Special Constraints
You agree that you will not re-host or re-share this dataset.
Forbidden Usage
You agree not to attempt to determine the identity of speakers in this dataset. Any attempt to clone the voices of the speakers in this dataset, or to train models that imitate them, is forbidden.
Processes
Ethical Review
How informed consent was obtained: All contributors were given information about the data collection process, the intended use of the data, and the license under which the data would be released. Participants had to confirm that they understood this information and consented to being contributors.
Intended Use
This dataset is intended for use in creating automatic speech recognition systems.
Metadata
🎤 DhoNam Dholuo Speech dataset – Technical Datasheet
🌍 Overview
The DhoNam: Dholuo Speech dataset is a speech corpus designed to supercharge Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of Kenya’s major indigenous languages. Spoken by 4.2–5 million people in Kenya and parts of Tanzania, Dholuo is one of the most widely spoken Nilotic languages. This dataset contains native-speaker audio recordings collected through a platform where users read aloud a displayed sentence. Importantly, the dataset includes both the audio recordings and the corresponding prompt/sentence that was read. This dataset was created with support from Mozilla Foundation and GIZ FAIR Forward.
🗣 Language Description
Language Name: Dholuo (also Luo)
Language Family: Nilo-Saharan → Nilotic → Western Nilotic → Luo languages
ISO 639-2 Code: luo
Dialects: Milambo and Nyanduat
Where Dholuo is Spoken
Predominantly in western Kenya:
Kisumu County
Siaya County
Homa Bay County
Migori County
Parts of Kericho, Nandi, and Nakuru
Smaller communities in northern Tanzania
About the Language
Tonal, but tone is not marked in writing
Uses a Latin-based writing system
Rich oral tradition: storytelling, folklore, and songs
📂 Source Type
Read speech (user reads a sentence displayed on the platform)
⚠️ Both audio recordings and texts are stored.
🙋 Contributors
Native Dholuo speakers from Dholuo-speaking counties
Adults aged 19+ with informed consent
Balanced gender distribution where possible
🕒 Collection Timeframe
Collected between October and November 2025 as part of the Dholuo Voice Data Collection Project
🎧 Recording Conditions
Indoor environments with no background noise
Recorded via smartphones through a web interface
Domains Represented
General
Agriculture
Technology and robotics
Healthcare
News and current affairs
These domains reflect real spoken Dholuo usage.
📊 Dataset Size
Total duration: 184838.28 seconds (3080.64 minutes, 51.34 hours)
Number of speakers: 59
Number of Validators/Reviewers: 7
Total audio files: 26091
Average clip length: 7.08 seconds
Minimum clip length: 1.72 seconds
Maximum clip length: 62.64 seconds
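The duration figures above can be cross-checked against each other with simple arithmetic. The following is a minimal sketch, not part of the dataset tooling: the variable names are illustrative, and the only inputs are the total duration and file count quoted in this datasheet.

```python
# Totals as quoted in this datasheet (names are illustrative).
TOTAL_SECONDS = 184838.28
TOTAL_FILES = 26091

# Convert the total duration to minutes and hours, rounded to two decimals.
minutes = round(TOTAL_SECONDS / 60, 2)
hours = round(TOTAL_SECONDS / 3600, 2)

# Average clip length is total duration divided by the number of files.
average_clip = round(TOTAL_SECONDS / TOTAL_FILES, 2)

print(minutes, hours, average_clip)
```

The three values agree with the reported 3080.64 minutes, 51.34 hours, and 7.08-second average clip length.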
Gender Distribution
Female speakers: 34 (57.6%) - 15279 recordings (58.6%)
Male speakers: 25 (42.4%) - 10812 recordings (41.4%)
Age Distribution
19–29: 28 speakers (47.5%) - 13332 recordings (51.1%)
30–39: 17 speakers (28.8%) - 6469 recordings (24.8%)
40–49: 10 speakers (16.9%) - 4781 recordings (18.3%)
50–59: 4 speakers (6.8%) - 1509 recordings (5.8%)
60+: 0 speakers (0.0%) - 0 recordings (0.0%)
Dialect Distribution
Milambo: 36 speakers (61.0%) - 16697 recordings (64.0%)
Nyanduat: 23 speakers (39.0%) - 9394 recordings (36.0%)
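The percentages in the gender, age, and dialect distributions follow from the speaker and recording counts. The sketch below recomputes them as a sanity check; the dictionary layout and function names are assumptions made for illustration, and the counts are copied from this datasheet.

```python
TOTAL_SPEAKERS = 59
TOTAL_RECORDINGS = 26091

# (speakers, recordings) per group, copied from this datasheet.
DISTRIBUTIONS = {
    "gender": {"Female": (34, 15279), "Male": (25, 10812)},
    "age": {
        "19–29": (28, 13332),
        "30–39": (17, 6469),
        "40–49": (10, 4781),
        "50–59": (4, 1509),
        "60+": (0, 0),
    },
    "dialect": {"Milambo": (36, 16697), "Nyanduat": (23, 9394)},
}


def percent(count, total):
    """Share of `count` in `total`, as a percentage with one decimal."""
    return round(100 * count / total, 1)


def summarize(groups):
    """Map each group to (speaker share, recording share)."""
    return {
        name: (percent(s, TOTAL_SPEAKERS), percent(r, TOTAL_RECORDINGS))
        for name, (s, r) in groups.items()
    }


for attribute, groups in DISTRIBUTIONS.items():
    # Each breakdown should account for every speaker and every recording.
    assert sum(s for s, _ in groups.values()) == TOTAL_SPEAKERS
    assert sum(r for _, r in groups.values()) == TOTAL_RECORDINGS
    print(attribute, summarize(groups))
```

Each breakdown sums to 59 speakers and 26091 recordings, and the recomputed shares match the percentages listed above.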
Contributor Metadata
Field: Description
contributor_id: Pseudonymous speaker code
Contributor name: Contributor’s full names
ID_Number: Unique contributor identifier
gender: Male / Female
age_group: 18–29, 30–39, 40–49, 50–59, 60+
location: Kisumu / Siaya / Homa Bay / Migori
Constituency: Kisumu West / Karachuonyo
Education level: Tertiary / Graduate /
Employment status: Student / Self-Employed / Unemployed / Employed
dialect: Milambo / Nyanduat
consent: Confirmed informed consent
License: NOODL
Description
Data Structure
Field: Description
id: Unique audio file ID
speaker_id: Pseudonymous speaker code
gender: Male / Female
age_group: 18–29, 30–39, 40–49, 50–59, 60+
county: Kisumu / Siaya / Homa Bay / Migori
duration: Audio length in seconds
prompt_id: ID of the sentence shown
domain: General / Agriculture / Technology_robotics / Healthcare / News_current_affairs
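As an illustration of how the per-clip fields listed under Data Structure might be consumed, the sketch below models one record and totals clip duration per domain. It rests on several assumptions: this datasheet does not state the metadata file format, so a CSV with a header row matching the field names is assumed, and the `ClipRecord` class, the helper functions, and the sample rows are all invented placeholders rather than real dataset entries.

```python
import csv
import io
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipRecord:
    """One clip, using the field names listed under Data Structure."""
    id: str
    speaker_id: str
    gender: str
    age_group: str
    county: str
    duration: float  # audio length in seconds
    prompt_id: str
    domain: str


def read_records(handle):
    """Parse CSV rows (assumed format) into ClipRecord objects."""
    return [
        ClipRecord(**{**row, "duration": float(row["duration"])})
        for row in csv.DictReader(handle)
    ]


def seconds_per_domain(records):
    """Sum clip durations for each domain."""
    totals = defaultdict(float)
    for record in records:
        totals[record.domain] += record.duration
    return dict(totals)


# Placeholder rows for illustration only; not taken from the dataset.
SAMPLE = """id,speaker_id,gender,age_group,county,duration,prompt_id,domain
clip_0001,spk_01,Female,30–39,Kisumu,6.50,p_001,Agriculture
clip_0002,spk_02,Male,40–49,Siaya,4.25,p_002,Healthcare
clip_0003,spk_01,Female,30–39,Kisumu,3.50,p_003,Agriculture
"""

records = read_records(io.StringIO(SAMPLE))
print(seconds_per_domain(records))
```

Grouping by `speaker_id` instead of `domain` in the same way would be one route to building speaker-disjoint training and evaluation splits.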
📜 Licensing
This dataset is released under the Nwulite Obodo Open Data License (NOODL). The info sheet for the license is provided.
https://docs.google.com/document/d/1v7pE0yk_N18JpyZW5WXSejpsgsTfo2yqLCRPDDGwuoY/edit?usp=drivesdk
🎯 Intended Use
Training ASR models for Dholuo
Speech-to-text research
Text-to-speech training
Linguistic documentation and analysis
⚠️ Limitations
No child speech included
📬 Contact
Project: Dholuo Voice Data Collection Project
Year: 2025
Organization: Maseno Centre for Applied Artificial Intelligence (MCAAI)
Principal Investigator: Lilian D.A Wanzare
Email: mcaailab@gmail.com
