DhoNam: Dholuo Speech dataset

License:

NOODL-1.0

Steward:

Maseno Centre for Applied Artificial Intelligence (MCAAI)

Task: ASR

Release Date: 12/20/2025

Format: WEBM

Size: 2.49 GB


Description

DhoNam: Dholuo Speech dataset is a speech corpus designed to supercharge Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of Kenya’s major indigenous languages. This dataset contains native-speaker audio recordings collected through a platform where users read aloud a displayed sentence. The dataset includes the audio recordings and the corresponding prompt/sentence that was read.

Specifics

Licensing

Nwulite Obodo Open Data License 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

You agree that you will not re-host or re-share this dataset.

Forbidden Usage

You agree not to attempt to determine the identity of speakers in this dataset. Any attempt to clone the voices of the speakers, or to train models that imitate them, is forbidden.

Processes

Ethical Review

How informed consent was obtained: All contributors were given information about the data collection process, the intended use of the data, and the license under which the data would be released. Participants had to confirm that they understood this information and consented to contributing.

Intended Use

This dataset is intended for use in creating automatic speech recognition systems.

Metadata

DhoNam Dholuo Speech dataset – Technical Datasheet

Overview

The DhoNam: Dholuo Speech dataset is a speech corpus designed to supercharge Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of Kenya’s major indigenous languages. Spoken by 4.2–5 million people in Kenya and parts of Tanzania, Dholuo is one of the most widely spoken Nilotic languages.

This dataset contains native-speaker audio recordings collected through a platform where users read aloud a displayed sentence.

This dataset was created with support from the Mozilla Foundation and GIZ FAIR Forward.

Language Description

  • Name: Dholuo (also Luo)

  • Family: Nilo-Saharan → Nilotic → Western Nilotic → Luo languages

  • ISO 639-2 Code: luo

  • Dialects: Milambo and Nyanduat

Where Dholuo is Spoken

Predominantly in western Kenya:

  • Kisumu

  • Siaya

  • Homa Bay

  • Migori

  • Parts of Kericho, Nandi, and Nakuru

  • Smaller communities in northern Tanzania

About the Language

  • Tonal, but tone is not marked in writing

  • Uses a Latin-based writing system

  • Rich oral tradition: storytelling, folklore, and songs

Source Type

  • Read speech (user reads a sentence displayed on the platform)

⚠️ Both audio recordings and texts are stored.

Contributors

  • Native Dholuo speakers from Dholuo-speaking counties

  • Adults aged 19+ with informed consent

  • Balanced gender distribution where possible

🕒 Collection Timeframe

  • Collected between October and November 2025 as part of the Dholuo Voice Data Collection Project

🎧 Recording Conditions

  • Indoor environments with no background noise

  • Recorded via smartphones through a web interface
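
The clips are distributed as WEBM files, while many ASR toolchains expect mono WAV input. The sketch below builds an ffmpeg command for that conversion. The 16 kHz sample rate, the file paths, and the helper name `build_ffmpeg_command` are assumptions for illustration; none of them comes from the dataset documentation.

```python
import shlex
from pathlib import Path


def build_ffmpeg_command(src, dst_dir, sample_rate=16_000):
    """Return the ffmpeg argument list that decodes one WEBM clip to mono WAV."""
    src = Path(src)
    dst = Path(dst_dir) / (src.stem + ".wav")
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(src),            # input clip
        "-ac", "1",                # downmix to a single channel
        "-ar", str(sample_rate),   # resample (16 kHz is an assumed target)
        str(dst),
    ]


# Hypothetical paths, used only to show the resulting command line.
cmd = build_ffmpeg_command("clips/example.webm", "wav")
print(shlex.join(cmd))
```

To run the conversion, pass the list to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed. Converted copies should stay local, since the dataset terms forbid re-hosting or re-sharing.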

Domains Represented

  • General

  • Agriculture

  • Technology and robotics

  • Healthcare

  • News and current affairs

These domains reflect real spoken Dholuo usage.

📊 Dataset Size

  • Total duration: 184,838.28 sec (3080.64 min, 51.34 hr)

  • Number of Speakers: 59

  • Number of Reviewers: 7

  • Total audio files: 26,091

  • Average clip length: 7.08 sec

  • Minimum clip length: 1.72 sec

  • Maximum clip length: 62.64 sec
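
As a quick consistency check, the minute, hour, and average-clip figures can be recomputed from the total duration and file count reported above. This is a minimal sketch that only uses the numbers in this list; the variable names are illustrative.

```python
# Figures copied from the Dataset Size list.
TOTAL_SECONDS = 184_838.28
TOTAL_FILES = 26_091

total_minutes = TOTAL_SECONDS / 60
total_hours = TOTAL_SECONDS / 3600
average_clip_seconds = TOTAL_SECONDS / TOTAL_FILES

print(f"{total_minutes:.2f} min")                    # 3080.64 min
print(f"{total_hours:.2f} hr")                       # 51.34 hr
print(f"{average_clip_seconds:.2f} sec per clip")    # 7.08 sec per clip
```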

Gender Distribution

  • Female: 34 speakers (57.6%), 15,279 audio files (58.6%)

  • Male: 25 speakers (42.4%), 10,812 audio files (41.4%)

Age Distribution

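  • 19–29: 28 speakers (47.5%), 13,332 audio files (51.1%)

  • 30–39: 17 speakers (28.8%), 6,469 audio files (24.8%)

  • 40–49: 10 speakers (16.9%), 4,781 audio files (18.3%)

  • 50–59: 4 speakers (6.8%), 1,509 audio files (5.8%)

  • 60+: 0 speakers (0.0%), 0 audio files (0.0%)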

Dialect Distribution

  • Milambo: 36 speakers (61.0%), 16,697 audio files (64.0%)

  • Nyanduat: 23 speakers (39.0%), 9,394 audio files (36.0%)
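
In the gender, age, and dialect distributions, the first count in each row sums to the 59 speakers and the second to the 26,091 audio files. The sketch below recomputes the percentage shares from those counts. The `shares` helper is illustrative and not part of any dataset tooling.

```python
def shares(counts):
    """Return each label's percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Counts copied from the distribution lists.
speakers_by_gender = {"Female": 34, "Male": 25}
clips_by_gender = {"Female": 15_279, "Male": 10_812}
speakers_by_dialect = {"Milambo": 36, "Nyanduat": 23}
clips_by_dialect = {"Milambo": 16_697, "Nyanduat": 9_394}
clips_by_age = {"19-29": 13_332, "30-39": 6_469, "40-49": 4_781, "50-59": 1_509}

# Every breakdown should cover all 59 speakers or all 26,091 files.
assert sum(speakers_by_gender.values()) == sum(speakers_by_dialect.values()) == 59
assert sum(clips_by_gender.values()) == sum(clips_by_age.values()) == 26_091

print(shares(speakers_by_gender))
print(shares(clips_by_dialect))
print(shares(clips_by_age))
```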

Contributor Metadata

Field | Description
--- | ---
contributor_id | Speaker code
Contributor name | Contributor’s full names
ID_Number | Unique contributor ID
gender | M / F
age_group | 18–29, 30–39, etc
location | Kisumu / Siaya etc
Constituency | Kisumu West etc
Education level | Tertiary / Graduate
Employment status | Student / Self-Employed / Employed
dialect | Milambo / Nyanduat
consent | Confirmed informed consent
License | NOODL Description

Data Structure

Field | Description
--- | ---
id | Unique file ID
speaker_id | Speaker code
gender | M / F
age_group | 18–29, 30–39, etc
county | Kisumu, etc
duration | Audio length in sec
prompt_id | ID of the sentence shown
domain | General / Agriculture / Technology and robotics / Healthcare / News and current affairs
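
The per-clip fields listed above are enough to build a speaker-disjoint train/test split and to total the audio per domain, both common first steps for ASR work. The following is a minimal sketch under stated assumptions. The column names follow the Data Structure fields, but the CSV layout, the sample rows, and the helper names are hypothetical, since the card does not specify the metadata file format.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows: the values and the CSV layout are invented for illustration.
SAMPLE_CSV = """id,speaker_id,gender,age_group,county,duration,prompt_id,domain
clip_0001,spk_01,F,30-39,Kisumu,6.4,p_100,Healthcare
clip_0002,spk_01,F,30-39,Kisumu,7.9,p_101,Agriculture
clip_0003,spk_02,M,40-49,Siaya,5.2,p_100,Healthcare
clip_0004,spk_03,F,40-49,Migori,8.1,p_102,General
"""


def load_records(text):
    """Parse the metadata rows and convert duration to a float (seconds)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["duration"] = float(row["duration"])
    return rows


def split_by_speaker(records, held_out_speakers):
    """Speaker-disjoint split: every clip from a held-out speaker goes to test."""
    train, test = [], []
    for rec in records:
        (test if rec["speaker_id"] in held_out_speakers else train).append(rec)
    return train, test


def seconds_per_domain(records):
    """Sum clip durations for each domain label."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["domain"]] += rec["duration"]
    return dict(totals)


records = load_records(SAMPLE_CSV)
train, test = split_by_speaker(records, held_out_speakers={"spk_02"})
print(len(train), len(test))
print(seconds_per_domain(records))
```

Splitting on `speaker_id` rather than on individual clips keeps every speaker's voice out of the training set when it appears in the test set, which matters here because only 59 speakers contributed all 26,091 clips.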

Licensing

This dataset is released under the Nwulite Obodo Open Data License (NOODL).

Info Sheet Purpose & Intended Audience

Preamble:
Much of African culture and heritage is rooted in orality. Thus the preservation of culture is predicated on the preservation, use, and mainstreaming of local languages. This is why licensing African language datasets is vital to ensure that data creation, sharing, and use uphold the rights and interests of local communities.

Well-defined licenses promote transparency and fairness by specifying how data can be accessed, reused, or commercialized. They safeguard cultural knowledge, linguistic heritage, and collective ownership, prevent exploitation by external actors, and ensure that benefits are shared equitably.

The Dholuo speech dataset was created by participants from the Dholuo language community in Kenya, supported by researchers from Maseno Centre for Applied Artificial Intelligence (MCAAI), Maseno University, and researchers from the Center for Intellectual Property and Information Technology Law (CIPIT), Strathmore University.

Through a consultative process supported by CIPIT and MCAAI, community representatives helped determine the most suitable licensing framework and deepened their understanding of intellectual property principles and the differences between licensing options. The representatives were also involved in creating this dataset between October and November 2025, with support from the Mozilla Foundation and GIZ FAIR Forward.

We extend our sincere appreciation to the dedicated Dholuo community, the meticulous data collectors, and the visionary license creators who made this pathway possible.

License Terms

The DhoNam: Dholuo Speech dataset, which covers adult speakers from the Kenyan Dholuo community, is licensed under the Nwulite Obodo Open Data License (NOODL). The license text is available at https://licensingafricandatasets.com/nwulite-obodo-license.

Honoring the Dholuo Heritage Through Responsible Data Use:
When you use this dataset, you become part of a meaningful partnership with the Dholuo community members who contributed to creating this valuable resource. Acknowledge these dedicated contributors in your work, helping to celebrate and strengthen their rich cultural identity while ensuring that any applications of this data honor and preserve the Dholuo language and traditions for future generations.

Geographical Precincts:

  • Users from developing countries are invited to share materials created with this dataset under a license similar to NOODL.

  • Users from developed countries are encouraged to establish partnerships that bring tangible benefits to Dholuo community contributors:

    • Language Vitality & Learning: Support digital tools, apps, and resources to learn, practice, and use Dholuo.

    • Technology & Translation Tools: Share translation software, AI applications, and language processing tools.

    • Economic Empowerment: Support community-led projects that create jobs and business opportunities.

    • Educational Access & Growth: Provide educational materials, scholarships, and digital resources for literacy and development.

These reciprocal partnerships ensure that technological advancement serves innovation while giving back to the contributors who made this dataset possible.

Intended Use

  • Training ASR models for Dholuo

  • STT research

  • TTS training

  • Linguistic documentation and analysis

⚠️ Limitations

  • No child speech included

📬 Contact

  • Project: Dholuo Voice Data Collection Project

  • Year: 2025

  • Organization: Maseno Centre for Applied Artificial Intelligence (MCAAI)