ddd-kenya-luhya-70hrs-asr

License:

CC-BY-4.0

Steward:

Digital Divide Data

Task: ASR

Release Date: 1/16/2026

Format: WAV, XLSX, TSV

Size: 13.90 GB

Description

A 70-hour subset of Luhya speech data collected by Digital Divide Data in Kenya. The dataset includes recorded sentences from native speakers and is intended to support research and development in Automatic Speech Recognition for low-resource African languages.

Specifics

Licensing

Creative Commons Attribution 4.0 International (CC-BY-4.0)

https://spdx.org/licenses/CC-BY-4.0.html

Considerations

Restrictions/Special Constraints

For research and scientific use only. Redistribution or re-hosting of this dataset is not permitted without written permission from Digital Divide Data.

Forbidden Usage

You agree not to attempt to determine the identity of any speaker in the dataset. It is forbidden to use this dataset for voice cloning, biometric identification, surveillance, or the creation of synthetic voices mimicking speakers.

Processes

Ethical Review

All participants gave informed consent prior to recording. Speakers were briefed about the use of their audio for research and could withdraw at any time. No personal names or identifying information are included in the dataset.

Intended Use

This dataset is intended for developing and evaluating Automatic Speech Recognition (ASR) systems for under-resourced African languages, studying cross-dialect variation, training self-supervised audio models, and supporting linguistic research on Luhya dialects.

Metadata

TECHNICAL DATASHEET: LUHYA 70-HOUR SPEECH DATASET, DIGITAL DIVIDE DATA (KENYA)

CONSIDERATIONS

1.1 Forbidden usage

Do not attempt to identify or re-identify any speaker.
Do not link voices to real people or external datasets.
Do not use the dataset for surveillance or biometric identification.
Do not re-host, re-share, or redistribute the dataset without written permission.

1.2 Privacy and ethics

All speakers are represented only by pseudonymous IDs.
Demographic metadata is intended for aggregate analysis only.
Individual-level inference is strongly discouraged.
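
One way to keep demographic analysis at the aggregate level is to report only group counts and to fold very small groups together. The sketch below is illustrative only: the record fields (`speaker_id`, `dialect`) and the minimum group size of 5 are assumptions, not part of the dataset specification or of any policy set by Digital Divide Data.

```python
from collections import Counter


def aggregate_speakers(records, field, min_group_size=5):
    """Count distinct speakers per value of `field`, folding small groups together.

    `records` is an iterable of dicts; the keys "speaker_id" and `field`
    are hypothetical column names chosen for this example.
    """
    seen = set()
    counts = Counter()
    for rec in records:
        key = (rec["speaker_id"], rec[field])
        if key not in seen:  # count each speaker once per group
            seen.add(key)
            counts[rec[field]] += 1

    report = {}
    suppressed = 0
    for value, n in counts.items():
        if n >= min_group_size:
            report[value] = n
        else:
            # Groups below the (assumed) threshold are merged so that
            # no small group of speakers is singled out.
            suppressed += n
    if suppressed:
        report["(suppressed)"] = suppressed
    return report
```

The threshold is a design choice; a stricter value hides more detail but lowers the chance that a group maps back to a handful of speakers.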

WHAT (DATASET OVERVIEW)

2.1 Dataset summary

A curated collection of read speech recordings from native Luhya speakers.
Collected in Kenya through controlled voice collection sessions.
Designed for low-resource language technology development.

2.2 Primary intended task

Automatic Speech Recognition (ASR) training and evaluation.

2.3 Secondary possible uses

Dialect-aware ASR experiments
Dialect classification (where labels are reliable)
Low-resource transfer learning
Educational and linguistic research (aggregate-level only)

LANGUAGE

3.1 Language name

Luhya (Bantu language cluster)

3.2 Variants / dialects covered

Bukusu
Banyala
Batsotso
Kisa
Wanga
Kabarasi
Samia

3.3 Geographic distribution

Western Kenya (Bungoma, Kakamega, Busia, Vihiga, Trans-Nzoia)
Eastern Uganda (minority communities)

WHO (DATA CREATION)

Native Luhya speakers recruited locally in Kenya.
Sentence prompts prepared by linguistically competent contributors.
Manual transcription and verification by trained contributors.

WHERE

Controlled collection sessions in Kenya.

WHEN

September 2025.

SOURCES

Generated specifically for this project.
Read speech audio paired with verified transcripts.
WAV audio format.

DOMAINS

General / Culture
Agriculture
Health

SIZE

Approximately 70 hours of transcribed speech.

STRUCTURE

Index_file_70h.xlsx
vc01/, vc02/, vc03/, vc04/ folders
Speaker-level subfolders with WAV files.
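
The layout above can be traversed with the Python standard library to check how many hours each speaker folder contributes. This is a minimal sketch, assuming the files sit at `<root>/vcNN/<speaker_folder>/*.wav` and are uncompressed PCM WAV (the only kind the `wave` module reads); the function names and the root path are hypothetical.

```python
import wave
from collections import defaultdict
from pathlib import Path


def wav_seconds(path: Path) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def hours_per_speaker(root: Path) -> dict[str, float]:
    """Sum audio hours per speaker folder under the vc*/ collection folders.

    Assumes the layout <root>/vcNN/<speaker_folder>/*.wav. Keys combine the
    collection folder and speaker folder names, e.g. "vc01/<speaker_folder>".
    """
    totals: dict[str, float] = defaultdict(float)
    for wav_path in sorted(root.glob("vc*/*/*.wav")):
        speaker_key = f"{wav_path.parts[-3]}/{wav_path.parts[-2]}"
        totals[speaker_key] += wav_seconds(wav_path) / 3600.0
    return dict(totals)
```

Calling `sum(hours_per_speaker(Path("path/to/dataset")).values())` on a local copy gives a total that can be compared against the stated size of approximately 70 hours.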

WRITING SYSTEM

Latin alphabet
Five vowels: a, e, i, o, u
Phonemic spelling, no tone marking.
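
Because the orthography is described as Latin script without tone marking, transcripts can be screened for characters that do not fit that description. The sketch below is an assumption-laden check, not a rule from the datasheet: it flags combining diacritics and any letter outside basic a-z, and the function name is hypothetical. Flagged items may still be legitimate (for example loanwords or names), so they are best reviewed rather than removed automatically.

```python
import unicodedata


def orthography_issues(transcript: str) -> list[str]:
    """List characters that look inconsistent with an unaccented Latin orthography."""
    issues = []
    # NFD splits accented letters into base letter + combining mark,
    # so tone-style diacritics appear as Unicode category "Mn".
    for ch in unicodedata.normalize("NFD", transcript):
        if unicodedata.category(ch) == "Mn":
            issues.append(f"combining mark U+{ord(ch):04X}")
        elif ch.isalpha() and not ("a" <= ch.lower() <= "z"):
            # Assumption: only basic Latin letters a-z are expected.
            issues.append(f"non-basic-Latin letter {ch!r}")
    return issues
```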

INTENDED USE

ASR training and evaluation
Research and educational use only.

CITATION

Digital Divide Data (2025). Luhya 70-Hour Speech Dataset (Kenya).