Luhya ASR data subset 70 hours
License:
CC-BY-4.0
Steward:
Digital Divide Data
Task: ASR
Release Date: 1/16/2026
Format: WAV, XLSX
Size: 13.90 GB
Description
A 70-hour subset of Luhya speech data collected by Digital Divide Data in Kenya. The dataset includes recorded sentences from native speakers and is intended to support research and development in Automatic Speech Recognition for low-resource African languages.
Specifics
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Considerations
Restrictions/Special Constraints
For research and scientific use only. Redistribution or re-hosting of this dataset is not permitted without written permission from Digital Divide Data.
Forbidden Usage
You agree not to attempt to determine the identity of any speaker in the dataset. It is forbidden to use this dataset for voice cloning, biometric identification, surveillance, or the creation of synthetic voices mimicking speakers.
Processes
Ethical Review
All participants gave informed consent prior to recording. Speakers were briefed about the use of their audio for research and could withdraw at any time. No personal names or identifying information are included in the dataset.
Intended Use
This dataset is intended for developing and evaluating Automatic Speech Recognition (ASR) systems for under-resourced African languages, studying cross-dialect variation, training self-supervised audio models, and supporting linguistic research on Luhya dialects.
Metadata
TECHNICAL DATASHEET
LUHYA 70-HOUR SPEECH DATASET
DIGITAL DIVIDE DATA (KENYA)
CONSIDERATIONS
1.1 Forbidden usage
Do not attempt to identify or re-identify any speaker.
Do not link voices to real people or external datasets.
Do not use the dataset for surveillance or biometric identification.
Do not re-host, re-share, or redistribute the dataset without written permission.
1.2 Privacy and ethics
All speakers are represented only by pseudonymous IDs.
Demographic metadata is intended for aggregate analysis only.
Individual-level inference is strictly discouraged.
WHAT (DATASET OVERVIEW)
2.1 Dataset summary
A curated collection of read speech recordings from native Luhya speakers.
Collected in Kenya through controlled voice collection sessions.
Designed for low-resource language technology development.
2.2 Primary intended task
Automatic Speech Recognition (ASR) training and evaluation.
2.3 Secondary possible uses
Dialect-aware ASR experiments
Dialect classification (where labels are reliable)
Low-resource transfer learning
Educational and linguistic research (aggregate-level only)
LANGUAGE
3.1 Language name
Luhya (Bantu language cluster)
3.2 Variants / dialects covered
Bukusu
Banyala
Batsotso
Kisa
Wanga
Kabarasi
Samia
3.3 Geographic distribution
Western Kenya (Bungoma, Kakamega, Busia, Vihiga, Trans-Nzoia)
Eastern Uganda (minority communities)
WHO (DATA CREATION)
Native Luhya speakers recruited locally in Kenya.
Sentence prompts prepared by linguistically competent contributors.
Manual transcription and verification by trained contributors.
WHERE
Controlled collection sessions in Kenya.
WHEN
September 2025.
SOURCES
Generated specifically for this project.
Read speech audio paired with verified transcripts.
WAV audio format.
DOMAINS
General / Culture
Agriculture
Health
SIZE
Approximately 70 hours of transcribed speech.
STRUCTURE
Index_file_70h.xlsx
vc01/, vc02/, vc03/, vc04/ folders
Speaker-level subfolders with WAV files.
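As a rough illustration of how the layout above might be traversed, the Python sketch below walks the vcNN/ folders and sums WAV durations per speaker. It assumes exactly one level of speaker subfolders under each vcNN/ folder, a lowercase .wav extension, and plain PCM files readable by the standard-library wave module. None of these details are confirmed by the datasheet, and the function names are hypothetical.

```python
import wave
from pathlib import Path


def iter_wavs(root):
    """Yield (collection, speaker, path) for each WAV file.

    Assumed layout (not confirmed by the datasheet):
    <root>/vcNN/<speaker_id>/<file>.wav
    """
    for wav_path in sorted(Path(root).glob("vc*/*/*.wav")):
        speaker_dir = wav_path.parent
        yield speaker_dir.parent.name, speaker_dir.name, wav_path


def wav_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def hours_per_speaker(root):
    """Sum audio hours per (collection, speaker) pair.

    Keys include the collection folder because the datasheet does not say
    whether speaker IDs are unique across vc01/ to vc04/.
    """
    totals = {}
    for collection, speaker, path in iter_wavs(root):
        key = (collection, speaker)
        totals[key] = totals.get(key, 0.0) + wav_seconds(path) / 3600
    return totals
```

Summing the values returned by hours_per_speaker gives a quick check against the stated total of approximately 70 hours. Per-speaker totals of this kind should be used for aggregate checks only, in line with the privacy terms above.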
WRITING SYSTEM
Latin alphabet
Five vowels: a, e, i, o, u
Phonemic spelling, no tone marking.
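Because the transcripts are described as Latin-alphabet text without tone marking, a simple character check can flag lines that deviate from that convention. The Python sketch below is one possible check, not a documented validation step. The allowed punctuation set is an assumption, and the sample strings in the usage note are made up for illustration rather than taken from the dataset.

```python
import unicodedata

# Assumed set of non-letter characters that may appear in transcripts;
# the datasheet does not specify punctuation conventions.
ALLOWED_PUNCT = set(" '-.,?!")


def unexpected_chars(text):
    """Return sorted characters outside a-z and the allowed punctuation.

    Text is lowercased and NFD-normalized first, so an accented vowel is
    split into a base letter plus a combining mark, and only the mark is
    reported.
    """
    normalized = unicodedata.normalize("NFD", text.lower())
    bad = set()
    for ch in normalized:
        if "a" <= ch <= "z" or ch in ALLOWED_PUNCT:
            continue
        bad.add(ch)
    return sorted(bad)
```

For example, unexpected_chars("abe 3") reports the digit, and a string containing an acute-accented vowel reports the combining accent U+0301, which would suggest tone or stress marking that the datasheet says is not used.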
INTENDED USE
ASR training and evaluation
Research and educational use only.
CITATION
Digital Divide Data (2025). Luhya 70-Hour Speech Dataset (Kenya).