Luhya ASR data subset 70 hours

License:

CC-BY-4.0

Steward:

Digital Divide Data

Task: ASR

Release Date: 1/16/2026

Format: WAV, XLSX

Size: 13.90 GB


Description

A 70-hour subset of Luhya speech data collected by Digital Divide Data in Kenya. The dataset includes recorded sentences from native speakers and is intended to support research and development in Automatic Speech Recognition for low-resource African languages.

Specifics

Licensing

Creative Commons Attribution 4.0 International (CC-BY-4.0)

https://spdx.org/licenses/CC-BY-4.0.html

Considerations

Restrictions/Special Constraints

For research and scientific use only. Redistribution or re-hosting of this dataset is not permitted without written permission from Digital Divide Data.

Forbidden Usage

You agree not to attempt to determine the identity of any speaker in the dataset. It is forbidden to use this dataset for voice cloning, biometric identification, surveillance, or the creation of synthetic voices mimicking speakers.

Processes

Ethical Review

All participants gave informed consent prior to recording. Speakers were briefed about the use of their audio for research and could withdraw at any time. No personal names or identifying information are included in the dataset.

Intended Use

This dataset is intended for developing and evaluating Automatic Speech Recognition (ASR) systems for under-resourced African languages, studying cross-dialect variation, training self-supervised audio models, and supporting linguistic research on Luhya dialects.

Metadata

TECHNICAL DATASHEET: LUHYA 70-HOUR SPEECH DATASET, DIGITAL DIVIDE DATA (KENYA)

  1. CONSIDERATIONS

1.1 Forbidden usage

  • Do not attempt to identify or re-identify any speaker.

  • Do not link voices to real people or external datasets.

  • Do not use the dataset for surveillance or biometric identification.

  • Do not re-host, re-share, or redistribute the dataset without written permission.

1.2 Privacy and ethics

  • All speakers are represented only by pseudonymous IDs.

  • Demographic metadata is intended for aggregate analysis only.

  • Individual-level inference is strictly discouraged.

  2. WHAT (DATASET OVERVIEW)

2.1 Dataset summary

  • A curated collection of read speech recordings from native Luhya speakers.

  • Collected in Kenya through controlled voice collection sessions.

  • Designed for low-resource language technology development.

2.2 Primary intended task

  • Automatic Speech Recognition (ASR) training and evaluation.

2.3 Secondary possible uses

  • Dialect-aware ASR experiments

  • Dialect classification (where labels are reliable)

  • Low-resource transfer learning

  • Educational and linguistic research (aggregate-level only)

  3. LANGUAGE

3.1 Language name

  • Luhya (Bantu language cluster)

3.2 Variants / dialects covered

  • Bukusu

  • Banyala

  • Batsotso

  • Kisa

  • Wanga

  • Kabarasi

  • Samia

3.3 Geographic distribution

  • Western Kenya (Bungoma, Kakamega, Busia, Vihiga, Trans-Nzoia)

  • Eastern Uganda (minority communities)

  4. WHO (DATA CREATION)

  • Native Luhya speakers recruited locally in Kenya.

  • Sentence prompts prepared by linguistically competent contributors.

  • Manual transcription and verification by trained contributors.

  5. WHERE

  • Controlled collection sessions in Kenya.

  6. WHEN

  • September 2025.

  7. SOURCES

  • Generated specifically for this project.

  • Read speech audio paired with verified transcripts.

  • WAV audio format.

  8. DOMAINS

  • General / Culture

  • Agriculture

  • Health

  9. SIZE

  • Approximately 70 hours of transcribed speech.
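
The stated duration can be checked against the audio itself. The following is a minimal sketch, assuming the WAV files are uncompressed PCM that Python's standard `wave` module can open; the function names `wav_duration_seconds` and `total_hours` are illustrative and not part of the dataset.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of one WAV file in seconds (frames / sample rate)."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def total_hours(paths):
    """Sum the durations of many WAV files and convert seconds to hours."""
    return sum(wav_duration_seconds(p) for p in paths) / 3600.0
```

Passing every WAV path in the release to `total_hours` should give a figure close to 70 if the assumption about the audio encoding holds.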

  10. STRUCTURE

  • Index_file_70h.xlsx

  • vc01/, vc02/, vc03/, vc04/ folders

  • Speaker-level subfolders with WAV files.
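
The folder layout above can be traversed with the standard library alone. This is a sketch under stated assumptions: the four batch folders sit directly under one root directory, each contains one subfolder per speaker, and WAV files sit directly inside the speaker subfolders. The function name `list_wavs_by_speaker` is hypothetical. The columns of Index_file_70h.xlsx are not described in this datasheet, so the sketch does not read that file.

```python
from pathlib import Path

# Batch folder names as listed in the datasheet.
BATCH_FOLDERS = ("vc01", "vc02", "vc03", "vc04")


def list_wavs_by_speaker(root):
    """Map '<batch>/<speaker folder>' to a sorted list of WAV paths.

    Assumed layout: <root>/<batch>/<speaker folder>/<file>.wav
    """
    root = Path(root)
    by_speaker = {}
    for batch in BATCH_FOLDERS:
        batch_dir = root / batch
        if not batch_dir.is_dir():
            continue  # tolerate partial downloads
        for speaker_dir in sorted(p for p in batch_dir.iterdir() if p.is_dir()):
            # Match the extension case-insensitively (.wav or .WAV).
            wavs = sorted(
                f for f in speaker_dir.iterdir()
                if f.is_file() and f.suffix.lower() == ".wav"
            )
            if wavs:
                by_speaker[f"{batch}/{speaker_dir.name}"] = wavs
    return by_speaker
```

Grouping by speaker folder makes it straightforward to build train and test splits that do not share speakers, using only the pseudonymous folder names.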

  11. WRITING SYSTEM

  • Latin alphabet

  • Five vowels: a, e, i, o, u

  • Phonemic spelling, no tone marking.
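
Because the orthography carries no tone marking, a transcript containing accent marks may indicate a transcription inconsistency. The sketch below is one possible check, assuming transcripts are available as plain Unicode strings; the helper names `has_diacritics` and `strip_diacritics` are hypothetical, and the test strings in the usage note are arbitrary letters rather than Luhya words.

```python
import unicodedata


def has_diacritics(text):
    """Return True if the text contains any combining mark (e.g. a tone accent)."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.category(ch) == "Mn" for ch in decomposed)


def strip_diacritics(text):
    """Remove combining marks and return the text in composed (NFC) form."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)
```

For example, `has_diacritics("aeiou")` is false, while `has_diacritics("áeiou")` is true and `strip_diacritics("áeiou")` returns `"aeiou"`. Whether to strip or merely flag such marks is a normalization choice left to the user.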

  12. INTENDED USE

  • ASR training and evaluation

  • Research and educational use only.

  13. CITATION

  • Digital Divide Data (2025). Luhya 70-Hour Speech Dataset (Kenya).