ddd-kenya-somali-68hrs-asr-part3

License: CC-BY-4.0

Steward: Digital Divide Data

Task: ASR

Release Date: 3/12/2026

Format: WAV, TSV

Size: 1.33 GB

Description

This dataset, curated by Digital Divide Data (DDD), provides high-quality audio recordings and corresponding text transcriptions for the Somali (som) language. The collection includes thousands of unique utterances to support diverse acoustic modeling. All transcriptions have been manually verified for linguistic accuracy. Recordings feature a balanced mix of genders and a range of age groups to reduce bias in downstream AI models. The data is designed for training Automatic Speech Recognition (ASR) systems, Text-to-Speech (TTS) synthesis, and general linguistic research on underrepresented African languages.

Specifics

Licensing

Creative Commons Attribution 4.0 International (CC-BY-4.0)

https://spdx.org/licenses/CC-BY-4.0.html

Considerations

Restrictions/Special Constraints

For research and scientific use only. Redistribution or re-hosting of this dataset is not permitted without written permission from Digital Divide Data.

Forbidden Usage

You agree not to attempt to determine the identity of any speaker in the dataset. It is forbidden to use this dataset for voice cloning, biometric identification, surveillance, or the creation of synthetic voices mimicking speakers.

Processes

Ethical Review

All participants gave informed consent prior to recording. Speakers were briefed about the use of their audio for research and could withdraw at any time. No personal names or identifying information are included in the dataset.

Intended Use

This dataset is intended for developing and evaluating Automatic Speech Recognition (ASR) systems for under-resourced African languages, studying cross-dialect variation, training self-supervised audio models, and supporting linguistic research on Somali dialects.

Metadata

TECHNICAL DATASHEET: SOMALI 68-HOUR SPEECH DATASET, DIGITAL DIVIDE DATA (KENYA)

CONSIDERATIONS

1.1 Forbidden usage

Do not attempt to identify or re-identify any speaker. Do not link voices to real people or external datasets. Do not use the dataset for surveillance or biometric identification. Do not re-host, re-share, or redistribute the dataset without written permission.

1.2 Privacy and ethics

All speakers are represented only by pseudonymous IDs. Demographic metadata is intended for aggregate analysis only. Individual-level inference is strictly discouraged.
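As an illustration of what aggregate-only use of the demographic metadata could look like, the sketch below counts distinct pseudonymous speakers per demographic value and folds small groups into one bucket. The field names `speaker_id` and `gender`, the `min_group_size` threshold of 5, and the `(suppressed)` label are hypothetical choices made for this example; the datasheet does not specify the metadata columns or any suppression rule.

```python
from collections import Counter


def aggregate_counts(rows, field, min_group_size=5):
    """Count distinct speakers per value of `field`, merging small groups.

    `rows` is an iterable of dicts with a "speaker_id" key (hypothetical name)
    and the demographic `field` being summarized.
    """
    # Keep one entry per pseudonymous speaker so that speakers with many
    # utterances do not inflate the counts.
    per_speaker = {row["speaker_id"]: row[field] for row in rows}
    counts = Counter(per_speaker.values())

    result = {}
    suppressed = 0
    for value, n in counts.items():
        if n >= min_group_size:
            result[value] = n
        else:
            # Groups below the threshold are pooled rather than reported alone.
            suppressed += n
    if suppressed:
        result["(suppressed)"] = suppressed
    return result
```

Calling `aggregate_counts(rows, "gender")` returns group-level totals only; no per-speaker values leave the function. The pooled bucket can itself be small, so a stricter policy may be appropriate for sensitive fields.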

WHAT (DATASET OVERVIEW)

2.1 Dataset summary

A curated collection of read speech recordings from native Somali speakers. Collected in Kenya through controlled voice collection sessions. Designed for low-resource language technology development.

2.2 Primary intended task

Automatic Speech Recognition (ASR) training and evaluation.

2.3 Secondary possible uses

Dialect-aware ASR experiments
Dialect classification (where labels are reliable)
Low-resource transfer learning
Educational and linguistic research (aggregate-level only)

LANGUAGE

3.1 Language name Somali

3.2 Variants/dialects covered Maxaa tiri

3.3 Geographic distribution Northeastern Kenya

WHO (DATA CREATION)

Native Somali speakers were recruited locally in Kenya. Sentence prompts were prepared by linguistically competent contributors. Transcription and verification were done manually by trained contributors.

WHERE

Controlled collection sessions in Kenya.

WHEN

March 2026.

SOURCES

Generated specifically for this project. Read speech audio paired with verified transcripts. WAV audio format.

DOMAINS

General / Culture

SIZE

Approximately 68 hours of transcribed speech.

STRUCTURE

Index_file_68h.xlsx
Speaker-level subfolders with WAV files.

WRITING SYSTEM

Latin alphabet. Five vowels: a, e, i, o, u. Phonemic spelling, no tone marking.

INTENDED USE

ASR training and evaluation. Research and educational use only.

CITATION

Digital Divide Data (2026). Somali 68-Hour Speech Dataset (Kenya).
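
The layout listed under STRUCTURE (speaker-level subfolders containing WAV files) lends itself to a quick duration check against the stated size. The following is a minimal sketch, assuming that each immediate subfolder of the dataset root corresponds to one speaker, that the files are uncompressed PCM WAV readable by Python's standard `wave` module, and that they use a lowercase `.wav` extension. None of these details are confirmed by the datasheet, and the function names are made up for this example.

```python
import wave
from pathlib import Path


def wav_seconds(path: Path) -> float:
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def hours_per_speaker(root: Path) -> dict[str, float]:
    """Sum WAV durations, in hours, for each speaker subfolder under root."""
    totals: dict[str, float] = {}
    # Assumption: every immediate subdirectory of root is one speaker folder.
    for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        seconds = sum(wav_seconds(f) for f in sorted(speaker_dir.rglob("*.wav")))
        totals[speaker_dir.name] = seconds / 3600
    return totals
```

For example, `sum(hours_per_speaker(Path("path/to/dataset")).values())` gives the total hours found under a local copy, where the path is a placeholder. If the release is split into parts, as the dataset name suggests, a single part may hold only a share of the roughly 68 hours.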