Khmer ASR Cultural Dataset
License: CC-BY-SA-4.0
Steward: Digital Divide Data
Task: ASR
Release Date: 1/13/2026
Format: WAV
Size: 12.59 GB
Description
37.62 hours of manually curated speech-text pairs, recorded in Khmer by native speakers, on Cambodian cultural topics. Recordings average 8.29 seconds in length, with a standard deviation of 3.87 seconds. Speaker metadata (gender, age group, and city of origin) is provided.
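The reported average length can be cross-checked against the total duration and the instance count listed under Metadata (16,339). The snippet below is a minimal sketch of that arithmetic; it assumes the 37.62 hours covers exactly those 16,339 recordings, and the variable names are illustrative.

```python
# Figures taken from this dataset card.
total_hours = 37.62      # total audio duration
num_instances = 16_339   # number of data instances (recordings)

# Assumption: the total duration spans exactly the listed instances.
mean_seconds = total_hours * 3600 / num_instances

print(f"mean recording length: {mean_seconds:.2f} s")
```

The result agrees with the stated 8.29-second average to two decimal places.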
Specifics
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Considerations
Restrictions/Special Constraints
Please attribute Digital Divide Data if you use this dataset in any way.
Forbidden Usage
You agree not to attempt to determine the identity of speakers in this dataset.
Metadata
Language: Khmer (khm).
Source(s): Native speakers from Cambodia (4 female, 4 male). The utterances were manually generated based on the topics and subtopics listed in the metadata.
Domain(s): Cultural domain, with a total of 2 topics and 118 subtopics.
Size: 16,339 data instances
Structure:
metadata.csv
speaker_metadata.csv
data/
  giving_gift/*.wav
  recipes/*.wav
WAV file names are formatted as: {speaker_id}_khm_{sentence_id}.wav.
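The naming pattern above can be split back into its two IDs. The following is a minimal sketch, not an official loader: `parse_wav_name` is a hypothetical helper, and the example file name is constructed from the sample row on this card rather than taken from the actual archive. Because sentence IDs themselves contain underscores, the split is done on the `_khm_` marker rather than on every underscore.

```python
from pathlib import Path


def parse_wav_name(path):
    """Split '{speaker_id}_khm_{sentence_id}.wav' into (speaker_id, sentence_id).

    Hypothetical helper; assumes '_khm_' first occurs at the boundary
    between the speaker ID and the sentence ID.
    """
    stem = Path(path).stem  # drop directories and the .wav extension
    speaker_id, sep, sentence_id = stem.partition("_khm_")
    if not sep or not speaker_id or not sentence_id:
        raise ValueError(f"unexpected file name: {path!r}")
    return speaker_id, sentence_id


# Example name assembled from the sample row (an assumption, not a verified file).
example = "data/recipes/f-adt1-0001_khm_recipes_01_0001_0001.wav"
speaker, sentence = parse_wav_name(example)
print(speaker, sentence)
```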
Sample:
| Topic | Subtopic | Speaker ID | Paragraph ID | Sentence ID | Sentences |
|---|---|---|---|---|---|
| Recipes | Street food dishes | f-adt1-0001 | 1 | recipes_01_0001_0001 | មុខម្ហូបតាមដងផ្លូវ គឺជាមុខម្ហូបមួយមានភាពសម្បូរបែប និងមានភាពងាយស្រួល ដែលគេពេញនិយមក្នុងការបរិភោគ ថែមទាំងមានតម្លៃសមរម្យ។ |
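A metadata row like the one above can be paired with its audio file by rebuilding the file name from the speaker and sentence IDs. This is a sketch under stated assumptions: the column headers in `metadata.csv` are assumed to match the sample table ("Speaker ID", "Sentence ID", "Sentences"), which may not hold for the real file, and `wav_names_from_metadata` is a hypothetical helper.

```python
import csv
import io


def wav_names_from_metadata(csv_file):
    """Yield (wav_file_name, sentence_text) for each metadata row.

    Assumption: column names match the sample table on this card.
    """
    for row in csv.DictReader(csv_file):
        name = f"{row['Speaker ID']}_khm_{row['Sentence ID']}.wav"
        yield name, row["Sentences"]


# In-memory stand-in for metadata.csv, based on the sample row (text shortened).
sample_csv = io.StringIO(
    "Topic,Subtopic,Speaker ID,Paragraph ID,Sentence ID,Sentences\n"
    "Recipes,Street food dishes,f-adt1-0001,1,recipes_01_0001_0001,មុខម្ហូបតាមដងផ្លូវ\n"
)
pairs = list(wav_names_from_metadata(sample_csv))
print(pairs[0][0])
```

With the real file, the same function could be given a handle opened with `open("metadata.csv", encoding="utf-8", newline="")`, provided the headers match.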
This dataset is also available on https://huggingface.co/datasets/DDD-Cambodia/khm-asr-cultural.