Khmer ASR Cultural Dataset (V2)
License:
CC-BY-SA-4.0
Steward:
Digital Divide Data
Task: ASR
Release Date: 2/5/2026
Format: WAV
Size: 35.86 GB
Description
106.53 hours of manually curated speech-text pairs in Khmer, from native speakers, about Cambodian cultural topics. On average, each recording is 8.42 seconds long, with a standard deviation of 3.39 seconds. Speaker metadata (gender, age group, and origin city) is provided.
- Language: Khmer (khm).
- Source(s): Native speakers from Cambodia (4 female, 4 male). The utterances were manually generated based on the topics and subtopics listed in the metadata.
- Domain(s): Cultural domain, with a total of 2 topics and 118 subtopics.
- Size: 45.57k data instances.
- WAV file names are formatted as `{speaker_id}_khm_{sentence_id}.wav`.
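The file-name pattern above can be split back into its two IDs. The following is a minimal sketch, not part of the dataset's tooling: the helper `parse_wav_name` and the `RecordingName` type are hypothetical names, and it assumes that `_khm_` appears exactly once in each file name as the separator. The example name is built from the IDs in the sample metadata row.

```python
from pathlib import Path
from typing import NamedTuple


class RecordingName(NamedTuple):
    """The two IDs encoded in a WAV file name (hypothetical helper type)."""
    speaker_id: str
    sentence_id: str


def parse_wav_name(path: str) -> RecordingName:
    """Split '{speaker_id}_khm_{sentence_id}.wav' into its two IDs."""
    p = Path(path)
    if p.suffix.lower() != ".wav":
        raise ValueError(f"not a WAV file: {path!r}")
    # Assumption: '_khm_' is the separator and occurs only once.
    # Speaker IDs may contain hyphens and sentence IDs may contain
    # underscores, so splitting on a bare '_' would not be safe.
    speaker_id, sep, sentence_id = p.stem.partition("_khm_")
    if not sep or not speaker_id or not sentence_id:
        raise ValueError(f"unexpected file name: {path!r}")
    return RecordingName(speaker_id, sentence_id)


name = parse_wav_name("f-adt1-0001_khm_recipes_01_0001_0001.wav")
print(name.speaker_id, name.sentence_id)
```

Splitting on the full `_khm_` token rather than on single underscores keeps IDs such as `recipes_01_0001_0001` intact.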
Specifics
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Considerations
Restrictions/Special Constraints
Please attribute Digital Divide Data if you use this dataset in any way.
Forbidden Usage
You agree not to attempt to determine the identity of speakers in this dataset.
Metadata
Sample
The first row of our metadata.csv:
| Topic | Subtopic | Speaker ID | Paragraph ID | Sentence ID | Sentences |
|---|---|---|---|---|---|
| Recipes | Street food dishes | f-adt1-0001 | 1 | recipes_01_0001_0001 | មុខម្ហូបតាមដងផ្លូវ គឺជាមុខម្ហូបមួយមានភាពសម្បូរបែប និងមានភាពងាយស្រួល ដែលគេពេញនិយមក្នុងការបរិភោគ ថែមទាំងមានតម្លៃសមរម្យ។ |
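To pair each transcript with its audio file, the `Speaker ID` and `Sentence ID` columns can be combined using the `{speaker_id}_khm_{sentence_id}.wav` pattern. The sketch below is illustrative only: the function name `iter_pairs` is hypothetical, and it assumes that the header row of `metadata.csv` uses exactly the column names shown in the table above and that the file is UTF-8 encoded. An in-memory stand-in with a placeholder sentence replaces the real file.

```python
import csv
import io
from typing import Iterator, TextIO


def iter_pairs(metadata: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (wav_file_name, sentence) for each row of the metadata CSV."""
    for row in csv.DictReader(metadata):
        # Assumption: column names match the sample table's headers.
        name = f"{row['Speaker ID']}_khm_{row['Sentence ID']}.wav"
        yield name, row["Sentences"]


# Stand-in for metadata.csv; the sentence is a placeholder, not real data.
sample = io.StringIO(
    "Topic,Subtopic,Speaker ID,Paragraph ID,Sentence ID,Sentences\n"
    "Recipes,Street food dishes,f-adt1-0001,1,recipes_01_0001_0001,<Khmer sentence>\n"
)
pairs = list(iter_pairs(sample))
print(pairs[0][0])
```

With the real file, the same function could be called on `open("metadata.csv", encoding="utf-8", newline="")`.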
The Khmer ASR Cultural Dataset is also available on Hugging Face.
Use cases
Automatic speech recognition (ASR)
Off-the-shelf, state-of-the-art multilingual pre-trained ASR models (e.g., OpenAI's Whisper) cannot transcribe Khmer well. Even with further fine-tuning, the error rate for Khmer ASR (lower is better; 0% means no errors) remains too high for practical use (Lovenia, 2025). See the khm column in Figure 3 below.
[Image: Figure 3]
Training a good automatic speech recognition (ASR) model for Khmer requires a large amount of Khmer speech-text pairs. However, before the Khmer ASR Cultural Dataset became available, there was only one Khmer speech-text dataset: OpenSLR 42, with 3.97 hours of speech-text pairs (male speakers only).
Our preliminary experiment shows that adding just 650 speech-text pairs from the Digital Divide Data (DDD) dataset to the training data lowers the Whisper models' character error rate (CER) by around 0.46% to 0.74% compared with training on OpenSLR 42 alone. With this addition, Whisper Large V2's CER on Khmer drops to 8.11%. With more speech-text pairs collected by DDD, ASR models should be able to transcribe Khmer audio with even fewer errors.
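For readers unfamiliar with the metric, CER is commonly defined as the character-level edit distance between the reference transcript and the model output, divided by the reference length. The sketch below implements that common definition. It is not necessarily how the figures above were computed: the source does not describe its text normalization or tooling, and the function names `edit_distance` and `cer` are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # Note: Python strings iterate over Unicode code points, so Khmer
    # vowel signs and other combining marks each count as one character.
    # Any normalization (e.g. whitespace handling) is left to the caller.
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution in a four-character reference gives a CER of 0.25.
print(cer("abcd", "abxd"))
```

Because the denominator is the reference length, a hypothesis with many insertions can yield a CER above 100%.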
Other potential use cases
The Khmer ASR Cultural Dataset can also be used to train models for Khmer text-to-speech (TTS), language modeling, topic modeling, and next-sentence prediction.
Attribution
Khmer ASR Cultural Dataset's license is Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0). Please attribute Digital Divide Data if you use this dataset in any way.