Khmer ASR Cultural Dataset (V2)

License:

CC-BY-SA-4.0

Steward:

Digital Divide Data

Task: ASR

Release Date: 2/5/2026

Format: WAV

Size: 35.86 GB

Description

106.53 hours of speech-text pairs in Khmer, manually curated by native speakers, covering Cambodian cultural topics. On average, each recording is 8.42 seconds long, with a standard deviation of 3.39 seconds. Speaker metadata (gender, age group, and origin city) is provided.

- Language: Khmer (khm).
- Source(s): Native speakers from Cambodia (4 females, 4 males). The utterances were manually generated based on topics and subtopics listed in metadata.
- Domain(s): Cultural domain, with a total of 2 topics and 118 subtopics.
- Size: 45.57k data instances
- WAV file names are formatted as: `{speaker_id}_khm_{sentence_id}.wav`.
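As a quick sanity check, the total duration and instance count quoted in the description can be used to recompute the average recording length. The sketch below assumes that "45.57k" means roughly 45,570 instances; the exact count is not given in this card.

```python
# Figures quoted in the description.
total_hours = 106.53
num_instances = 45_570  # assumption: "45.57k" is a rounded count

total_seconds = total_hours * 3600
mean_duration = total_seconds / num_instances

print(f"total audio: {total_seconds:,.0f} s")
print(f"implied mean recording length: {mean_duration:.2f} s")
```

The implied mean agrees with the stated 8.42-second average after rounding.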

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Considerations

Restrictions/Special Constraints

Please attribute Digital Divide Data if you use this dataset in any way.

Forbidden Usage

You agree not to attempt to determine the identity of speakers in this dataset.

Metadata

Khmer ASR Cultural Dataset

Language: Khmer (khm).
Source(s): Native speakers from Cambodia (4 females, 4 males). The utterances were manually generated based on topics and subtopics listed in metadata.
Domain(s): Cultural domain, with a total of 2 topics and 118 subtopics.
Size: 45.57k data instances
WAV file names are formatted as: {speaker_id}_khm_{sentence_id}.wav.

Sample

The first row of our metadata.csv:

Topic	Subtopic	Speaker ID	Paragraph ID	Sentence ID	Sentences
Recipes	Street food dishes	f-adt1-0001	1	recipes_01_0001_0001	មុខម្ហូបតាមដងផ្លូវ គឺជាមុខម្ហូបមួយមានភាពសម្បូរបែប និងមានភាពងាយស្រួល ដែលគេពេញនិយមក្នុងការបរិភោគ ថែមទាំងមានតម្លៃសមរម្យ។
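The sample row and the file naming pattern can be combined to locate the audio for a given sentence. The following is a minimal sketch, not an official loader: it assumes that metadata.csv is tab-separated with the column names shown above (the real file may use a different delimiter), and the helper names `wav_name` and `parse_wav_name` are hypothetical.

```python
import csv
import io

# Header and sample row from this card, joined with tabs.
# Assumption: the real metadata.csv may use a different delimiter.
# The sentence text is shortened here for readability.
SAMPLE = (
    "Topic\tSubtopic\tSpeaker ID\tParagraph ID\tSentence ID\tSentences\n"
    "Recipes\tStreet food dishes\tf-adt1-0001\t1\t"
    "recipes_01_0001_0001\tមុខម្ហូបតាមដងផ្លូវ\n"
)


def wav_name(row: dict) -> str:
    """Build the expected WAV file name for one metadata row."""
    return f"{row['Speaker ID']}_khm_{row['Sentence ID']}.wav"


def parse_wav_name(name: str) -> tuple[str, str]:
    """Split a WAV file name back into (speaker_id, sentence_id)."""
    stem = name.removesuffix(".wav")
    speaker_id, sep, sentence_id = stem.partition("_khm_")
    if not sep:
        raise ValueError(f"unexpected file name: {name!r}")
    return speaker_id, sentence_id


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
name = wav_name(rows[0])
print(name)
print(parse_wav_name(name))
```

Splitting on the literal `_khm_` marker rather than on every underscore keeps sentence IDs such as `recipes_01_0001_0001` intact.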

Khmer ASR Cultural Dataset is also available on Hugging Face.

Use cases

Automatic speech recognition (ASR)

Off-the-shelf, state-of-the-art multilingual pre-trained ASR models (e.g., OpenAI's Whisper) do not transcribe Khmer well. Even after further fine-tuning, the error rate for Khmer (lower is better; 0% means a perfect transcription) remains too high to be usable (Lovenia, 2025). See the khm column in Figure 3 below.

[Figure 3: ASR error rates by language; Khmer is the khm column]

Training a good automatic speech recognition (ASR) model for Khmer requires a large number of Khmer speech-text pairs. However, before Khmer ASR Cultural Dataset became available, there was only one Khmer speech-text dataset: OpenSLR 42, with 3.97 hours of speech-text pairs (male speakers only).

Our preliminary experiment shows that adding just 650 speech-text pairs from DDD's dataset to the training data lowers the Whisper models' character error rate (CER) by around 0.46%-0.74% compared with training on OpenSLR 42 alone. With this addition, Whisper Large V2's CER on Khmer drops to 8.11%. As DDD collects more speech-text pairs, we expect ASR models to transcribe Khmer audio with even fewer errors.
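For readers unfamiliar with the metric, CER is commonly defined as the character-level edit distance between the reference transcript and the model output, divided by the reference length. The sketch below illustrates that standard definition only; it is not the evaluation script behind the figures above, and it assumes characters are counted as Unicode code points with no text normalization, which may differ from the setup used in the experiment.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of a hypothesis against a reference."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("kitten", "sitting"))  # 0.5 (3 edits over 6 reference characters)
```

Because Khmer script uses combining vowel signs and subscript consonants, counting code points is only one possible choice; a grapheme-cluster count would give different values.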

Other potential use cases

Khmer ASR Cultural Dataset can also be used to train models on Khmer text-to-speech (TTS), language modeling, topic modeling, and next sentence prediction.

Attribution

Khmer ASR Cultural Dataset's license is Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0). Please attribute Digital Divide Data if you use this dataset in any way.