Khmer ASR Cultural Dataset

License: CC-BY-SA-4.0

Steward: Digital Divide Data

Task: ASR

Release Date: 1/13/2026

Format: WAV

Size: 12.59 GB


Description

37.62 hours of manually curated speech-text pairs in Khmer, produced by native speakers, on Cambodian cultural topics. Recordings average 8.29 seconds in length, with a standard deviation of 3.87 seconds. Speaker metadata (gender, age group, and origin city) is provided.

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Considerations

Restrictions/Special Constraints

Please attribute Digital Divide Data if you use this dataset in any way.

Forbidden Usage

You agree not to attempt to determine the identity of speakers in this dataset.

Metadata

Language: Khmer (khm).

Source(s): Native speakers from Cambodia (4 female, 4 male). The utterances were manually generated based on the topics and subtopics listed in the metadata.

Domain(s): Cultural domain, with a total of 2 topics and 118 subtopics.

Size: 16,339 data instances
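
As a rough consistency check, the instance count multiplied by the mean recording length quoted in the description should land close to the stated total duration. The sketch below uses only figures from this card; the variable names are illustrative, and a small gap is expected because the mean is rounded to two decimals.

```python
# Figures quoted in this dataset card.
num_instances = 16_339        # data instances
mean_duration_s = 8.29        # mean recording length, in seconds
stated_total_hours = 37.62    # total duration given in the description

# Estimated total duration: count x mean length, converted to hours.
estimated_hours = num_instances * mean_duration_s / 3600

# Difference between the estimate and the stated total, in hours.
gap_hours = abs(estimated_hours - stated_total_hours)
print(f"estimated: {estimated_hours:.3f} h, gap: {gap_hours:.3f} h")
```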

Structure:

metadata.csv
speaker_metadata.csv
data/
    giving_gift/*.wav
    recipes/*.wav
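
Given the layout above, a downloaded copy can be checked by counting the WAV files under each topic folder. This is a minimal sketch: the function name `count_wavs` and the `root` argument (the folder containing `data/`) are assumptions for illustration, not part of the dataset.

```python
from pathlib import Path


def count_wavs(root):
    """Return {topic_folder_name: number of .wav files} for root/data/*/."""
    data_dir = Path(root) / "data"
    counts = {}
    for topic_dir in sorted(data_dir.iterdir()):
        if topic_dir.is_dir():
            # Count only files directly inside the topic folder.
            counts[topic_dir.name] = sum(1 for _ in topic_dir.glob("*.wav"))
    return counts
```

Calling `sum(count_wavs(root).values())` on a complete copy should give a total that can be compared with the instance count stated under Size.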

WAV file names are formatted as: {speaker_id}_khm_{sentence_id}.wav.
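
The naming pattern can be split back into its two IDs by cutting the file stem at the `_khm_` separator. The sketch below assumes that separator appears exactly as written in the pattern and that speaker IDs do not themselves contain `_khm_`; the helper name `parse_wav_name` and the example file name (assembled from the IDs in the sample row) are hypothetical.

```python
from pathlib import Path
from typing import NamedTuple


class WavName(NamedTuple):
    speaker_id: str
    sentence_id: str


def parse_wav_name(path):
    """Split '{speaker_id}_khm_{sentence_id}.wav' into its two IDs."""
    p = Path(path)
    if p.suffix.lower() != ".wav":
        raise ValueError(f"not a WAV file: {path}")
    # Cut at the first '_khm_' in the stem (assumed not to occur in speaker IDs).
    speaker_id, sep, sentence_id = p.stem.partition("_khm_")
    if not sep or not speaker_id or not sentence_id:
        raise ValueError(f"unexpected file name: {path}")
    return WavName(speaker_id, sentence_id)


# Hypothetical file name built from the sample row's IDs.
example = parse_wav_name("data/recipes/f-adt1-0001_khm_recipes_01_0001_0001.wav")
print(example.speaker_id, example.sentence_id)
```

The sentence ID recovered this way can then be used to look up the matching transcript in metadata.csv.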

Sample:

Topic: Recipes
Subtopic: Street food dishes
Speaker ID: f-adt1-0001
Paragraph ID: 1
Sentence ID: recipes_01_0001_0001
Sentences: មុខម្ហូបតាមដងផ្លូវ គឺជាមុខម្ហូបមួយមានភាពសម្បូរបែប និងមានភាពងាយស្រួល ដែលគេពេញនិយមក្នុងការបរិភោគ ថែមទាំងមានតម្លៃសមរម្យ។

This dataset is also available on https://huggingface.co/datasets/DDD-Cambodia/khm-asr-cultural.