Datasets

Search results for “Thai”
Common Voice

Common Voice Scripted Speech 24.0 - Thai

A collection of scripted spoken phrases in Thai.

License: CC0-1.0

Locale: th

Task: ASR

Format: MP3

Size: 8.35 GB

Common Voice

Common Voice Spontaneous Speech 2.0 - Thai

A collection of spontaneous spoken phrases in Thai.

License: CC0-1.0

Locale: th

Task: ASR

Format: MP3

Size: 87.66 KB

Common Voice

Common Voice Scripted Speech 24.0 - Lao

A collection of scripted spoken phrases in Lao.

License: CC0-1.0

Locale: lo

Task: ASR

Format: MP3

Size: 8.91 MB

Common Voice

Common Voice Scripted Speech 24.0 - Tamil

A collection of scripted spoken phrases in Tamil.

License: CC0-1.0

Locale: ta

Task: ASR

Format: MP3

Size: 8.56 GB

Common Voice

Common Voice Scripted Speech 24.0 - Chinese (Taiwan)

A collection of scripted spoken phrases in Chinese (Taiwan).

License: CC0-1.0

Locale: zh-TW

Task: ASR

Format: MP3

Size: 2.93 GB

Open Home Foundation

Chitwan 1.0

Text-to-speech dataset for Nepali with a male speaker and approximately 1 hour of read speech.

License: CC0-1.0

Locale: ne-NE

Task: TTS

Format: WEBM

Size: 61.68 MB

Common Voice

Common Voice Scripted Speech 24.0 - Taiwanese (Minnan)

A collection of scripted spoken phrases in Taiwanese (Minnan).

License: CC0-1.0

Locale: nan-tw

Task: ASR

Format: MP3

Size: 462.92 MB

Digital Divide Data

Khmer ASR Cultural Dataset (V2)

106.53 hours of manually curated Khmer speech-text pairs by native speakers, covering Cambodian cultural topics. On average, each recording is 8.42 seconds long, with a standard deviation of 3.39 seconds. Speaker metadata (gender, age group, and origin city) is provided.

- Language: Khmer (khm).
- Source(s): Native speakers from Cambodia (4 females, 4 males). The utterances were manually generated based on the topics and subtopics listed in the metadata.
- Domain(s): Cultural domain, with a total of 2 topics and 118 subtopics.
- Size: 45.57k data instances.
- WAV file names are formatted as `{speaker_id}_khm_{sentence_id}.wav`.

License: CC-BY-SA-4.0

Locale: khm

Task: ASR

Format: WAV

Size: 35.86 GB
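
The file-naming pattern given in the Khmer ASR Cultural Dataset (V2) description, `{speaker_id}_khm_{sentence_id}.wav`, could be parsed roughly as follows. This is only a sketch: the listing does not say what the two IDs look like, so the example file names, the `Recording` type, and the `parse_filename` helper are hypothetical, and the pattern assumes a speaker ID never contains the substring `_khm_`.

```python
import re
from typing import NamedTuple, Optional

# Pattern taken from the dataset description: {speaker_id}_khm_{sentence_id}.wav
# Assumption: the speaker ID itself never contains the substring "_khm_".
FILENAME_RE = re.compile(r"^(?P<speaker_id>.+?)_khm_(?P<sentence_id>.+)\.wav$")


class Recording(NamedTuple):
    """IDs recovered from one WAV file name (hypothetical helper type)."""
    speaker_id: str
    sentence_id: str


def parse_filename(name: str) -> Optional[Recording]:
    """Return the IDs encoded in a file name, or None if it does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return Recording(match["speaker_id"], match["sentence_id"])


# Hypothetical file names, for illustration only.
print(parse_filename("spk01_khm_000123.wav"))
print(parse_filename("notes.txt"))
```

Returning `None` for non-matching names, rather than raising, makes it easy to skip stray files when walking an extracted archive.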

Common Voice

Common Voice Scripted Speech 24.0 - Vietnamese

A collection of scripted spoken phrases in Vietnamese.

License: CC0-1.0

Locale: vi

Task: ASR

Format: MP3

Size: 427.03 MB

Common Voice

Common Voice Scripted Speech 24.0 - Atayal

A collection of scripted spoken phrases in Atayal.

License: CC0-1.0

Locale: tay

Task: ASR

Format: MP3

Size: 248.14 MB

Common Voice

Common Voice Scripted Speech 24.0 - Hakha Chin

A collection of scripted spoken phrases in Hakha Chin.

License: CC0-1.0

Locale: cnh

Task: ASR

Format: MP3

Size: 160.39 MB

MEDIAMEN

English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)

This parallel sentence corpus contains 30,405 aligned sentence pairs, totaling approximately 0.62 million tokens, curated from the archival materials of Mediamen (Advertising Agency). The sentences were professionally translated from English into Punjabi (Shahmukhi) and are intended to support machine translation, linguistic research, and Punjabi language technology development, particularly for real-world and contemporary language use.

License: CC-BY-NC-4.0

Locale: en-PK, pnb

Task: MT

Format: CSV

Size: 1.08 MB

Common Voice

Common Voice Scripted Speech 24.0 - Punjabi

A collection of scripted spoken phrases in Punjabi.

License: CC0-1.0

Locale: pa-IN

Task: ASR

Format: MP3

Size: 110.84 MB

Common Voice

Common Voice Scripted Speech 24.0 - Tatar

A collection of scripted spoken phrases in Tatar.

License: CC0-1.0

Locale: tt

Task: ASR

Format: MP3

Size: 825.25 MB

Common Voice

Common Voice Scripted Speech 24.0 - Bengali

A collection of scripted spoken phrases in Bengali.

License: CC0-1.0

Locale: bn

Task: ASR

Format: MP3

Size: 24.75 GB

Common Voice

Common Voice Spontaneous Speech 2.0 - Tashlhiyt

A collection of spontaneous spoken phrases in Tashlhiyt.

License: CC0-1.0

Locale: shi

Task: ASR

Format: MP3

Size: 6.50 MB

Common Voice

Common Voice Scripted Speech 24.0 - Telugu

A collection of scripted spoken phrases in Telugu.

License: CC0-1.0

Locale: te

Task: ASR

Format: MP3

Size: 58.46 MB

Digital Divide Data

Khmer ASR Cultural Dataset

37.62 hours of manually curated Khmer speech-text pairs by native speakers, covering Cambodian cultural topics. On average, each recording is 8.29 seconds long, with a standard deviation of 3.87 seconds. Speaker metadata (gender, age group, and origin city) is provided.

License: CC-BY-SA-4.0

Locale: khm

Task: ASR

Format: WAV

Size: 12.59 GB

Common Voice

Common Voice Scripted Speech 24.0 - Tajik

A collection of scripted spoken phrases in Tajik.

License: CC0-1.0

Locale: tg

Task: ASR

Format: MP3

Size: 17.34 MB

Common Voice

Common Voice Scripted Speech 24.0 - Seri

A collection of scripted spoken phrases in Seri.

License: CC0-1.0

Locale: sei

Task: ASR

Format: MP3

Size: 208.50 MB

Common Voice

Common Voice Scripted Speech 24.0 - Cantonese

A collection of scripted spoken phrases in Cantonese.

License: CC0-1.0

Locale: yue

Task: ASR

Format: MP3

Size: 5.98 GB

Common Voice

Common Voice Scripted Speech 24.0 - Assamese

A collection of scripted spoken phrases in Assamese.

License: CC0-1.0

Locale: as

Task: ASR

Format: MP3

Size: 160.08 MB

Common Voice

Common Voice Scripted Speech 24.0 - Dhatki

A collection of scripted spoken phrases in Dhatki.

License: CC0-1.0

Locale: mki

Task: ASR

Format: MP3

Size: 187.48 MB

Common Voice

Common Voice Scripted Speech 24.0 - Tupuri

A collection of scripted spoken phrases in Tupuri.

License: CC0-1.0

Locale: tui

Task: ASR

Format: MP3

Size: 236.84 MB