Datasets

Balochistan Educational and Cultural Organization

NAWA-E-WATAN Balochi Newspaper Corpus

A ~1.02M-token Balochi newspaper corpus from NAWA-E-WATAN, representing contemporary journalistic and public discourse.
License: CC-BY-NC-4.0

Locale: bgn

Task: NLP

Format: TXT

Size: 1.43 MB

Collaborative Action For Research & Development (CARD)

Gawri (گاؤری) Magazine Corpus

The Gawri (گاؤری) Magazine Corpus is a curated collection of monthly Gawri magazine text (~67,724 tokens) that supports research and language technology.
License: CC-BY-NC-4.0

Locale: gwc

Task: NLP

Format: TXT

Size: 146.71 KB

Community

TTS Javanese - Ngapak Dialect

A scripted speech collection of audio recordings featuring the distinctive Ngapak dialect from the north coast (Pantura) of Central Java Province, Indonesia.
License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: WEBM, TSV

Size: 567.12 MB

Universitas Gadjah Mada

Jember Javanese Spontaneous Speech Corpus

A 10-hour spoken dataset of native Javanese speakers from Jember, East Java, representing the Jember dialect and Pandhalungan variety in natural speech.
License: CC-BY-NC-SA-4.0

Locale: jav

Task: ASR

Format: MP3, TSV

Size: 271.65 MB

Ok nemi totlahtool

Zacatlán Tepetzintla Nahuatl Transcriptions

The most up-to-date version of the ongoing transcription effort corresponding to the Zacatlán Tepetzintla Nahuatl Audio dataset.
License: CC-BY-ND-4.0

Locale: nhi

Task: ASR

Format: TRS

Size: 320.28 KB

Ok nemi totlahtool

Zacatlán Tepetzintla Nahuatl Audio

Approximately 114 hours of recorded audio in the Zacatlán-Ahuacatlán-Tepetzintla Nahuatl language (Glottocode zaca1241).
License: CC-BY-ND-4.0

Locale: nhi

Task: ASR

Format: WAV

Size: 50.19 GB

Institute of African Digital Humanities

Bulu-TTS-Dataset 1.0

The dataset consists of 3 hours and 16 minutes of denoised audio clips, each paired with text and read by a single Bulu speaker.
License: NOODL-1.0

Locale: bum

Task: TTS

Format: MP3, TSV

Size: 87.40 MB

Community

TTS Sasak Language

A TTS dataset of everyday Sasak as used in informal contexts, covering various topics.
License: CC-BY-SA-4.0

Locale: sas

Task: TTS

Format: WEBM, TSV

Size: 293.92 MB

Community

Betawi TTS of Cultural Language (BEKAL)

This dataset uses the Betawi dialect of West Java with Indonesian code-mixing and code-switching.
License: CC-BY-SA-4.0

Locale: bew

Task: TTS

Format: WEBM, TSV

Size: 309.99 MB

Digital Divide Data

Khmer ASR Cultural Dataset (V2)

106.53 hours of manually curated speech-text pairs recorded by native Khmer speakers on Cambodian cultural topics. On average, each recording is 8.42 seconds long, with a standard deviation of 3.39. Speaker metadata (gender, age group, and origin city) is provided.
- Language: Khmer (khm).
- Source(s): Native speakers from Cambodia (4 females, 4 males). The utterances were manually generated based on the topics and subtopics listed in the metadata.
- Domain(s): Cultural domain, with a total of 2 topics and 118 subtopics.
- Size: 45.57k data instances.
- WAV file names are formatted as `{speaker_id}_khm_{sentence_id}.wav`.
License: CC-BY-SA-4.0

Locale: khm

Task: ASR

Format: WAV

Size: 35.86 GB
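
The file-naming convention and duration figures given for the Khmer ASR Cultural Dataset (V2) can be illustrated with a minimal Python sketch. The helper names `parse_clip_name` and `mean_clip_seconds` and the sample file name are hypothetical, and the regular expression assumes that the speaker ID does not itself contain the literal `_khm_`; the listing does not specify the ID formats.

```python
import re
from typing import NamedTuple


class ClipName(NamedTuple):
    """IDs recovered from a `{speaker_id}_khm_{sentence_id}.wav` file name."""
    speaker_id: str
    sentence_id: str


# Assumption: the first "_khm_" in the name separates the two IDs.
_CLIP_PATTERN = re.compile(r"^(?P<speaker_id>.+?)_khm_(?P<sentence_id>.+)\.wav$")


def parse_clip_name(filename: str) -> ClipName:
    """Split a WAV file name into its speaker and sentence IDs."""
    match = _CLIP_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return ClipName(match["speaker_id"], match["sentence_id"])


def mean_clip_seconds(total_hours: float, n_clips: int) -> float:
    """Average clip length in seconds from total duration and clip count."""
    return total_hours * 3600 / n_clips


if __name__ == "__main__":
    # Hypothetical file name, used only to exercise the pattern.
    print(parse_clip_name("spk01_khm_000123.wav"))
    # 45.57k is a rounded count, so this only approximates the stated 8.42 s.
    print(round(mean_clip_seconds(106.53, 45_570), 2))
```

Dividing the stated total duration by the rounded instance count gives a mean close to the reported 8.42 seconds, which is a quick consistency check on the listing's figures.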

Taruen

Taruen's Tatar Folklore Text Corpus

A 485k-word Tatar folklore corpus from 20th-century field recordings, selected from 5 academic volumes to prioritize contemporary linguistic usage.
License: CC0-1.0

Locale: tt

Task: NLP

Format: TXT

Size: 1.40 MB

Community

TTS-Tolaki

This dataset comprises a compilation of cultural narratives and children’s stories from Southeast Sulawesi, Indonesia, presented in the Tolaki language.
License: CC-BY-NC-SA-4.0

Locale: lbw

Task: TTS

Format: WEBM, TSV

Size: 249.04 MB

Community

Mandar Spontaneous Speech

This dataset is a compilation of spontaneous Mandar speech featuring Indonesian code-switching.
License: CC-BY-NC-4.0

Locale: mdr

Task: ASR

Format: MP3, TSV

Size: 534.45 MB

Community

TTS Central Javanese

This dataset consists of audio recordings and textual data in Central Javanese (Semarang dialect), including Indonesian and English code-switching.
License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: WEBM, TSV

Size: 440.11 MB

Community

TTS Javanese-Lumajang Dialect

This dataset comprises audio recordings of scripted speech in the Lumajang dialect of Javanese from East Java, Indonesia.
License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: WEBM, TSV

Size: 684.32 MB

Institute of African Digital Humanities

Adamawa Fulfulde-French Parallel Corpus of Narratives 1.2

This dataset is an updated version of the 'Adamawa Fulfulde–French Parallel Corpus of Narratives 1.1'.
License: NOODL-1.0

Locale: fub

Task: MT

Format: TSV

Size: 112.17 KB

Institute of African Digital Humanities

Ewondo-TTS-Dataset

The dataset consists of four hours of high-quality audio clips, each paired with text and read by a single speaker.
License: NOODL-1.0

Locale: ewo

Task: TTS

Format: MP3, TSV

Size: 152.70 MB

Institute of African Digital Humanities

Bamun-French Parallel Corpus 1.1

This dataset is an updated version of the "Bamun-French Parallel Corpus", a parallel corpus of texts in Bamun (Shupament) and French.
License: NOODL-1.0

Locale: bax

Task: MT

Format: TSV

Size: 99.78 KB

Community

TTS Muna Dataset

This dataset comprises a compilation of cultural narratives and children’s stories from Southeast Sulawesi, Indonesia, presented in the Muna language.
License: CC-BY-NC-SA-4.0

Locale: mnb

Task: TTS

Format: WEBM, TSV

Size: 316.34 MB

The University of Melbourne

Hawrami Kurdish TTS dataset 1.0

This dataset contains high-quality single-speaker audio recordings in Hawrami Kurdish (Hewrami, ISO 639-3: hac), also known as the Gorani language, intended for building Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems. The dataset comprises 5 hours and 15 minutes of aligned audio and text data. Hawrami is classified as Definitely Endangered by UNESCO.
License: CC-BY-4.0

Locale: hac

Task: TTS

Format: WAV

Size: 706.11 MB

Common Voice

Common Voice 7.0 - Single Word Target Segment

This dataset contains the numbers 0 to 9 and the words "yes" and "no" in 34 languages. It contains 84 validated hours of speech.
License: CC0-1.0

Locale: mul

Task: ASR

Format: TSV, MP3

Size: 3.51 GB

EELLAK - GreekFOSS

Greek PhD Theses Corpus v1.0

The Greek PhD Theses Corpus is a large, AI-ready dataset
License: CC-BY-NC-SA-4.0

Locale: gr-GR

Task: NLP

Format: JSONL

Size: 7.02 GB

EELLAK - GreekFOSS

openbook.gr v1.0

Greek digital books corpus for NLP and linguistic analysis
License: CC-BY-NC-SA-4.0

Locale: gr-GR

Task: NLP

Format: Markdown (.md)

Size: 251.63 MB

TidyVoice2026 Challenge

TidyVoiceX2_ASV

This dataset is designed for speaker verification using the Mozilla Common Voice corpus, covering approximately 40 additional languages beyond those included in TidyVoiceX_ASV. It comprises recordings from different speakers, each of whom appears in multiple languages. Leveraging this multilingual overlap, we construct trial pairs to investigate cross-lingual variation in the speaker verification task. This dataset served as the evaluation set for the TidyVoice 2026 Challenge.
License: CC0-1.0

Locale: mul

Task: OTH

Format: WAV

Size: 23.11 GB