Datasets

Forum for Language Initiatives

Gojri Literature Corpus

A curated Gojri (Gujari) text corpus of approximately 60K tokens covering poetry, stories, short stories, and literary prose.
License: CC-BY-NC-4.0

Locale: gju

Task: NLP

Format: TXT

Size: 117.97 KB

Forum for Language Initiatives

Khowar Literature Corpus by FLI

A multi-genre Khowar language corpus designed for linguistic research, NLP applications, and cultural documentation.
License: CC-BY-NC-4.0

Locale: khw

Task: NLP

Format: TXT

Size: 244.85 KB

Forum for Language Initiatives

Khowar Word List

A UTF-8 encoded Khowar lexical corpus containing alphabet definitions and a structured word list for linguistic research and language technology development.
License: CC-BY-NC-4.0

Locale: khw

Task: NLP

Format: TXT

Size: 64.22 KB

Forum for Language Initiatives

Kohistani Shina Word List

A UTF-8 encoded Kohistani Shina dictionary and word list corpus with lexical entries, meanings, and annotations for linguistic research and NLP use.
License: CC-BY-NC-4.0

Locale: plk

Task: NLP

Format: TXT

Size: 394.05 KB

Balochistan Educational and Cultural Organization

Brahui Research Work Corpus

This corpus consists of research works, academic theses, and scholarly papers, comprising approximately 185,000 tokens. It covers a range of academic topics and formal registers, reflecting standardized writing practices and disciplinary conventions. The corpus is intended to support linguistic analysis, discourse studies, and computational research on academic and research-oriented text.
License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 1.13 MB

Balochistan Educational and Cultural Organization

Talar (تلار) Brahui Magazine Corpus

A ~150,000-word Brahui corpus from the monthly magazine Talar, covering editorials, essays, fiction, poetry, and socio-cultural commentary.
License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 317.22 KB

Balochistan Educational and Cultural Organization

Western Balochi Literature Corpus

A cleaned literary corpus of approximately 1.1M tokens in Western Balochi (Rakhshani), including articles, research, translations, and creative literature.
License: CC-BY-NC-4.0

Locale: bgn

Task: NLP

Format: TXT

Size: 2.26 MB

Balochistan Educational and Cultural Organization

NAWA-E-WATAN Balochi Newspaper Corpus

A ~1.02M-token Balochi newspaper corpus from NAWA-E-WATAN, representing contemporary journalistic and public discourse.
License: CC-BY-NC-4.0

Locale: bgn

Task: NLP

Format: TXT

Size: 1.43 MB

Collaborative Action For Research & Development (CARD)

Gawri (گاؤری) Magazine Corpus

The Gawri (گاؤری) Magazine Corpus is a curated collection of monthly Gawri magazine text (~67,724 tokens) that supports research and language technology.
License: CC-BY-NC-4.0

Locale: gwc

Task: NLP

Format: TXT

Size: 146.71 KB

Community

TTS Javanese - Ngapak Dialect

A collection of scripted speech recordings featuring the distinctive Ngapak dialect from the North Coast (Pantura) of Central Java Province, Indonesia.
License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: WEBM, TSV

Size: 567.12 MB

Universitas Gadjah Mada

Jember Javanese Spontaneous Speech Corpus

A 10-hour spoken dataset of native Javanese speakers from Jember, East Java, representing the Jember dialect and Pandhalungan variety in natural speech.
License: CC-BY-NC-SA-4.0

Locale: jav

Task: ASR

Format: MP3, TSV

Size: 271.65 MB

Ok nemi totlahtool

Zacatlán Tepetzintla Nahuatl Transcriptions

The most up-to-date version of the ongoing transcription effort for the Zacatlán Tepetzintla Nahuatl Audio dataset.
License: CC-BY-ND-4.0

Locale: nhi

Task: ASR

Format: TRS

Size: 320.28 KB

Ok nemi totlahtool

Zacatlán Tepetzintla Nahuatl Audio

Approximately 114 hours of recorded audio in the Zacatlán-Ahuacatlán-Tepetzintla Nahuatl language (Glottocode zaca1241).
License: CC-BY-ND-4.0

Locale: nhi

Task: ASR

Format: WAV

Size: 50.19 GB

Institute of African Digital Humanities

Bulu-TTS-Dataset 1.0

The dataset consists of 3 hours and 16 minutes of denoised audio clips, each paired with text and read by a single Bulu speaker.
License: NOODL-1.0

Locale: bum

Task: TTS

Format: MP3, TSV

Size: 87.40 MB

Community

TTS Sasak Language

A TTS dataset of everyday Sasak as used in informal contexts, covering a variety of topics.
License: CC-BY-SA-4.0

Locale: sas

Task: TTS

Format: WEBM, TSV

Size: 293.92 MB

Community

Betawi TTS of Cultural Language (BEKAL)

This dataset uses the Betawi dialect of West Java with Indonesian code-mixing and code-switching.
License: CC-BY-SA-4.0

Locale: bew

Task: TTS

Format: WEBM, TSV

Size: 309.99 MB

Digital Divide Data

Khmer ASR Cultural Dataset (V2)

106.53 hours of manually curated speech-text pairs by native speakers in the Khmer language, covering Cambodian cultural topics. On average, each recording is 8.42 seconds long, with a standard deviation of 3.39 seconds. Speaker metadata (gender, age group, and origin city) is provided.
- Language: Khmer (khm).
- Source(s): Native speakers from Cambodia (4 females, 4 males). The utterances were manually generated based on topics and subtopics listed in the metadata.
- Domain(s): Cultural domain, with a total of 2 topics and 118 subtopics.
- Size: 45.57k data instances.
- WAV file names are formatted as: `{speaker_id}_khm_{sentence_id}.wav`.
License: CC-BY-SA-4.0

Locale: khm

Task: ASR

Format: WAV

Size: 35.86 GB
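
The file-name template given for the Khmer ASR Cultural Dataset, `{speaker_id}_khm_{sentence_id}.wav`, can be split back into its two fields. The following is a minimal sketch, not part of the dataset's own tooling: the function name `parse_clip_name`, the `ClipName` type, and the sample file name are hypothetical, and the sketch assumes that the literal `_khm_` separator does not occur inside a speaker ID (the listing does not document the ID formats).

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class ClipName(NamedTuple):
    """Fields recovered from a clip file name (hypothetical helper type)."""
    speaker_id: str
    sentence_id: str


def parse_clip_name(path: str) -> Optional[ClipName]:
    """Split '{speaker_id}_khm_{sentence_id}.wav' into its two IDs.

    Returns None when the name does not follow the template.
    Assumption: '_khm_' never appears inside a speaker ID.
    """
    p = PurePosixPath(path)
    if p.suffix.lower() != ".wav":
        return None
    # partition() splits on the first occurrence of the separator.
    speaker_id, sep, sentence_id = p.stem.partition("_khm_")
    if not sep or not speaker_id or not sentence_id:
        return None
    return ClipName(speaker_id, sentence_id)


# Hypothetical file name, used only to illustrate the template.
clip = parse_clip_name("clips/spk01_khm_000123.wav")
print(clip)
```

Grouping clips by the recovered `speaker_id` is one way to join recordings with the speaker metadata (gender, age group, origin city) mentioned in the description, provided the metadata uses the same identifiers.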

Taruen

Taruen's Tatar Folklore Text Corpus

A 485k-word Tatar folklore corpus from 20th-century field recordings, selected from 5 academic volumes to prioritize contemporary linguistic usage.
License: CC0-1.0

Locale: tt

Task: NLP

Format: TXT

Size: 1.40 MB

Community

TTS-Tolaki

This dataset comprises a compilation of cultural narratives and children’s stories from Southeast Sulawesi, Indonesia, presented in the Tolaki language.
License: CC-BY-NC-SA-4.0

Locale: lbw

Task: TTS

Format: WEBM, TSV

Size: 249.04 MB

Community

Mandar Spontaneous Speech

This dataset is a compilation of spontaneous Mandar speech featuring Indonesian code-switching.
License: CC-BY-NC-4.0

Locale: mdr

Task: ASR

Format: MP3, TSV

Size: 534.45 MB

Community

TTS Central Javanese

This dataset consists of audio recordings and textual data in Central Javanese (Semarang dialect), including Indonesian and English code-switching.
License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: WEBM, TSV

Size: 440.11 MB

Community

TTS Javanese-Lumajang Dialect

This dataset comprises audio recordings of scripted speech in the Lumajang dialect of Javanese from East Java, Indonesia.
License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: WEBM, TSV

Size: 684.32 MB

Institute of African Digital Humanities

Adamawa Fulfulde-French Parallel Corpus of Narratives 1.2

This dataset is an updated version of the 'Adamawa Fulfulde–French Parallel Corpus of Narratives 1.1'.
License: NOODL-1.0

Locale: fub

Task: MT

Format: TSV

Size: 112.17 KB

Institute of African Digital Humanities

Ewondo-TTS-Dataset

The dataset consists of four hours of high-quality audio clips, each paired with text and read by a single speaker.
License: NOODL-1.0

Locale: ewo

Task: TTS

Format: MP3, TSV

Size: 152.70 MB