Datasets

Institute of African Digital Humanities

Lingala-TTS-Dataset

This dataset contains 4 hours and 25 minutes of audio clips of spontaneous and semi-spontaneous speech in the Lingala language, accompanied by audio/text TSV files.
License: NOODL-1.0
Locale: lin
Task: TTS
Format: WAV, TSV
Size: 962.04 MB

Taruen

Polish Public Domain 20th Century Literature Text Corpus

A 4.2-million-word corpus of 54 iconic Polish novels, multi-volume epics, and documentary prose pieces from the late 19th and early 20th centuries.
License: CC0-1.0
Locale: pl
Task: NLP
Format: TXT
Size: 10.86 MB

Taruen

Dolgan Folklore Text Corpus

A 15.6k-word corpus of 19 Dolgan fairy tales, digitized to catalyze NLP research and revitalization for this highly endangered Siberian Turkic language.
License: CC0-1.0
Locale: dlg
Task: NLP
Format: TXT
Size: 57.16 KB

Tbilisi State University

GeoLogicQA: An LLM Benchmark for Logical Reasoning in Georgian

A manually curated dataset of 106 logical reasoning questions in Georgian, adapted from the Kangaroo Mathematics Competition and Komarovi School materials.
License: CC-BY-NC-SA-4.0
Locale: ka
Task: LLM
Format: JSON
Size: 15.14 KB

Community

Bojonegoro Javanese TTS

This dataset contains synthetic speech data covering a variety of everyday-life topics in the Bojonegoro dialect of Javanese, spoken in East Java, Indonesia.
License: CC-BY-SA-4.0
Locale: jav
Task: TTS
Format: .tar.gz, WEBM
Size: 469.50 MB

MIT

ATLAS Cross-Lingual Transfer Matrix

The Cross-Lingual Transfer Matrix from "ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining".
License: Apache-2.0
Locale: en-US
Task: NLP
Format: CSV
Size: 2.36 KB

Kaltepetlahtol

Zacatlán Tepetzintla Nahuatl ASR Dataset

A 14-hour ASR dataset of Nahuatl from Zacatlán and Tepetzintla, derived from the field recordings and transcription datasets of Amith et al. (2026).
License: CC-BY-ND-4.0
Locale: nhi
Task: ASR
Format: FLAC, TSV
Size: 789.98 MB

Taruen

Kyrgyz Folklore Text Corpus

A 427k-word Kyrgyz folklore corpus of tales, proverbs, and aphorisms, digitized from 5 Bishkek academic volumes (2016-2017) for NLP tasks.
License: CC0-1.0
Locale: ky
Task: NLP
Format: TXT
Size: 1.28 MB

OpenCSG

Fineweb-Edu-Chinese-v2.2

Fineweb-Edu-Chinese v2.2: an updated Chinese educational web dataset in the Fineweb series, available via www.opencsg.com.
License: Apache-2.0
Locale: zh
Task: LLM
Format: parquet
Size: 624.68 MB

Community

Manggarai Language for NLP

The dataset consists of responses to various prompts written in the Manggarai language. These responses were subsequently read aloud and recorded.
License: CC-BY-NC-SA-4.0
Locale: mqy
Task: TTS
Format: WEBM, TSV
Size: 287.61 MB

Taruen

World Factbook (JSON)

A machine-readable JSON archive of the CIA World Factbook (Jan 2026 snapshot). Includes both standard developer and raw cache versions with image metadata.
License: CC0-1.0
Locale: en
Task: NLP
Format: JSON
Size: 7.10 MB

Balochi Academy

Eastern Balochi Literature Corpus

A UTF-8 normalized Eastern Balochi literature corpus (~1.9M tokens) covering poetry, folklore, novels, and cultural texts for linguistic research and NLP.
License: CC-BY-NC-4.0
Locale: bgp
Task: NLP
Format: TXT
Size: 949.67 KB

Nick Fox-Gieg

ABC-Draco

A GLTF Draco conversion of the NYU ABC-Dataset.
License: Onshape
Locale: en-US
Task: CV
Format: GLTF with Draco compression
Size: 43.32 GB

Universidad Nacional Autónoma de México, UNAM

Trabajo de Campo - Huave

An annotated audio corpus from the San Mateo del Mar region of Oaxaca, in a language of Indigenous communities of Mexico.
License: CC-BY-4.0
Locale: huv
Task: ASR
Format: MP3, TSV
Size: 538.25 MB

Forum for Language Initiatives

Gojri Literature Corpus

A curated Gojri (Gujari) text corpus of approximately 60K tokens covering poetry, stories, short stories, and literary prose.
License: CC-BY-NC-4.0
Locale: gju
Task: NLP
Format: TXT
Size: 117.97 KB

Forum for Language Initiatives

Khowar Literature Corpus by FLI

A multi-genre Khowar language corpus designed for linguistic research, NLP applications, and cultural documentation.
License: CC-BY-NC-4.0
Locale: khw
Task: NLP
Format: TXT
Size: 244.85 KB

Forum for Language Initiatives

Khowar Word List

A UTF-8 encoded Khowar lexical corpus containing alphabet definitions and a structured word list for linguistic research and language technology development.
License: CC-BY-NC-4.0
Locale: khw
Task: NLP
Format: TXT
Size: 64.22 KB

Forum for Language Initiatives

Kohistani Shina Word List

A UTF-8 encoded Kohistani Shina dictionary and word list corpus with lexical entries, meanings, and annotations for linguistic research and NLP use.
License: CC-BY-NC-4.0
Locale: plk
Task: NLP
Format: TXT
Size: 394.05 KB

Balochistan Educational and Cultural Organization

Brahui Research Work Corpus

This corpus consists of research works, academic theses, and scholarly papers, comprising approximately 185,000 tokens. It covers a range of academic topics and formal registers, reflecting standardized writing practices and disciplinary conventions. The corpus is intended to support linguistic analysis, discourse studies, and computational research on academic and research-oriented text.
License: CC-BY-NC-SA-4.0
Locale: brh
Task: NLP
Format: TXT
Size: 1.13 MB

Balochistan Educational and Cultural Organization

Talar (تلار) Brahui Magazine Corpus

A ~150,000-word Brahui corpus from the monthly magazine Talar, covering editorials, essays, fiction, poetry, and socio-cultural commentary.
License: CC-BY-NC-SA-4.0
Locale: brh
Task: NLP
Format: TXT
Size: 317.22 KB

Balochistan Educational and Cultural Organization

Western Balochi Literature Corpus

A cleaned literary corpus in Western Balochi (Rakhshani) of ~1.1M tokens, including articles, research, translations, and creative literature.
License: CC-BY-NC-4.0
Locale: bgn
Task: NLP
Format: TXT
Size: 2.26 MB

Balochistan Educational and Cultural Organization

NAWA-E-WATAN Balochi Newspaper Corpus

A ~1.02M-token Balochi newspaper corpus from NAWA-E-WATAN, representing contemporary journalistic and public discourse.
License: CC-BY-NC-4.0
Locale: bgn
Task: NLP
Format: TXT
Size: 1.43 MB

Collaborative Action For Research & Development (CARD)

Gawri (گاؤری) Magazine Corpus

The Gawri (گاؤری) Magazine Corpus is a curated collection of monthly Gawri magazine text (~67,724 tokens) that supports research and language technology.
License: CC-BY-NC-4.0
Locale: gwc
Task: NLP
Format: TXT
Size: 146.71 KB

Community

TTS Javanese - Ngapak Dialect

A scripted speech collection of audio recordings featuring the distinctive Ngapak dialect from the north coast (Pantura) of Central Java Province, Indonesia.
License: CC-BY-SA-4.0
Locale: jav
Task: TTS
Format: WEBM, TSV
Size: 567.12 MB