Datasets

A 4.2-million-word corpus of 54 iconic Polish novels, multi-volume epics, and documentary prose pieces from the late 19th and early 20th centuries.

Polish Public Domain 20th Century Literature Text Corpus

License: CC0-1.0

Locale: pl

Task: NLP

Format: TXT

Size: 10.86 MB

A 15.6k-word corpus of 19 Dolgan fairy tales, digitized to support NLP research on, and revitalization of, this highly endangered Siberian Turkic language.

Dolgan Folklore Text Corpus

License: CC0-1.0

Locale: dlg

Task: NLP

Format: TXT

Size: 57.16 KB

Tbilisi State University

GeoLogicQA: An LLM Benchmark for Logical Reasoning in Georgian

A manually curated dataset of 106 logical reasoning questions in Georgian, adapted from the Kangaroo Mathematics Competition and Komarovi School materials.

License: CC-BY-NC-SA-4.0

Locale: ka

Task: LLM

Format: JSON

Size: 15.14 KB

Community

Bojonegoro Javanese TTS

This dataset contains synthetic speech data covering a variety of everyday-life topics in the Bojonegoro dialect of Javanese, spoken in East Java, Indonesia.

License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: .tar.gz, WEBM

Size: 469.50 MB

MIT

ATLAS Cross-Lingual Transfer Matrix

The Cross-Lingual Transfer Matrix from "ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining".

License: Apache-2.0

Locale: en-US

Task: NLP

Format: CSV

Size: 2.36 KB

Kaltepetlahtol

Zacatlán Tepetzintla Nahuatl ASR Dataset

A 14-hour ASR dataset of Nahuatl from Zacatlán and Tepetzintla, derived from the field recordings and transcription datasets of Amith et al. (2026).

License: CC-BY-ND-4.0

Locale: nhi

Task: ASR

Format: FLAC, TSV

Size: 789.98 MB

A 427k-word Kyrgyz folklore corpus of tales, proverbs, and aphorisms, digitized from 5 Bishkek academic volumes (2016-2017) for NLP tasks.

Kyrgyz Folklore Text Corpus

License: CC0-1.0

Locale: ky

Task: NLP

Format: TXT

Size: 1.28 MB

OpenCSG

Fineweb-Edu-Chinese-v2.2

Fineweb-Edu-Chinese v2.2: an updated Chinese educational web dataset in the Fineweb series, available via www.opencsg.com.

License: Apache-2.0

Locale: zh

Task: LLM

Format: parquet

Size: 624.68 MB

Community

Manggarai Language for NLP

The dataset consists of responses to various prompts written in the Manggarai language. These responses were subsequently read aloud and recorded.

License: CC-BY-NC-SA-4.0

Locale: mqy

Task: TTS

Format: WEBM, TSV

Size: 287.61 MB

A machine-readable JSON archive of the CIA World Factbook (Jan 2026 snapshot). Includes both standard developer and raw cache versions with image metadata.

World Factbook (JSON)

License: CC0-1.0

Locale: en

Task: NLP

Format: JSON

Size: 7.10 MB

Balochi Academy

Eastern Balochi Literature Corpus

A UTF-8 normalized Eastern Balochi literature corpus (~1.9M tokens) covering poetry, folklore, novels, and cultural texts for linguistic research and NLP.

License: CC-BY-NC-4.0

Locale: bgp

Task: NLP

Format: TXT

Size: 949.67 KB

Nick Fox-Gieg

ABC-Draco

A GLTF Draco conversion of the NYU ABC-Dataset.

License: Onshape

Locale: en-US

Task: CV

Format: GLTF with Draco compression

Size: 43.32 GB

Universidad Nacional Autónoma de México, UNAM

Trabajo de Campo - Huave

An annotated audio corpus of Huave, a language of Indigenous communities in Mexico, recorded in the San Mateo del Mar region of Oaxaca.

License: CC-BY-4.0

Locale: huv

Task: ASR

Format: MP3, TSV

Size: 538.25 MB

A curated Gojri (Gujari) text corpus of approximately 60K tokens covering poetry, stories, short stories, and literary prose.

Gojri Literature Corpus

License: CC-BY-NC-4.0

Locale: gju

Task: NLP

Format: TXT

Size: 117.97 KB

A multi-genre Khowar language corpus designed for linguistic research, NLP applications, and cultural documentation.

Khowar Literature Corpus by FLI

License: CC-BY-NC-4.0

Locale: khw

Task: NLP

Format: TXT

Size: 244.85 KB

A UTF-8 encoded Khowar lexical corpus containing alphabet definitions and a structured word list for linguistic research and language technology development.

Khowar Word List

License: CC-BY-NC-4.0

Locale: khw

Task: NLP

Format: TXT

Size: 64.22 KB

A UTF-8 encoded Kohistani Shina dictionary and word list corpus with lexical entries, meanings, and annotations for linguistic research and NLP use.

Kohistani Shina Word List

License: CC-BY-NC-4.0

Locale: plk

Task: NLP

Format: TXT

Size: 394.05 KB

This corpus consists of research works, academic theses, and scholarly papers, comprising approximately 185,000 tokens. It covers a range of academic topics and formal registers, reflecting standardized writing practices and disciplinary conventions. The corpus is intended to support linguistic analysis, discourse studies, and computational research on academic and research-oriented text.

Brahui Research Work Corpus

License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 1.13 MB

A ~150,000-word Brahui corpus from the monthly magazine Talar, covering editorials, essays, fiction, poetry, and socio-cultural commentary.

Talar (تلار) Brahui Magazine Corpus

License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 317.22 KB

A cleaned literary corpus (~1.1M tokens) in Western Balochi (Rakhshani), including articles, research, translations, and creative literature.

Western Balochi Literature Corpus

License: CC-BY-NC-4.0

Locale: bgn

Task: NLP

Format: TXT

Size: 2.26 MB