Datasets

MDC Curators

Cuentos en Mam leídos en voz alta

A collection of stories (audio and text) in the Mam language. 40 stories, totaling 1 hour 23 minutes of audio with 958 sentences (7,441 words) of text.
License: CC-BY-SA-4.0
Locale: mam
Task: ASR
Format: MP3, TSV
Size: 110.28 MB

MDC Curators

Cuentos en Kʼicheʼ leídos en voz alta

A collection of stories (audio and text) in the Kʼicheʼ language. 1 hour 51 minutes of audio with 726 sentences (8,283 words) of text.
License: CC-BY-SA-4.0
Locale: quc
Task: ASR
Format: MP3, TSV
Size: 152.62 MB

MDC Community Concierge

CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes

Over 11 million words (14.4M tokens) from written, spoken, and electronic Welsh-language sources, taken from a range of genres, language varieties, and contexts.
License: CC-BY-NC-SA-4.0
Locale: cy
Task: NLP
Format: TXT, TSV
Size: 147.89 MB

MDC Curators

Finance Sentences - North American Spanish

A public domain corpus of approximately 80,000 sentences (1.3M tokens) of North American Spanish in the finance domain.
License: CC0-1.0
Locale: es-US
Task: NLP
Format: TSV, JSON
Size: 18.35 MB

Community

Thorsten-Voice Dataset 2021.02

German neutral speech dataset (22,668 phrases, 23+ hours), CC0 licensed, LJSpeech-compatible, for TTS research and development.
License: CC0-1.0
Locale: de-DE
Task: TTS
Format: WAV, CSV
Size: 2.55 GB

Community

Persian VOA Corpus 2003-2008

Persian (Farsi) VOA news articles from 2003-2008 in the original UTF-8 Perso-Arabic script: 7.9 million words, in the public domain.
License: Unlicense
Locale: fa
Task: NLP
Format: TXT
Size: 17.16 MB

Institute of African Digital Humanities

Lingala-TTS-Dataset

This dataset contains 4 hours and 25 minutes of audio clips of spontaneous and semi-spontaneous speech in the Lingala language, accompanied by audio/text TSV files.
License: NOODL-1.0
Locale: lin
Task: TTS
Format: WAV, TSV
Size: 962.04 MB

Taruen

Polish Public Domain 20th Century Literature Text Corpus

A 4.2-million-word corpus of 54 iconic Polish novels, multi-volume epics, and documentary prose pieces from the late 19th and early 20th centuries.
License: CC0-1.0
Locale: pl
Task: NLP
Format: TXT
Size: 10.86 MB

Taruen

Dolgan Folklore Text Corpus

A 15.6k-word corpus of 19 Dolgan fairy tales, digitized to catalyze NLP research and revitalization for this highly endangered Siberian Turkic language.
License: CC0-1.0
Locale: dlg
Task: NLP
Format: TXT
Size: 57.16 KB

Tbilisi State University

GeoLogicQA: An LLM Benchmark for Logical Reasoning in Georgian

A manually-curated dataset of 106 logical reasoning questions in Georgian, adapted from the Kangaroo Mathematics Competition and Komarovi School materials.
License: CC-BY-NC-SA-4.0
Locale: ka
Task: LLM
Format: JSON
Size: 15.14 KB

Community

Bojonegoro Javanese TTS

This dataset contains synthetic speech data covering a variety of everyday-life topics in the Bojonegoro dialect of Javanese, spoken in East Java, Indonesia.
License: CC-BY-SA-4.0
Locale: jav
Task: TTS
Format: .tar.gz, WEBM
Size: 469.50 MB

MIT

ATLAS Cross-Lingual Transfer Matrix

The Cross-Lingual Transfer Matrix from "ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining".
License: Apache-2.0
Locale: en-US
Task: NLP
Format: CSV
Size: 2.36 KB

Kaltepetlahtol

Zacatlán Tepetzintla Nahuatl ASR Dataset

A 14-hour ASR dataset of Nahuatl from Zacatlán and Tepetzintla, derived from the field recordings and transcription datasets of Amith et al. (2026).
License: CC-BY-ND-4.0
Locale: nhi
Task: ASR
Format: FLAC, TSV
Size: 789.98 MB

Taruen

Kyrgyz Folklore Text Corpus

A 427k-word Kyrgyz folklore corpus of tales, proverbs, and aphorisms, digitized from 5 Bishkek academic volumes (2016-2017) for NLP tasks.
License: CC0-1.0
Locale: ky
Task: NLP
Format: TXT
Size: 1.28 MB

OpenCSG

Finweb-Edu-Chinese-v2.2

Fineweb-Edu-Chinese v2.2: an updated Chinese educational web dataset in the Fineweb series. Access via www.opencsg.com.
License: Apache-2.0
Locale: zh
Task: LLM
Format: parquet
Size: 624.68 MB

Community

Manggarai Language for NLP

The dataset consists of responses to various prompts written in the Manggarai language. These responses were subsequently read aloud and recorded.
License: CC-BY-NC-SA-4.0
Locale: mqy
Task: TTS
Format: WEBM, TSV
Size: 287.61 MB

Taruen

World Factbook (JSON)

A machine-readable JSON archive of the CIA World Factbook (Jan 2026 snapshot). Includes both standard developer and raw cache versions with image metadata.
License: CC0-1.0
Locale: en
Task: NLP
Format: JSON
Size: 7.10 MB

Balochi Academy

Eastern Balochi Literature Corpus

A UTF-8 normalized Eastern Balochi literature corpus (~1.9M tokens) covering poetry, folklore, novels, and cultural texts for linguistic research and NLP.
License: CC-BY-NC-4.0
Locale: bgp
Task: NLP
Format: TXT
Size: 949.67 KB

Nick Fox-Gieg

ABC-Draco

A GLTF Draco conversion of the NYU ABC-Dataset.
License: Onshape
Locale: en-US
Task: CV
Format: GLTF with Draco compression
Size: 43.32 GB

Universidad Nacional Autónoma de México, UNAM

Trabajo de Campo - Huave

An annotated audio corpus from the San Mateo del Mar region of Oaxaca, in a language of Indigenous communities of Mexico.
License: CC-BY-4.0
Locale: huv
Task: ASR
Format: MP3, TSV
Size: 538.25 MB

Forum for Language Initiatives

Gojri Literature Corpus

A curated Gojri (Gujari) text corpus of approximately 60K tokens covering poetry, stories, short stories, and literary prose.
License: CC-BY-NC-4.0
Locale: gju
Task: NLP
Format: TXT
Size: 117.97 KB

Forum for Language Initiatives

Khowar Literature Corpus by FLI

A multi-genre Khowar language corpus designed for linguistic research, NLP applications, and cultural documentation.
License: CC-BY-NC-4.0
Locale: khw
Task: NLP
Format: TXT
Size: 244.85 KB

Forum for Language Initiatives

Khowar Word List

A UTF-8 encoded Khowar lexical corpus containing alphabet definitions and a structured word list for linguistic research and language technology development.
License: CC-BY-NC-4.0
Locale: khw
Task: NLP
Format: TXT
Size: 64.22 KB

Forum for Language Initiatives

Kohistani Shina Word List

A UTF-8 encoded Kohistani Shina dictionary and word list corpus with lexical entries, meanings, and annotations for linguistic research and NLP use.
License: CC-BY-NC-4.0
Locale: plk
Task: NLP
Format: TXT
Size: 394.05 KB