Datasets

Jhoke Publisher Multan’s Saraiki Newspaper Corpus

Jhoke Publisher Multan’s Saraiki Newspaper Corpus (~1.25M tokens) is a normalized UTF-8 collection of Saraiki newspaper text.

License: CC-BY-NC-4.0

Locale: skr

Task: NLP

Format: TXT

Size: 2.30 MB

Institute of African Digital Humanities

Mada-French Parallel Corpus 1.0

A parallel corpus of 2,154 lines of literary texts translated from Mada (mxu) into French.

License: NOODL-1.0

Locale: mxu

Task: TTS

Format: TSV

Size: 122.37 KB

Community

Javanese TTS of Banyumasan Dialect

This dataset covers a variety of everyday-life topics, including society, environment, media, education, culture, and health.

License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: WEBM, TSV

Size: 559.08 MB

Taruen

Finnish Public Domain 20th Century Literature Text Corpus

A 69.1M-word early 20th-century literature corpus from Project Lönnrot. Predominantly Finnish, with a supplementary Swedish collection.

License: CC0-1.0

Locale: fi, sv

Task: NLP

Format: TXT

Size: 205.76 MB

Community

Thorsten-Voice-44kHz-Full

German speech dataset (44.1 kHz, 38k+ files, ~40 hours), CC0 licensed, multi-style (neutral, emotional, dialect), for TTS research.

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WAV, PARQUET

Size: 7.99 GB

Community

Thorsten-Voice Dataset 2023.09 Hessisch

German regional dialect speech dataset (Hessisch, 2,108 phrases), CC0 licensed, 22,050 Hz mono WAV, for TTS and speech research.

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WAV, CSV

Size: 255.96 MB

Community

Thorsten-Voice Dataset 2022.10

German neutral speech dataset (12,450 phrases, 11+ hours), CC0 licensed, LJSpeech-compatible, for TTS research and development.

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WAV, CSV

Size: 1.30 GB

Community

Thorsten-Voice Dataset 2021.06 Emotional

German emotional speech dataset (2,400 recordings, 8 emotions), CC0 licensed, 22,050 Hz mono WAV, for TTS and speech research.

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WAV, CSV

Size: 380.80 MB

Kaltepetlahtol

Daily Expressions in Highland Puebla Nahuatl

A corpus of more than 1,000 common expressions in Highland Puebla Nahuatl, partially translated and annotated.

License: CC-BY-SA-4.0

Locale: azz

Task: NLP

Format: TSV

Size: 22.00 KB

MDC Curators

Cuentos en Mam leídos en voz alta

A collection of stories (audio and text) in the Mam language: 40 stories totaling 1 hour 23 minutes of audio, with 958 sentences (7,441 words) of text.

License: CC-BY-SA-4.0

Locale: mam

Task: ASR

Format: MP3, TSV

Size: 110.28 MB

MDC Curators

Cuentos en Kʼicheʼ leídos en voz alta

A collection of stories (audio and text) in the Kʼicheʼ language: 1 hour 51 minutes of audio, with 726 sentences (8,283 words) of text.

License: CC-BY-SA-4.0

Locale: quc

Task: ASR

Format: MP3, TSV

Size: 152.62 MB

MDC Community Concierge

CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes

Over 11 million words (14.4M tokens) from written, spoken, and electronic Welsh-language sources, drawn from a range of genres, language varieties, and contexts.

License: CC-BY-NC-SA-4.0

Locale: cy

Task: NLP

Format: TXT, TSV

Size: 147.89 MB

MDC Curators

Finance Sentences - North American Spanish

A public domain corpus of approximately 80,000 sentences (1.3M tokens) of North American Spanish in the finance domain.

License: CC0-1.0

Locale: es-US

Task: NLP

Format: TSV, JSON

Size: 18.35 MB

Community

Thorsten-Voice Dataset 2021.02

German neutral speech dataset (22,668 phrases, 23+ hours), CC0 licensed, LJSpeech-compatible, for TTS research and development.

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WAV, CSV

Size: 2.55 GB

Community

Persian VOA Corpus 2003-2008

Persian (Farsi) VOA news articles from 2003-2008 in the original Perso-Arabic script (UTF-8), totaling 7.9 million words. Public domain.

License: Unlicense

Locale: fa

Task: NLP

Format: TXT

Size: 17.16 MB

Institute of African Digital Humanities

Lingala-TTS-Dataset

This dataset contains 4 hours 25 minutes of audio clips of spontaneous and semi-spontaneous speech in the Lingala language, accompanied by audio/text TSV files.

License: NOODL-1.0

Locale: lin

Task: TTS

Format: WAV, TSV

Size: 962.04 MB

Taruen

Polish Public Domain 20th Century Literature Text Corpus

A 4.2-million-word corpus of 54 iconic Polish novels, multi-volume epics, and documentary prose pieces from the late 19th and early 20th centuries.

License: CC0-1.0

Locale: pl

Task: NLP

Format: TXT

Size: 10.86 MB

Taruen

Dolgan Folklore Text Corpus

A 15.6k-word corpus of 19 Dolgan fairy tales, digitized to catalyze NLP research and revitalization for this highly endangered Siberian Turkic language.

License: CC0-1.0

Locale: dlg

Task: NLP

Format: TXT

Size: 57.16 KB

Tbilisi State University

GeoLogicQA: An LLM Benchmark for Logical Reasoning in Georgian

A manually curated dataset of 106 logical reasoning questions in Georgian, adapted from the Kangaroo Mathematics Competition and Komarovi School materials.

License: CC-BY-NC-SA-4.0

Locale: ka

Task: LLM

Format: JSON

Size: 15.14 KB

Community

Bojonegoro Javanese TTS

This dataset contains synthetic speech data covering a variety of everyday-life topics in the Bojonegoro dialect of Javanese, spoken in East Java, Indonesia.

License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: TAR.GZ, WEBM

Size: 469.50 MB

MIT

ATLAS Cross-Lingual Transfer Matrix

The Cross-Lingual Transfer Matrix from "ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining".

License: Apache-2.0

Locale: en-US

Task: NLP

Format: CSV

Size: 2.36 KB

Kaltepetlahtol

Zacatlán Tepetzintla Nahuatl ASR Dataset

A 14-hour ASR dataset of Nahuatl from Zacatlán and Tepetzintla, derived from the field recordings and transcription datasets of Amith et al. (2026).

License: CC-BY-ND-4.0

Locale: nhi

Task: ASR

Format: FLAC, TSV

Size: 789.98 MB

Taruen

Kyrgyz Folklore Text Corpus

A 427k-word Kyrgyz folklore corpus of tales, proverbs, and aphorisms, digitized from 5 Bishkek academic volumes (2016-2017) for NLP tasks.

License: CC0-1.0

Locale: ky

Task: NLP

Format: TXT

Size: 1.28 MB

OpenCSG

Fineweb-Edu-Chinese-v2.2

Fineweb-Edu-Chinese v2.2: an updated Chinese educational web dataset in the Fineweb series. Access via www.opencsg.com.

License: Apache-2.0

Locale: zh

Task: LLM

Format: PARQUET

Size: 624.68 MB