Datasets

Community

Kokoro Speech Dataset

Kokoro Speech Dataset is a public domain Japanese speech dataset. (https://github.com/kaiidams/Kokoro-Speech-Dataset)

License: libribox

Locale: ja

Task: TTS

Format: FLAC

Size: 3.98 GB

Community

Sundanese TTS

This dataset uses the Priangan dialect of West Java with Indonesian code-mixing and code-switching.

License: CC-BY-SA-4.0

Locale: sun

Task: TTS

Format: WEBM, TSV

Size: 298.10 MB

MDC Community Concierge

Bangor Miami Spanish-English Corpus

Spanish-English bilingual speech corpus with 35 hours of recorded audio and 240,000 words.

License: GPL-3.0

Locale: es-US, en-US

Task: ASR

Format: MP3, CHA, TSV

Size: 1.12 GB

Keblagh e Azergi

Elkhani Hazargi Literature Corpus

Hazargi literary corpus (~0.5M tokens) of poetry, folklore, and prose texts representing Hazara linguistic and cultural heritage.

License: CC-BY-NC-4.0

Locale: haz

Task: NLP

Format: TXT

Size: 2.46 MB

Aim Foundation

Dari Literature Corpus by Anjuman e Adabi Nayestan

A ~1M-token Dari (Afghan Persian) literary corpus compiled by Anjuman e Adabi Nayestan, covering prose, poetry, and cultural texts in Perso-Arabic script.

License: CC-BY-NC-4.0

Locale: prs

Task: NLP

Format: TXT

Size: 12.67 MB

Collaborative Action For Research & Development (CARD)

IBT Torwali Wordlist

The IBT Torwali Wordlist contains approximately 20,000 unique entries in Torwali (ISO 639-3: trw), an under-documented Indo-Aryan language spoken in northern Pakistan. The dataset comprises standardized lexical entries covering core vocabulary, function words, and culturally salient terms, with consistent orthography and normalization suitable for linguistic and computational use. Entries are aligned with English and Urdu glosses and include part-of-speech tags.

License: CC-BY-SA-4.0

Locale: trw

Task: NLP

Format: CSV

Size: 312.87 KB

MDC Community Concierge

Bangor Siarad Welsh-English Corpus

Welsh-English bilingual speech corpus with 40 hours of recorded audio and transcriptions making up 450,000 words.

License: GPL-3.0

Locale: cym

Task: ASR

Format: MP3, CHA, TSV

Size: 2.13 GB

MDC Community Concierge

Bangor Patagonia Welsh-Spanish Corpus

Welsh-Spanish corpus containing around 195,000 words.

License: GPL-3.0

Locale: cym, spa

Task: ASR

Format: MP3, CHA, TSV

Size: 988.02 MB

Kaleem Art Press

Saraiki-English Parallel Corpus

English–Saraiki parallel corpus: 51,447 aligned sentence pairs (~0.89M words), translated by Kaleem Art Press for MT and Saraiki NLP research.

License: CC-BY-NC-4.0

Locale: mul

Task: MT

Format: CSV

Size: 1.92 MB

Kaleem Art Press

Jhoke Publisher Multan’s Saraiki Newspaper Corpus

Jhoke Publisher Multan’s Saraiki Newspaper Corpus (~1.25M tokens) is a normalized UTF-8 collection of Saraiki newspaper text.

License: CC-BY-NC-4.0

Locale: skr

Task: NLP

Format: TXT

Size: 2.30 MB

Institute of African Digital Humanities

Mada-French Parallel Corpus 1.0

This dataset comprises a parallel corpus of 2,154 lines of translations of literary texts from Mada (mxu) to French.

License: NOODL-1.0

Locale: mxu

Task: TTS

Format: TSV

Size: 122.37 KB

Community

Javanese TTS of Banyumasan Dialect

This dataset covers a range of everyday-life topics, including society, environment, media, education, culture, and health.

License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: WEBM, TSV

Size: 559.08 MB

Taruen

Finnish Public Domain 20th Century Literature Text Corpus

A 69.1M-word early 20th-century literature corpus from Project Lönnrot. Predominantly Finnish, with a supplementary Swedish collection.

License: CC0-1.0

Locale: fi, sv

Task: NLP

Format: TXT

Size: 205.76 MB

Community

Thorsten-Voice-44kHz-Full

German speech dataset (44.1 kHz, 38k+ files, ~40 hours), CC0 licensed, multi-style (neutral, emotional, dialect), for TTS research.

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WAV, PARQUET

Size: 7.99 GB

Community

Thorsten-Voice Dataset 2023.09 Hessisch

German regional dialect speech dataset (Hessisch, 2,108 phrases), CC0 licensed, 22,050 Hz mono WAV, for TTS and speech research.

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WAV, CSV

Size: 255.96 MB

Community

Thorsten-Voice Dataset 2022.10

German neutral speech dataset (12,450 phrases, 11+ hours), CC0 licensed, LJSpeech-compatible, for TTS research and development.

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WAV, CSV

Size: 1.30 GB

Community

Thorsten-Voice Dataset 2021.06 Emotional

German emotional speech dataset (2,400 recordings, 8 emotions), CC0 licensed, 22,050 Hz mono WAV, for TTS and speech research.

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WAV, CSV

Size: 380.80 MB

Kaltepetlahtol

Daily Expressions in Highland Puebla Nahuatl

A corpus of more than 1,000 common expressions in Highland Puebla Nahuatl, partially translated and annotated.

License: CC-BY-SA-4.0

Locale: azz

Task: NLP

Format: TSV

Size: 22.00 KB

MDC Curators

Cuentos en Mam leídos en voz alta

A collection of stories (audio and text) in the Mam language: 40 stories totaling 1 hour 23 minutes of audio, with 958 sentences (7,441 words) of text.

License: CC-BY-SA-4.0

Locale: mam

Task: ASR

Format: MP3, TSV

Size: 110.28 MB

MDC Curators

Cuentos en Kʼicheʼ leídos en voz alta

A collection of stories (audio and text) in the Kʼicheʼ language: 1 hour 51 minutes of audio, with 726 sentences (8,283 words) of text.

License: CC-BY-SA-4.0

Locale: quc

Task: ASR

Format: MP3, TSV

Size: 152.62 MB

MDC Community Concierge

CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes

Over 11 million words (14.4M tokens) from written, spoken, and electronic Welsh-language sources, taken from a range of genres, language varieties, and contexts.

License: CC-BY-NC-SA-4.0

Locale: cy

Task: NLP

Format: TXT, TSV

Size: 147.89 MB

MDC Curators

Finance Sentences - North American Spanish

A public domain corpus of approximately 80,000 sentences (1.3M tokens) of North American Spanish in the finance domain.

License: CC0-1.0

Locale: es-US

Task: NLP

Format: TSV, JSON

Size: 18.35 MB

Community

Thorsten-Voice Dataset 2021.02

German neutral speech dataset (22,668 phrases, 23+ hours), CC0 licensed, LJSpeech-compatible, for TTS research and development.

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WAV, CSV

Size: 2.55 GB

Community

Persian VOA Corpus 2003-2008

Persian (Farsi) VOA news articles from 2003-2008 in the original UTF-8 Perso-Arabic script, totaling 7.9 million words. Public domain.

License: Unlicense

Locale: fa

Task: NLP

Format: TXT

Size: 17.16 MB