Datasets

Community

Kokoro Speech Dataset

Kokoro Speech Dataset is a public domain Japanese speech dataset (https://github.com/kaiidams/Kokoro-Speech-Dataset).
License: LibriVox
Locale: ja
Task: TTS
Format: FLAC
Size: 3.98 GB

Community

Sundanese TTS

This dataset uses the Priangan dialect of West Java with Indonesian code-mixing and code-switching.
License: CC-BY-SA-4.0
Locale: sun
Task: TTS
Format: WEBM, TSV
Size: 298.10 MB

MDC Community Concierge

Bangor Miami Spanish-English Corpus

Spanish-English bilingual speech corpus with 35 hours of recorded audio and 240,000 words.
License: GPL-3.0
Locale: es-US, en-US
Task: ASR
Format: MP3, CHA, TSV
Size: 1.12 GB

Keblagh e Azergi

Elkhani Hazargi Literature Corpus

Hazargi literary corpus (~0.5M tokens) of poetry, folklore, and prose texts representing Hazara linguistic and cultural heritage.
License: CC-BY-NC-4.0
Locale: haz
Task: NLP
Format: TXT
Size: 2.46 MB

Aim Foundation

Dari Literature Corpus by Anjuman e Adabi Nayestan

A ~1M-token Dari (Afghan Persian) literary corpus compiled by Anjuman e Adabi Nayestan, covering prose, poetry, and cultural texts in Perso-Arabic script.
License: CC-BY-NC-4.0
Locale: prs
Task: NLP
Format: TXT
Size: 12.67 MB

Collaborative Action For Research & Development (CARD)

IBT Torwali Wordlist

The IBT Torwali Wordlist contains approximately 20,000 unique entries in Torwali (ISO 639-3: trw), an under-documented Indo-Aryan language spoken in northern Pakistan. The dataset comprises standardized lexical entries covering core vocabulary, function words, and culturally salient terms, with consistent orthography and normalization suitable for linguistic and computational use. Entries are aligned with English and Urdu glosses and include part-of-speech tags.
License: CC-BY-SA-4.0
Locale: trw
Task: NLP
Format: CSV
Size: 312.87 KB

MDC Community Concierge

Bangor Siarad Welsh-English Corpus

Welsh-English bilingual speech corpus with 40 hours of recorded audio and transcriptions totaling 450,000 words.
License: GPL-3.0
Locale: cym
Task: ASR
Format: MP3, CHA, TSV
Size: 2.13 GB

MDC Community Concierge

Bangor Patagonia Welsh-Spanish Corpus

A Welsh-Spanish corpus containing around 195,000 words.
License: GPL-3.0
Locale: cym, spa
Task: ASR
Format: MP3, CHA, TSV
Size: 988.02 MB

Kaleem Art Press

Saraiki-English Parallel Corpus

English–Saraiki parallel corpus: 51,447 aligned sentence pairs (~0.89M words), translated by Kaleem Art Press for MT and Saraiki NLP research.
License: CC-BY-NC-4.0
Locale: mul
Task: MT
Format: CSV
Size: 1.92 MB

Kaleem Art Press

Jhoke Publisher Multan’s Saraiki Newspaper Corpus

Jhoke Publisher Multan’s Saraiki Newspaper Corpus (~1.25M tokens) is a normalized UTF-8 collection of Saraiki newspaper text.
License: CC-BY-NC-4.0
Locale: skr
Task: NLP
Format: TXT
Size: 2.30 MB

Institute of African Digital Humanities

Mada-French Parallel Corpus 1.0

This dataset comprises a parallel corpus of 2,154 lines of translations of literary texts from Mada (mxu) to French.
License: NOODL-1.0
Locale: mxu
Task: TTS
Format: TSV
Size: 122.37 KB

Community

Javanese TTS of Banyumasan Dialect

This dataset contains various topics about everyday life. The topics include society, environment, media, education, culture, health, etc.
License: CC-BY-SA-4.0
Locale: jav
Task: TTS
Format: WEBM, TSV
Size: 559.08 MB

Taruen

Finnish Public Domain 20th Century Literature Text Corpus

A 69.1M-word early 20th-century literature corpus from Project Lönnrot. Predominantly Finnish, with a supplementary Swedish collection.
License: CC0-1.0
Locale: fi, sv
Task: NLP
Format: TXT
Size: 205.76 MB

Community

Thorsten-Voice-44kHz-Full

German speech dataset (44.1 kHz, 38k+ files, ~40 hours), CC0 licensed, multi-style (neutral, emotional, dialect), for TTS research.
License: CC0-1.0
Locale: de-DE
Task: TTS
Format: WAV, PARQUET
Size: 7.99 GB

Community

Thorsten-Voice Dataset 2023.09 Hessisch

German regional dialect speech dataset (Hessisch, 2,108 phrases), CC0 licensed, 22,050 Hz mono WAV, for TTS and speech research.
License: CC0-1.0
Locale: de-DE
Task: TTS
Format: WAV, CSV
Size: 255.96 MB

Community

Thorsten-Voice Dataset 2022.10

German neutral speech dataset (12,450 phrases, 11+ hours), CC0 licensed, LJSpeech-compatible, for TTS research and development.
License: CC0-1.0
Locale: de-DE
Task: TTS
Format: WAV, CSV
Size: 1.30 GB

Community

Thorsten-Voice Dataset 2021.06 Emotional

German emotional speech dataset (2,400 recordings, 8 emotions), CC0 licensed, 22,050 Hz mono WAV, for TTS and speech research.
License: CC0-1.0
Locale: de-DE
Task: TTS
Format: WAV, CSV
Size: 380.80 MB

Kaltepetlahtol

Daily Expressions in Highland Puebla Nahuatl

A corpus of more than 1,000 common expressions in Highland Puebla Nahuatl, partially translated and annotated.
License: CC-BY-SA-4.0
Locale: azz
Task: NLP
Format: TSV
Size: 22.00 KB

MDC Curators

Cuentos en Mam leídos en voz alta

A collection of stories (audio and text) in the Mam language. 40 stories, a total of 1 hour 23 minutes of audio with 958 sentences (7,441 words) of text.
License: CC-BY-SA-4.0
Locale: mam
Task: ASR
Format: MP3, TSV
Size: 110.28 MB

MDC Curators

Cuentos en Kʼicheʼ leídos en voz alta

A collection of stories (audio and text) in the Kʼicheʼ language. 1 hour 51 minutes of audio with 726 sentences (8,283 words) of text.
License: CC-BY-SA-4.0
Locale: quc
Task: ASR
Format: MP3, TSV
Size: 152.62 MB

MDC Community Concierge

CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes

Over 11 million words (14.4M tokens) from written, spoken, and electronic Welsh-language sources, drawn from a range of genres, language varieties, and contexts.
License: CC-BY-NC-SA-4.0
Locale: cy
Task: NLP
Format: TXT, TSV
Size: 147.89 MB

MDC Curators

Finance Sentences - North American Spanish

A public domain corpus of approximately 80,000 sentences (1.3M tokens) of North American Spanish in the finance domain.
License: CC0-1.0
Locale: es-US
Task: NLP
Format: TSV, JSON
Size: 18.35 MB

Community

Thorsten-Voice Dataset 2021.02

German neutral speech dataset (22,668 phrases, 23+ hours), CC0 licensed, LJSpeech-compatible, for TTS research and development.
License: CC0-1.0
Locale: de-DE
Task: TTS
Format: WAV, CSV
Size: 2.55 GB

Community

Persian VOA Corpus 2003-2008

Persian (Farsi) VOA news articles 2003-2008, in original UTF-8 Perso-Arabic script, 7.9 million words, license: public domain.
License: Unlicense
Locale: fa
Task: NLP
Format: TXT
Size: 17.16 MB