Datasets

Open Home Foundation

Imre 1.0

Text to speech dataset for Hungarian, male speaker, approximately 1.5 hours of read speech.
License: CC0-1.0

Locale: hu-HU

Task: TTS

Format: WEBM

Size: 99.60 MB

Open Home Foundation

Gosia 1.0

Text to speech dataset for Polish, female speaker, approximately 2 hours of read speech.
License: CC0-1.0

Locale: pl-PL

Task: TTS

Format: WEBM

Size: 39.75 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Kenyi

A collection of spontaneous responses to questions in Kenyi.
License: CC0-1.0

Locale: lke

Task: ASR

Format: MP3

Size: 254.56 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Croatian

A collection of spontaneous responses to questions in Croatian.
License: CC0-1.0

Locale: hr

Task: ASR

Format: MP3

Size: 285.11 KB

MEDIAMEN

Mediamen Punjabi Literature Corpus

This corpus is a collection of one million tokens of the Western Punjabi language. The data was produced under the Mediamen publishing agency over the last ten years. The corpus contains works of literature, including short stories, novels, fiction, non-fiction, and dramas. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.
License: CC-BY-NC-4.0

Locale: pnb

Task: NLP

Format: TXT

Size: 1.82 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Kelabit

A collection of spontaneous responses to questions in Kelabit.
License: CC0-1.0

Locale: kzi

Task: ASR

Format: MP3

Size: 194.33 MB

Open Home Foundation

Pim 1.0

Text to speech dataset for Dutch, male speaker, approximately 2 hours of read speech.
License: CC0-1.0

Locale: nl-NL

Task: TTS

Format: WEBM

Size: 108.08 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Melanau

A collection of spontaneous responses to questions in Melanau.
License: CC0-1.0

Locale: mel

Task: ASR

Format: MP3

Size: 209.11 MB

Open Home Foundation

Berta 1.0

Text to speech dataset for Hungarian, female speaker, approximately 1 hour of read speech.
License: CC0-1.0

Locale: hu-HU

Task: TTS

Format: FLAC

Size: 209.52 MB

Chishti Sons

Chishti Sons Punjabi Literature Corpus

This corpus is a collection of more than one million tokens of the Western Punjabi language. The data was produced under the Chishti Sons publishing agency. The corpus contains works of literature, including short stories, stories, fiction, non-fiction, and other literary books. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.
License: CC-BY-NC-4.0

Locale: pnb

Task: NLP

Format: TXT

Size: 1.65 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Bahasa Malay

A collection of spontaneous responses to questions in Bahasa Malay.
License: CC0-1.0

Locale: ms-MY

Task: ASR

Format: MP3

Size: 126.60 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Tashlhiyt

A collection of spontaneous responses to questions in Tashlhiyt.
License: CC0-1.0

Locale: shi

Task: ASR

Format: MP3

Size: 35.52 MB

Collaborative Action For Research & Development (CARD)

IBT Torwali Wordlist

The IBT Torwali Wordlist contains approximately 20,000 unique entries in Torwali (ISO 639-3: trw), an under-documented Indo-Aryan language spoken in northern Pakistan. The dataset comprises standardized lexical entries covering core vocabulary, function words, and culturally salient terms, with consistent orthography and normalization suitable for linguistic and computational use. Entries are aligned with English and Urdu glosses and include part-of-speech tags.
License: CC-BY-SA-4.0

Locale: trw

Task: NLP

Format: CSV

Size: 312.87 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Sabah Malay

A collection of spontaneous responses to questions in Sabah Malay.
License: CC0-1.0

Locale: msi

Task: ASR

Format: MP3

Size: 277.94 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Alsatian

A collection of spontaneous responses to questions in Alsatian.
License: CC0-1.0

Locale: gsw

Task: ASR

Format: MP3

Size: 116.10 MB

Community

Thorsten-Voice Dataset 2023.09 Hessisch

German regional dialect speech dataset (Hessisch, 2,108 phrases), CC0 licensed, 22,050 Hz mono WAV, for TTS and speech research.
License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WAV,CSV

Size: 255.96 MB

Balochistan Educational and Cultural Organization

Talar (تلار) Barahui Magazine Corpus

A ~150,000-word Brahui corpus from the monthly magazine Talar, covering editorials, essays, fiction, poetry, and socio-cultural commentary.
License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 317.22 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Kuku

A collection of spontaneous responses to questions in Kuku.
License: CC0-1.0

Locale: ukv

Task: ASR

Format: MP3

Size: 238.25 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - German

A collection of spontaneous responses to questions in German.
License: CC0-1.0

Locale: de

Task: ASR

Format: MP3

Size: 23.28 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Tudaga

A collection of spontaneous responses to questions in Tudaga.
License: CC0-1.0

Locale: tuq

Task: ASR

Format: MP3

Size: 13.61 MB

Balochistan Educational and Cultural Organization

Brahui Research Work Corpus

This corpus consists of research works, academic theses, and scholarly papers, comprising approximately 185,000 tokens. It covers a range of academic topics and formal registers, reflecting standardized writing practices and disciplinary conventions. The corpus is intended to support linguistic analysis, discourse studies, and computational research on academic and research-oriented text.
License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 1.13 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Amba

A collection of spontaneous responses to questions in Amba.
License: CC0-1.0

Locale: rwm

Task: ASR

Format: MP3

Size: 266.60 MB

Kaleem Art Press

Jhoke Publisher Multan’s Saraiki Newspaper Corpus

Jhoke Publisher Multan’s Saraiki Newspaper Corpus (~1.25M tokens) is a normalized UTF-8 collection of Saraiki newspaper text.
License: CC-BY-NC-4.0

Locale: skr

Task: NLP

Format: TXT

Size: 2.30 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Georgian

A collection of spontaneous responses to questions in Georgian.
License: CC0-1.0

Locale: ka

Task: ASR

Format: MP3

Size: 11.61 MB