Datasets

Open Home Foundation

Imre 1.0

Text to speech dataset for Hungarian, male speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: hu-HU

Task: TTS

Format: WEBM

Size: 99.60 MB

Open Home Foundation

Gosia 1.0

Text to speech dataset for Polish, female speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: pl-PL

Task: TTS

Format: WEBM

Size: 39.75 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Kenyi

A collection of spontaneous responses to questions in Kenyi.

License: CC0-1.0

Locale: lke

Task: ASR

Format: MP3

Size: 254.56 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Croatian

A collection of spontaneous responses to questions in Croatian.

License: CC0-1.0

Locale: hr

Task: ASR

Format: MP3

Size: 285.11 KB

MEDIAMEN

Mediamen Punjabi Literature Corpus

This corpus is a collection of one million tokens of Western Punjabi. The data was produced under the Mediamen publishing agency over the last ten years. The corpus contains works of literature, including short stories, novels, fiction, non-fiction, and dramas. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: pnb

Task: NLP

Format: TXT

Size: 1.82 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Kelabit

A collection of spontaneous responses to questions in Kelabit.

License: CC0-1.0

Locale: kzi

Task: ASR

Format: MP3

Size: 194.33 MB

Open Home Foundation

Pim 1.0

Text to speech dataset for Dutch, male speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: nl-NL

Task: TTS

Format: WEBM

Size: 108.08 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Melanau

A collection of spontaneous responses to questions in Melanau.

License: CC0-1.0

Locale: mel

Task: ASR

Format: MP3

Size: 209.11 MB

Open Home Foundation

Berta 1.0

Text to speech dataset for Hungarian, female speaker, approximately 1 hour of read speech.

License: CC0-1.0

Locale: hu-HU

Task: TTS

Format: FLAC

Size: 209.52 MB

Chishti Sons

Chishti Sons Punjabi Literature Corpus

This corpus is a collection of more than one million tokens of Western Punjabi. The data was produced under the Chishti Sons publishing agency. The corpus contains works of literature, including short stories, stories, fiction, non-fiction, and other literary books. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: pnb

Task: NLP

Format: TXT

Size: 1.65 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Bahasa Malay

A collection of spontaneous responses to questions in Bahasa Malay.

License: CC0-1.0

Locale: ms-MY

Task: ASR

Format: MP3

Size: 126.60 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Tashlhiyt

A collection of spontaneous responses to questions in Tashlhiyt.

License: CC0-1.0

Locale: shi

Task: ASR

Format: MP3

Size: 35.52 MB

Collaborative Action For Research & Development (CARD)

IBT Torwali Wordlist

The IBT Torwali Wordlist contains approximately 20,000 unique entries in Torwali (ISO 639-3: trw), an under-documented Indo-Aryan language spoken in northern Pakistan. The dataset comprises standardized lexical entries covering core vocabulary, function words, and culturally salient terms, with consistent orthography and normalization suitable for linguistic and computational use. Entries are aligned with English and Urdu glosses and include part-of-speech tags.

License: CC-BY-SA-4.0

Locale: trw

Task: NLP

Format: CSV

Size: 312.87 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Sabah Malay

A collection of spontaneous responses to questions in Sabah Malay.

License: CC0-1.0

Locale: msi

Task: ASR

Format: MP3

Size: 277.94 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Alsatian

A collection of spontaneous responses to questions in Alsatian.

License: CC0-1.0

Locale: gsw

Task: ASR

Format: MP3

Size: 116.10 MB

Community

Thorsten-Voice Dataset 2023.09 Hessisch

German regional dialect speech dataset (Hessisch, 2,108 phrases), CC0 licensed, 22,050 Hz mono WAV, for TTS and speech research.

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WAV,CSV

Size: 255.96 MB

Balochistan Educational and Cultural Organization

Talar (تلار) Barahui Magazine Corpus

A ~150,000-word Brahui corpus from the monthly magazine Talar, covering editorials, essays, fiction, poetry, and socio-cultural commentary.

License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 317.22 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Kuku

A collection of spontaneous responses to questions in Kuku.

License: CC0-1.0

Locale: ukv

Task: ASR

Format: MP3

Size: 238.25 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - German

A collection of spontaneous responses to questions in German.

License: CC0-1.0

Locale: de

Task: ASR

Format: MP3

Size: 23.28 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Tudaga

A collection of spontaneous responses to questions in Tudaga.

License: CC0-1.0

Locale: tuq

Task: ASR

Format: MP3

Size: 13.61 MB

Balochistan Educational and Cultural Organization

Brahui Research Work Corpus

This corpus consists of research works, academic theses, and scholarly papers, comprising approximately 185,000 tokens. It covers a range of academic topics and formal registers, reflecting standardized writing practices and disciplinary conventions. The corpus is intended to support linguistic analysis, discourse studies, and computational research on academic and research-oriented text.

License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 1.13 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Amba

A collection of spontaneous responses to questions in Amba.

License: CC0-1.0

Locale: rwm

Task: ASR

Format: MP3

Size: 266.60 MB

Kaleem Art Press

Jhoke Publisher Multan’s Saraiki Newspaper Corpus

Jhoke Publisher Multan’s Saraiki Newspaper Corpus (~1.25M tokens) is a normalized UTF-8 collection of Saraiki newspaper text.

License: CC-BY-NC-4.0

Locale: skr

Task: NLP

Format: TXT

Size: 2.30 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Georgian

A collection of spontaneous responses to questions in Georgian.

License: CC0-1.0

Locale: ka

Task: ASR

Format: MP3

Size: 11.61 MB