Datasets

Taruen

Finnish Public Domain 20th Century Literature Text Corpus

A 69.1M-word early 20th-century literature corpus from Project Lönnrot. Predominantly Finnish, with a supplementary Swedish collection.
License: CC0-1.0

Locale: fi, sv

Task: NLP

Format: TXT

Size: 205.76 MB

Kaleem Art Press

Saraiki-English Parallel Corpus

English–Saraiki parallel corpus: 51,447 aligned sentence pairs (~0.89M words), translated by Kaleem Art Press for MT and Saraiki NLP research.
License: CC-BY-NC-4.0

Locale: mul

Task: MT

Format: CSV

Size: 1.92 MB

MEDIAMEN

English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)

This parallel sentence corpus contains 30,405 aligned sentence pairs totaling approximately 0.62 million tokens, curated from the archival materials of Mediamen (Advertising Agency). The sentences were professionally translated from English into Punjabi (Shahmukhi) and are intended to support machine translation, linguistic research, and Punjabi language technology development, particularly for real-world and contemporary language use.
License: CC-BY-NC-4.0

Locale: en-PK, pnb

Task: MT

Format: CSV

Size: 1.08 MB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part1

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.
License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 7.68 GB

Common Voice

Common Voice Spontaneous Speech 3.0 - Shona

A collection of spontaneous responses to questions in Shona.
License: CC0-1.0

Locale: sn

Task: ASR

Format: MP3

Size: 1.53 MB

Open Home Foundation

Lili 1.0

Text-to-speech dataset for Slovak, female speaker, approximately 2 hours of read speech.
License: CC0-1.0

Locale: sk-SK

Task: TTS

Format: WEBM

Size: 72.38 MB

Kaleem Art Press

Kaleem Art Press Saraiki Literature Corpus

This corpus contains approximately one million tokens of Saraiki text curated over the past ten years by Kaleem Art Press. It features a wide range of literary genres, including stories, short stories, novels, fiction, non-fiction, travelogues, poetry, biographies, and historical writings. All content is shared with full author approval. The dataset is intended to support linguistic research, Saraiki language technology development, and the preservation of cultural and literary heritage.
License: CC-BY-NC-4.0

Locale: skr

Task: OTH

Format: TXT

Size: 1.84 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Esperanto

A collection of spontaneous responses to questions in Esperanto.
License: CC0-1.0

Locale: eo

Task: ASR

Format: MP3

Size: 12.51 MB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part3

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.
License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 1.33 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part2

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.
License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 8.07 GB

Baloch Publishers Multan

Baloch Publishers Saraiki Literature Corpus

This corpus is a collection of one million tokens of Saraiki-language text. The data was produced under Baloch Publishers over the last ten years. The corpus contains works of literature, including dictionaries, short stories, novels, fiction, non-fiction, and travelogues. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.
License: CC-BY-NC-4.0

Locale: skr

Task: NLP

Format: TXT

Size: 2.04 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Ruuli

A collection of spontaneous responses to questions in Ruuli.
License: CC0-1.0

Locale: ruc

Task: ASR

Format: MP3

Size: 365.95 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Sinhala

A collection of spontaneous responses to questions in Sinhala.
License: CC0-1.0

Locale: si

Task: ASR

Format: MP3

Size: 2.52 MB

Forum for Language Initiatives

Kohistani Shina Word List

A UTF-8 encoded Kohistani Shina dictionary and word list corpus with lexical entries, meanings, and annotations for linguistic research and NLP use.
License: CC-BY-NC-4.0

Locale: plk

Task: NLP

Format: TXT

Size: 394.05 KB

MDC Curators

Finance Sentences - North American Spanish

A public domain corpus of approximately 80,000 sentences (1.3M tokens) of North American Spanish in the finance domain.
License: CC0-1.0

Locale: es-US

Task: NLP

Format: TSV, JSON

Size: 18.35 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Frisian

A collection of spontaneous responses to questions in Frisian.
License: CC0-1.0

Locale: fy-NL

Task: ASR

Format: MP3

Size: 323.25 KB

Open Home Foundation

Dmitri 1.0

Text-to-speech dataset for Russian, male speaker, approximately 2 hours of read speech.
License: CC0-1.0

Locale: ru-RU

Task: TTS

Format: WEBM

Size: 96.63 MB

Kaleem Art Press

Saraiki Literature Corpus

This corpus contains multiple Saraiki-language books of stories, short stories, novels, and travelogues, along with sentences and a collection of articles.
License: CC-BY-NC-4.0

Locale: skr

Task: OTH

Format: TXT

Size: 1.84 MB

Open Home Foundation

Anna 1.0

Text-to-speech dataset for Hungarian, female speaker, approximately 1.5 hours of read speech.
License: CC0-1.0

Locale: hu-HU

Task: TTS

Format: WEBM

Size: 95.27 MB

Kaleem Art Press

Multilingual Religious Parallel Corpus (Kaleem Art Press)

This dataset is a multilingual parallel sentence corpus containing 6,465 aligned sentence units (approximately 0.98 million words), curated from the Kaleem Art Press archives. It includes parallel religious text data in Arabic, Urdu, Saraiki (standard and dialectal), Punjabi (Shahmukhi), and English, supporting research in machine translation, comparative linguistics, digital humanities, and low-resource language studies.
License: CC-BY-SA-4.0

Locale: mul

Task: MT

Format: CSV

Size: 2.27 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Kenyah

A collection of spontaneous responses to questions in Kenyah.
License: CC0-1.0

Locale: xkl

Task: ASR

Format: MP3

Size: 212.73 MB

Institute of African Digital Humanities

Bamun-French Parallel Corpus 1.1

This dataset is an updated version of the "Bamun-French Parallel Corpus", a parallel corpus of texts in Bamun (Shupament) and French.
License: NOODL-1.0

Locale: bax

Task: MT

Format: TSV

Size: 99.78 KB

Open Home Foundation

Kerstin 1.0

Text-to-speech dataset for German, female speaker, approximately 2 hours of read speech.
License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WEBM

Size: 132.05 MB

Open Home Foundation

Mihai 1.0

Text-to-speech dataset for Romanian, male speaker, approximately 2 hours of read speech.
License: CC0-1.0

Locale: ro-RO

Task: TTS

Format: WEBM

Size: 66.31 MB