Datasets

English–Saraiki parallel corpus: 51,447 aligned sentence pairs (~0.89M words), translated by Kaleem Art Press for MT and Saraiki NLP research.

Saraiki-English Parallel Corpus

License: CC-BY-NC-4.0

Locale: mul

Task: MT

Format: CSV

Size: 1.92 MB

MEDIAMEN

English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)

This parallel sentences corpus contains 30,405 aligned sentence pairs totaling approximately 0.62 million tokens, curated from the archival materials of Mediamen (Advertising Agency). The sentences were professionally translated from English into Punjabi (Shahmukhi) and are intended to support machine translation, linguistic research, and Punjabi language technology development, particularly for real-world and contemporary language use.

License: CC-BY-NC-4.0

Locale: en-PK, pnb

Task: MT

Format: CSV

Size: 1.08 MB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part1

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.

License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 7.68 GB

A collection of spontaneous responses to questions in Shona.

Common Voice Spontaneous Speech 3.0 - Shona

License: CC0-1.0

Locale: sn

Task: ASR

Format: MP3

Size: 1.53 MB

Text to speech dataset for Slovak, female speaker, approximately 2 hours of read speech.

Lili 1.0

License: CC0-1.0

Locale: sk-SK

Task: TTS

Format: WEBM

Size: 72.38 MB

This corpus contains approximately one million tokens of Saraiki text curated over the past ten years by Kaleem Art Press. It features a wide range of literary genres, including stories, short stories, novels, fiction, non-fiction, travelogues, poetry, biographies, and historical writings. All content is shared with full author approval. The dataset is intended to support linguistic research, Saraiki language technology development, and the preservation of cultural and literary heritage.

Kaleem Art Press Saraiki Literature Corpus

License: CC-BY-NC-4.0

Locale: skr

Task: OTH

Format: TXT

Size: 1.84 MB

A collection of spontaneous responses to questions in Esperanto.

Common Voice Spontaneous Speech 3.0 - Esperanto

License: CC0-1.0

Locale: eo

Task: ASR

Format: MP3

Size: 12.51 MB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part3

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.

License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 1.33 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part2

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.

License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 8.07 GB

Baloch Publishers Multan

Baloch Publishers Saraiki Literature Corpus

This corpus is a collection of one million tokens of Saraiki-language text. The data was produced under Baloch Publishers over the last ten years. The corpus contains works of literature, including dictionaries, short stories, novels, fiction, non-fiction, and travelogues. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: skr

Task: NLP

Format: TXT

Size: 2.04 MB

A collection of spontaneous responses to questions in Ruuli.

Common Voice Spontaneous Speech 3.0 - Ruuli

License: CC0-1.0

Locale: ruc

Task: ASR

Format: MP3

Size: 365.95 MB

A collection of spontaneous responses to questions in Sinhala.

Common Voice Spontaneous Speech 3.0 - Sinhala

License: CC0-1.0

Locale: si

Task: ASR

Format: MP3

Size: 2.52 MB

Forum for Language Initiatives

Kohistani Shina Word List

A UTF-8 encoded Kohistani Shina dictionary and word list corpus with lexical entries, meanings, and annotations for linguistic research and NLP use.

License: CC-BY-NC-4.0

Locale: plk

Task: NLP

Format: TXT

Size: 394.05 KB

MDC Curators

Finance Sentences - North American Spanish

A public domain corpus of approximately 80,000 sentences (1.3M tokens) of North American Spanish in the finance domain.

License: CC0-1.0

Locale: es-US

Task: NLP

Format: TSV, JSON

Size: 18.35 MB

A collection of spontaneous responses to questions in Frisian.

Common Voice Spontaneous Speech 3.0 - Frisian

License: CC0-1.0

Locale: fy-NL

Task: ASR

Format: MP3

Size: 323.25 KB

Text to speech dataset for Russian, male speaker, approximately 2 hours of read speech.

Dmitri 1.0

License: CC0-1.0

Locale: ru-RU

Task: TTS

Format: WEBM

Size: 96.63 MB

This corpus contains multiple Saraiki-language books of stories, short stories, novels, and travelogues, along with sentences and a collection of articles.

Saraiki Literature Corpus

License: CC-BY-NC-4.0

Locale: skr

Task: OTH

Format: TXT

Size: 1.84 MB

Text to speech dataset for Hungarian, female speaker, approximately 1.5 hours of read speech.

Anna 1.0

License: CC0-1.0

Locale: hu-HU

Task: TTS

Format: WEBM

Size: 95.27 MB

This dataset is a multilingual parallel sentences corpus containing 6,465 aligned sentence units with approximately 0.98 million words, curated from Kaleem Art Press archives. It includes parallel religious text data in Arabic, Urdu, Saraiki (standard and dialectal), Punjabi (Shahmukhi), and English, supporting research in machine translation, comparative linguistics, digital humanities, and low-resource language studies.

Multilingual Religious Parallel Corpus (Kaleem Art Press)

License: CC-BY-SA-4.0

Locale: mul

Task: MT

Format: CSV

Size: 2.27 MB

A collection of spontaneous responses to questions in Kenyah.

Common Voice Spontaneous Speech 3.0 - Kenyah

License: CC0-1.0

Locale: xkl

Task: ASR

Format: MP3

Size: 212.73 MB

Institute of African Digital Humanities

Bamun-French Parallel Corpus 1.1

This dataset is an updated version of the "Bamun-French Parallel Corpus", a parallel corpus of texts in Bamun (Shupamem) and French.

License: NOODL-1.0

Locale: bax

Task: MT

Format: TSV

Size: 99.78 KB

Text to speech dataset for German, female speaker, approximately 2 hours of read speech.

Kerstin 1.0

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WEBM

Size: 132.05 MB