Datasets

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Nganasan Speech Corpus

A 38.5-hour Speech-to-Text dataset for Nganasan, an endangered Samoyedic language. Features audio aligned with text transcriptions.

License: CC-BY-NC-SA-4.0

Locale: nio

Task: ASR

Format: TSV, MP3

Size: 1.29 GB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Evenki Speech Corpus

A 2.5-hour supervised Speech-to-Text dataset for Evenki, an endangered Tungusic language. Features sentence-level audio aligned with text transcriptions.

License: CC-BY-NC-SA-4.0

Locale: evn

Task: ASR

Format: TSV, MP3

Size: 103.03 MB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Dolgan Speech Corpus

A 13-hour supervised Speech-to-Text dataset for Dolgan, an endangered Turkic language. Features sentence-level audio aligned with text transcriptions.

License: CC-BY-NC-SA-4.0

Locale: dlg

Task: ASR

Format: TSV, MP3

Size: 583.34 MB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Kamas Speech Corpus

A 14-hour Speech-to-Text dataset for Kamas, an extinct Samoyedic language awaiting revitalization. Features audio aligned with text transcriptions.

License: CC-BY-NC-SA-4.0

Locale: xas

Task: ASR

Format: TSV, MP3

Size: 376.64 MB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Selkup Speech Corpus

A 1.6-hour supervised Speech-to-Text dataset for Selkup, an endangered Samoyedic language. Features sentence-level audio-text alignments.

License: CC-BY-NC-SA-4.0

Locale: sel

Task: ASR

Format: TSV, MP3

Size: 45.46 MB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Enets Speech Corpus

A 3.7-hour supervised Speech-to-Text dataset for Enets, an endangered Samoyedic language. Features sentence-level audio aligned with text transcriptions.

License: CC-BY-NC-SA-4.0

Locale: enf, enh

Task: ASR

Format: TSV, MP3

Size: 140.56 MB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Nenets Speech Corpus

A supervised Speech-to-Text dataset for Nenets, a Samoyedic language. Features sentence-level audio aligned with Cyrillic text transcriptions.

License: CC-BY-NC-SA-4.0

Locale: yrk

Task: ASR

Format: TSV, MP3

Size: 8.35 MB

Common Voice

Common Voice Scripted Speech 25.0 - Bengali

A collection of read speech recordings in Bengali.

License: CC0-1.0

Locale: bn

Task: ASR

Format: MP3

Size: 24.84 GB

Common Voice

Common Voice Scripted Speech 25.0 - Chinese (China)

A collection of read speech recordings in Chinese (China).

License: CC0-1.0

Locale: zh-CN

Task: ASR

Format: MP3

Size: 21.38 GB

LocaleNLP

English Hausa Parallel Corpus

An English–Hausa dataset of 5,000 sentence pairs, useful for machine translation and basic language processing tasks.

License: CC-BY-NC-4.0

Locale: eng, hau

Task: MT

Format: CSV

Size: 164.32 KB

Anjuman e Katib

Persian Literature Corpus by Najwai Sukhan

A curated Persian literary corpus of ~1.26M tokens spanning literature, poetry, educational writing, and culturally significant texts.

License: CC-BY-NC-4.0

Locale: fas

Task: NLP

Format: TXT

Size: 38.62 MB

Community

Heroes English-Spanish Dubbed Movie Speech Corpus

7,000 single-speaker speech segments from the original and Spanish-dubbed versions of 21 episodes of the TV series Heroes.

License: CC-BY-SA-4.0

Locale: eng, spa

Task: NLP

Format: WAV, CSV, TXT

Size: 1.68 GB

Common Voice

Common Voice Scripted Speech 25.0 - Swahili

A collection of read speech recordings in Swahili.

License: CC0-1.0

Locale: sw

Task: ASR

Format: MP3

Size: 20.87 GB

Common Voice

Common Voice Scripted Speech 25.0 - Kabyle

A collection of read speech recordings in Kabyle.

License: CC0-1.0

Locale: kab

Task: ASR

Format: MP3

Size: 17.43 GB

Common Voice

Common Voice Scripted Speech 25.0 - Basque

A collection of read speech recordings in Basque.

License: CC0-1.0

Locale: eu

Task: ASR

Format: MP3

Size: 14.48 GB

Common Voice

Common Voice Scripted Speech 25.0 - Japanese

A collection of read speech recordings in Japanese.

License: CC0-1.0

Locale: ja

Task: ASR

Format: MP3

Size: 14.34 GB

Common Voice

Common Voice Scripted Speech 25.0 - Luganda

A collection of read speech recordings in Luganda.

License: CC0-1.0

Locale: lg

Task: ASR

Format: MP3

Size: 11.06 GB

Common Voice

Common Voice Scripted Speech 25.0 - Czech

A collection of read speech recordings in Czech.

License: CC0-1.0

Locale: cs

Task: ASR

Format: MP3

Size: 5.56 GB

Common Voice

Common Voice Scripted Speech 25.0 - Urdu

A collection of read speech recordings in Urdu.

License: CC0-1.0

Locale: ur

Task: ASR

Format: MP3

Size: 5.78 GB

Common Voice

Common Voice Scripted Speech 25.0 - Georgian

A collection of read speech recordings in Georgian.

License: CC0-1.0

Locale: ka

Task: ASR

Format: MP3

Size: 6.37 GB

Common Voice

Common Voice Scripted Speech 25.0 - Thai

A collection of read speech recordings in Thai.

License: CC0-1.0

Locale: th

Task: ASR

Format: MP3

Size: 8.38 GB

Common Voice

Common Voice Scripted Speech 25.0 - Russian

A collection of read speech recordings in Russian.

License: CC0-1.0

Locale: ru

Task: ASR

Format: MP3

Size: 6.55 GB

Common Voice

Common Voice Scripted Speech 25.0 - Italian

A collection of read speech recordings in Italian.

License: CC0-1.0

Locale: it

Task: ASR

Format: MP3

Size: 9.71 GB

Common Voice

Common Voice Scripted Speech 25.0 - Galician

A collection of read speech recordings in Galician.

License: CC0-1.0

Locale: gl

Task: ASR

Format: MP3

Size: 7.81 GB