Datasets

Institute of African Digital Humanities

Bamun-French Parallel Corpus 2.0

This dataset extends and updates the "Bamun-French Parallel Corpus 1.1", a parallel corpus of 4,444 lines in Bamun and French.
License: NOODL-1.0
Locale: bax
Task: MT
Format: TSV
Size: 184.29 KB

Common Voice

Common Voice Scripted Speech 25.0 - Kinyarwanda

A collection of read speech recordings in Kinyarwanda.
License: CC0-1.0
Locale: rw
Task: ASR
Format: MP3
Size: 57.18 GB

Common Voice

Common Voice Scripted Speech 25.0 - French

A collection of read speech recordings in French.
License: CC0-1.0
Locale: fr
Task: ASR
Format: MP3
Size: 28.39 GB

Common Voice

Common Voice Scripted Speech 25.0 - Spanish

A collection of read speech recordings in Spanish.
License: CC0-1.0
Locale: es
Task: ASR
Format: MP3
Size: 48.23 GB

Community

Araina Text Corpus (Occitan Aranese)

A text corpus in the Aranese variety of the Gascon dialect of Occitan.
License: CC0-1.0
Locale: oc
Task: LM
Format: txt
Size: 22.97 MB

Common Voice

Common Voice Scripted Speech 25.0 - Belarusian

A collection of read speech recordings in Belarusian.
License: CC0-1.0
Locale: be
Task: ASR
Format: MP3
Size: 36.21 GB

MDC Curators

Corpus de llenguatge ofensiu en català

This dataset consists of sentences tagged as offensive language in the 25.0 release of Mozilla Common Voice in Catalan.
License: CC-BY-SA-4.0
Locale: ca
Task: NLP
Format: TSV
Size: 57.35 KB

Common Voice

Common Voice Scripted Speech 25.0 - German

A collection of read speech recordings in German.
License: CC0-1.0
Locale: de
Task: ASR
Format: MP3
Size: 34.69 GB

Common Voice

Common Voice Scripted Speech 25.0 - Esperanto

A collection of read speech recordings in Esperanto.
License: CC0-1.0
Locale: eo
Task: ASR
Format: MP3
Size: 39.00 GB

Community

Oro_Word

Afaan Oromoo word-level speech dataset collected to support open-source speech recognition and text-to-speech technology.
License: CC0-1.0
Locale: om
Task: TTS
Format: .WAV, CSV
Size: 1.28 MB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Kalmyk Speech Corpus

A 3-hour supervised Speech-to-Text dataset for Kalmyk, a Mongolic language. Features sentence-level audio aligned with scientific text transcriptions.
License: CC-BY-NC-SA-4.0
Locale: xal
Task: ASR
Format: TSV, MP3
Size: 138.31 MB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Nganasan Speech Corpus

A 38.5-hour Speech-to-Text dataset for Nganasan, an endangered Samoyedic language. Features audio aligned with text transcriptions.
License: CC-BY-NC-SA-4.0
Locale: nio
Task: ASR
Format: TSV, MP3
Size: 1.29 GB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Evenki Speech Corpus

A 2.5-hour supervised Speech-to-Text dataset for Evenki, an endangered Tungusic language. Features sentence-level audio aligned with text transcriptions.
License: CC-BY-NC-SA-4.0
Locale: evn
Task: ASR
Format: TSV, MP3
Size: 103.03 MB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Dolgan Speech Corpus

A 13-hour supervised Speech-to-Text dataset for Dolgan, an endangered Turkic language. Features sentence-level audio aligned with text transcriptions.
License: CC-BY-NC-SA-4.0
Locale: dlg
Task: ASR
Format: TSV, MP3
Size: 583.34 MB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Kamas Speech Corpus

A 14-hour Speech-to-Text dataset for Kamas, an extinct Samoyedic language waiting to be revitalized. Features audio aligned with text transcriptions.
License: CC-BY-NC-SA-4.0
Locale: xas
Task: ASR
Format: TSV, MP3
Size: 376.64 MB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Selkup Speech Corpus

A 1.6-hour supervised Speech-to-Text dataset for Selkup, an endangered Samoyedic language. Features sentence-level audio-text alignments.
License: CC-BY-NC-SA-4.0
Locale: sel
Task: ASR
Format: TSV, MP3
Size: 45.46 MB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Enets Speech Corpus

A 3.7-hour supervised Speech-to-Text dataset for Enets, an endangered Samoyedic language. Features sentence-level audio aligned with text transcriptions.
License: CC-BY-NC-SA-4.0
Locale: enf, enh
Task: ASR
Format: TSV, MP3
Size: 140.56 MB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Nenets Speech Corpus

A supervised Speech-to-Text dataset for Nenets, a Samoyedic language. Features sentence-level audio aligned with Cyrillic text transcriptions.
License: CC-BY-NC-SA-4.0
Locale: yrk
Task: ASR
Format: TSV, MP3
Size: 8.35 MB

Common Voice

Common Voice Scripted Speech 25.0 - Bengali

A collection of read speech recordings in Bengali.
License: CC0-1.0
Locale: bn
Task: ASR
Format: MP3
Size: 24.84 GB

Common Voice

Common Voice Scripted Speech 25.0 - Chinese (China)

A collection of read speech recordings in Chinese (China).
License: CC0-1.0
Locale: zh-CN
Task: ASR
Format: MP3
Size: 21.38 GB

LocaleNLP

English Hausa Parallel Corpus

An English–Hausa dataset with 5,000 sentence pairs useful for machine translation and basic language processing tasks.
License: CC-BY-NC-4.0
Locale: eng, hau
Task: MT
Format: csv
Size: 164.32 KB

Anjuman e Katib

Persian Literature Corpus by Najwai Sukhan

A curated Persian literary corpus of ~1.26M tokens spanning literature, poetry, educational writing, and culturally significant texts.
License: CC-BY-NC-4.0
Locale: fas
Task: NLP
Format: TXT
Size: 38.62 MB

Community

Heroes English-Spanish Dubbed Movie Speech Corpus

7,000 single-speaker speech segments from the original and Spanish-dubbed versions of 21 episodes of the TV series Heroes.
License: CC-BY-SA-4.0
Locale: eng, spa
Task: NLP
Format: wav, csv, txt
Size: 1.68 GB

Common Voice

Common Voice Scripted Speech 25.0 - Swahili

A collection of read speech recordings in Swahili.
License: CC0-1.0
Locale: sw
Task: ASR
Format: MP3
Size: 20.87 GB