Datasets

MDC Curators

Sentence translation difficulty in English - BOUQuET

A collection of English sentences from the BOUQuET benchmark (1990 sentences in total), annotated with sentence translation difficulty scores.
License: CC-BY-SA-4.0
Locale: en
Task: NLP
Format: TSV
Size: 85.61 KB

Institute of African Digital Humanities

Bamun-TTS-Dataset

This dataset consists of segmented Bamun (Shupamem) speech audio clips paired with text, designed for Text-to-Speech (TTS) applications.
License: NOODL-1.0
Locale: bax
Task: TTS
Format: MP3, TSV
Size: 219.97 MB

GriôTech

Territórios Digitais

Dataset on community-driven responses to disinformation and AI in marginalized territories in Brazil, based on participatory research.
License: CC-BY-4.0
Locale: pt, en
Task: N/A
Format: DOCX, PDF, XLSX
Size: 4.24 MB

Taruen

Chuvash TTS

A ~5-hour speech dataset for Chuvash Text-to-Speech (TTS) research, featuring a single female speaker reading news and digits at a rapid tempo.
License: CC-BY-SA-4.0
Locale: cv
Task: TTS
Format: PARQUET
Size: 854.02 MB

RFERL

RFE/RL Persian News Text Corpus

This dataset is a longitudinal news corpus for the Persian language sourced from Radio Farda from 2001 to 2026. It contains over 350,000 articles (51M tokens).
License: CC-BY-NC-SA-4.0
Locale: fa
Task: NLP
Format: TXT
Size: 307.78 MB

MirasAI

Saraiki 10 Hours TTS Dataset

A 10-hour Saraiki text-to-speech dataset consisting of recorded speech and aligned transcripts, designed for speech synthesis research and development.
License: CC-BY-NC-SA-4.0
Locale: srk
Task: TTS
Format: WEBM, TSV
Size: 584.44 MB

MirasAI

Kannada Time Aligned Speech Corpus

A 5-hour Kannada speech dataset with time-aligned transcriptions, designed for ASR, forced alignment, and speech research.
License: CC-BY-NC-SA-4.0
Locale: kan
Task: ASR
Format: OGG, SRT
Size: 355.77 MB

MDC Curators

Sentence translation difficulty in Spanish - BOUQuET

A collection of Spanish sentences from the BOUQuET benchmark (1990 sentences in total), annotated with sentence translation difficulty scores.
License: CC-BY-SA-4.0
Locale: es
Task: MT
Format: TSV
Size: 81.48 KB

Institute of African Digital Humanities

Yezoum_ALCAM-MultimodalDataset

This dataset comprises aligned audio and text data in Yezoum with French equivalents.
License: NOODL-1.0
Locale: ewo
Task: NLP
Format: MP3, TSV
Size: 12.81 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Serian Bidayuh

A collection of spontaneous responses to questions in Serian Bidayuh.
License: CC0-1.0
Locale: sdo
Task: ASR
Format: MP3
Size: 201.26 MB

Common Voice

Common Voice Scripted Speech 25.0 - Pashto

A collection of read speech recordings in Pashto.
License: CC0-1.0
Locale: ps
Task: ASR
Format: MP3
Size: 97.81 GB

Common Voice

Common Voice Scripted Speech 25.0 - English

A collection of read speech recordings in English.
License: CC0-1.0
Locale: en
Task: ASR
Format: MP3
Size: 87.84 GB

Common Voice

Common Voice Scripted Speech 25.0 - Catalan

A collection of read speech recordings in Catalan.
License: CC0-1.0
Locale: ca
Task: ASR
Format: MP3
Size: 78.67 GB

Institute of African Digital Humanities

Bamun-French Parallel Corpus 2.0

This dataset is an extended and updated version of the "Bamun-French Parallel Corpus 1.1", a parallel corpus of 4,444 lines in Bamun and French.
License: NOODL-1.0
Locale: bax
Task: MT
Format: TSV
Size: 184.29 KB

Common Voice

Common Voice Scripted Speech 25.0 - Kinyarwanda

A collection of read speech recordings in Kinyarwanda.
License: CC0-1.0
Locale: rw
Task: ASR
Format: MP3
Size: 57.18 GB

Common Voice

Common Voice Scripted Speech 25.0 - French

A collection of read speech recordings in French.
License: CC0-1.0
Locale: fr
Task: ASR
Format: MP3
Size: 28.39 GB

Common Voice

Common Voice Scripted Speech 25.0 - Spanish

A collection of read speech recordings in Spanish.
License: CC0-1.0
Locale: es
Task: ASR
Format: MP3
Size: 48.23 GB

Community

Araina Text Corpus (Occitan Aranese)

A text corpus in the Aranese variety of the Gascon dialect of Occitan.
License: CC0-1.0
Locale: oc
Task: LM
Format: TXT
Size: 22.97 MB

Common Voice

Common Voice Scripted Speech 25.0 - Belarusian

A collection of read speech recordings in Belarusian.
License: CC0-1.0
Locale: be
Task: ASR
Format: MP3
Size: 36.21 GB

MDC Curators

Corpus de llenguatge ofensiu en català

This dataset consists of sentences tagged as offensive language in the Catalan portion of the Mozilla Common Voice 25.0 release.
License: CC-BY-SA-4.0
Locale: ca
Task: NLP
Format: TSV
Size: 57.35 KB

Common Voice

Common Voice Scripted Speech 25.0 - German

A collection of read speech recordings in German.
License: CC0-1.0
Locale: de
Task: ASR
Format: MP3
Size: 34.69 GB

Common Voice

Common Voice Scripted Speech 25.0 - Esperanto

A collection of read speech recordings in Esperanto.
License: CC0-1.0
Locale: eo
Task: ASR
Format: MP3
Size: 39.00 GB

Community

Oro_Word

Afaan Oromoo word-level speech dataset collected to support open-source speech recognition and text-to-speech technology.
License: CC0-1.0
Locale: om
Task: TTS
Format: WAV, CSV
Size: 1.28 MB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Kalmyk Speech Corpus

A 3-hour supervised Speech-to-Text dataset for Kalmyk, a Mongolic language. Features sentence-level audio aligned with scientific text transcriptions.
License: CC-BY-NC-SA-4.0
Locale: xal
Task: ASR
Format: TSV, MP3
Size: 138.31 MB