Datasets

Sentence translation difficulty in English - BOUQuET

A collection of sentences in English from the BOUQuET benchmark (1,990 sentences in total) that have been annotated with sentence translation difficulty scores.

License: CC-BY-SA-4.0

Locale: en

Task: NLP

Format: TSV

Size: 85.61 KB

Institute of African Digital Humanities

Bamun-TTS-Dataset

This dataset consists of segmented Bamun (Shupamem) speech audio clips paired with text, designed for Text-to-Speech (TTS) applications.

License: NOODL-1.0

Locale: bax

Task: TTS

Format: MP3, TSV

Size: 219.97 MB

GriôTech

Territórios Digitais

Dataset on community-driven responses to disinformation and AI in marginalized territories in Brazil, based on participatory research.

License: CC-BY-4.0

Locale: pt, en

Task: N/A

Format: DOCX, PDF, XLSX

Size: 4.24 MB

Taruen

Chuvash TTS

A ~5-hour speech dataset for Chuvash Text-to-Speech (TTS) research, featuring a single female speaker reading news and digits at a rapid tempo.

License: CC-BY-SA-4.0

Locale: cv

Task: TTS

Format: PARQUET

Size: 854.02 MB

RFERL

RFE/RL Persian News Text Corpus

This dataset is a longitudinal news corpus for the Persian language sourced from Radio Farda from 2001 to 2026. It contains over 350,000 articles (51M tokens).

License: CC-BY-NC-SA-4.0

Locale: fa

Task: NLP

Format: TXT

Size: 307.78 MB

MirasAI

Saraiki 10 Hours TTS Dataset

A 10-hour Saraiki text-to-speech dataset consisting of recorded speech and aligned transcripts, designed for speech synthesis research and development.

License: CC-BY-NC-SA-4.0

Locale: skr

Task: TTS

Format: WEBM, TSV

Size: 584.44 MB

MirasAI

Kannada Time Aligned Speech Corpus

A 5-hour Kannada speech dataset with time-aligned transcriptions, designed for ASR, forced alignment, and speech research.

License: CC-BY-NC-SA-4.0

Locale: kan

Task: ASR

Format: OGG, SRT

Size: 355.77 MB

MDC Curators

Sentence translation difficulty in Spanish - BOUQuET

A collection of sentences in Spanish from the BOUQuET benchmark (1,990 sentences in total) that have been annotated with sentence translation difficulty scores.

License: CC-BY-SA-4.0

Locale: es

Task: MT

Format: TSV

Size: 81.48 KB

Institute of African Digital Humanities

Yezoum_ALCAM-MultimodalDataset

This dataset comprises aligned audio and text data in Yezoum with French equivalents.

License: NOODL-1.0

Locale: ewo

Task: NLP

Format: MP3, TSV

Size: 12.81 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Serian Bidayuh

A collection of spontaneous responses to questions in Serian Bidayuh.

License: CC0-1.0

Locale: sdo

Task: ASR

Format: MP3

Size: 201.26 MB

Common Voice

Common Voice Scripted Speech 25.0 - Pashto

A collection of read speech recordings in Pashto.

License: CC0-1.0

Locale: ps

Task: ASR

Format: MP3

Size: 97.81 GB

Common Voice

Common Voice Scripted Speech 25.0 - English

A collection of read speech recordings in English.

License: CC0-1.0

Locale: en

Task: ASR

Format: MP3

Size: 87.84 GB

Common Voice

Common Voice Scripted Speech 25.0 - Catalan

A collection of read speech recordings in Catalan.

License: CC0-1.0

Locale: ca

Task: ASR

Format: MP3

Size: 78.67 GB

Institute of African Digital Humanities

Bamun-French Parallel Corpus 2.0

This dataset is an extended and updated version of the "Bamun-French Parallel Corpus 1.1", a parallel corpus of 4,444 lines in Bamun and French.

License: NOODL-1.0

Locale: bax

Task: MT

Format: TSV

Size: 184.29 KB

Common Voice

Common Voice Scripted Speech 25.0 - Kinyarwanda

A collection of read speech recordings in Kinyarwanda.

License: CC0-1.0

Locale: rw

Task: ASR

Format: MP3

Size: 57.18 GB

Common Voice

Common Voice Scripted Speech 25.0 - French

A collection of read speech recordings in French.

License: CC0-1.0

Locale: fr

Task: ASR

Format: MP3

Size: 28.39 GB

Common Voice

Common Voice Scripted Speech 25.0 - Spanish

A collection of read speech recordings in Spanish.

License: CC0-1.0

Locale: es

Task: ASR

Format: MP3

Size: 48.23 GB

Community

Araina Text Corpus (Occitan Aranese)

A text corpus in the Aranese variety of the Gascon dialect of Occitan.

License: CC0-1.0

Locale: oc

Task: LM

Format: TXT

Size: 22.97 MB

Common Voice

Common Voice Scripted Speech 25.0 - Belarusian

A collection of read speech recordings in Belarusian.

License: CC0-1.0

Locale: be

Task: ASR

Format: MP3

Size: 36.21 GB

MDC Curators

Corpus de llenguatge ofensiu en català

This dataset consists of sentences tagged as offensive language in version 25.0 of Mozilla Common Voice in Catalan.

License: CC-BY-SA-4.0

Locale: ca

Task: NLP

Format: TSV

Size: 57.35 KB

Common Voice

Common Voice Scripted Speech 25.0 - German

A collection of read speech recordings in German.

License: CC0-1.0

Locale: de

Task: ASR

Format: MP3

Size: 34.69 GB

Common Voice

Common Voice Scripted Speech 25.0 - Esperanto

A collection of read speech recordings in Esperanto.

License: CC0-1.0

Locale: eo

Task: ASR

Format: MP3

Size: 39.00 GB

Community

Oro_Word

An Afaan Oromoo word-level speech dataset collected to support open-source speech recognition and text-to-speech technology.

License: CC0-1.0

Locale: om

Task: TTS

Format: WAV, CSV

Size: 1.28 MB

Institute of Finno-Ugric/Uralic Studies, University of Hamburg

INEL Kalmyk Speech Corpus

A 3-hour supervised Speech-to-Text dataset for Kalmyk, a Mongolic language. Features sentence-level audio aligned with scientific text transcriptions.

License: CC-BY-NC-SA-4.0

Locale: xal

Task: ASR

Format: TSV, MP3

Size: 138.31 MB