Datasets
Saraiki-English Parallel Corpus
License: CC-BY-NC-4.0
Locale: mul
Task: MT
Format: CSV
Size: 1.92 MB
Effect AI Scripted Speech 1.0 - English
License: CC0-1.0
Locale: en
Task: TTS
Format: CSV, MP3
Size: 663.45 MB
English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)
License: CC-BY-NC-4.0
Locale: en-PK, pnb
Task: MT
Format: CSV
Size: 1.08 MB
Common Voice Scripted Speech 24.0 - English
License: CC0-1.0
Locale: en
Task: ASR
Format: MP3
Size: 87.74 GB
Bangor Miami Spanish-English Corpus
License: GPL-3.0
Locale: es-US, en-US
Task: ASR
Format: MP3, CHA, TSV
Size: 1.12 GB
Common Voice Spontaneous Speech 1.0 - English
License: CC0-1.0
Locale: en
Task: ASR
Format: MP3
Size: 128.69 MB
Common Voice v24 English - en-AU subset for Everything Open 2026
License: CC0-1.0
Locale: en-AU
Task: ASR
Format: CSV, MP3
Size: 1.92 GB
Common Voice Scripted Speech 24.0 - Nigerian Pidgin English
License: CC0-1.0
Locale: pcm
Task: ASR
Format: MP3
Size: 294.05 MB
Bangor Siarad Welsh-English Corpus
License: GPL-3.0
Locale: cym
Task: ASR
Format: MP3, CHA, TSV
Size: 2.13 GB
Kathleen 1.0
License: CC0-1.0
Locale: en-US
Task: TTS
Format: FLAC
Size: 211.96 MB
IBT Torwali Wordlist
License: CC-BY-SA-4.0
Locale: trw
Task: NLP
Format: CSV
Size: 312.87 KB
Sermon-Malaysian-English
License: CC-BY-NC-4.0
Locale: en-MY
Task: ASR
Format: MP4, TXT, SRT
Size: 6.63 MB
Joe 1.0
License: CC0-1.0
Locale: en-US
Task: TTS
Format: WEBM
Size: 75.78 MB
Multilingual Religious Parallel Corpus (Kaleem Art Press)
License: CC-BY-SA-4.0
Locale: mul
Task: MT
Format: CSV
Size: 2.27 MB
Finance Sentences - North American Spanish
License: CC0-1.0
Locale: es-US
Task: NLP
Format: TSV, JSON
Size: 18.35 MB
TTS Central Javanese
License: CC-BY-SA-4.0
Locale: jav
Task: TTS
Format: WEBM, TSV
Size: 440.11 MB
Balochi Academy Text Corpus
License: CC-BY-NC-SA-4.0
Locale: bgn
Task: NLP
Format: TXT
Size: 1.88 MB
Atyap Afwan: Preserving Tyap Through Community-Driven Speech Data
License: CC-BY-NC-SA-4.0
Locale: kcg
Task: NLP
Format: WAV, TXT
Size: 251.51 MB
Common Voice Scripted Speech 24.0 - Mina
License: CC0-1.0
Locale: gej
Task: ASR
Format: MP3
Size: 215.45 MB
Brahui Research Work Corpus
License: CC-BY-NC-SA-4.0
Locale: brh
Task: NLP
Format: TXT
Size: 1.13 MB
Podcast Homostoria (Indonesia)
License: CC-BY-SA-4.0
Locale: id
Task: ASR
Format: MP3
Size: 302.97 MB
Gawri (گاؤری) Magazine Corpus
License: CC-BY-NC-4.0
Locale: gwc
Task: NLP
Format: TXT
Size: 146.71 KB
Jember Javanese Spontaneous Speech Corpus
License: CC-BY-NC-SA-4.0
Locale: jav
Task: ASR
Format: MP3, TSV
Size: 271.65 MB
Baloch Publishers Saraiki Literature Corpus
License: CC-BY-NC-4.0
Locale: skr
Task: NLP
Format: TXT
Size: 2.04 MB