Datasets
Institute of African Digital Humanities
Ewondo-French Parallel Corpus
This dataset is a parallel corpus of Ewondo-to-French texts. The texts were obtained by transcribing raw audio files. Translations were added to enrich the ori...
Task: MT
Format: TSV
License: NOODL-1.0
Size: 137.84 KB
Created: 11/8/2025
Locale: ewo, fr
Open Home Foundation
Dimitar 1.0
Text-to-speech dataset for Bulgarian: one male speaker, approximately 2 hours of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 109.58 MB
Created: 11/7/2025
Locale: bg-BG
Tamahi Suneha Magazine
Punjabi Literature Corpus
This corpus contains 10,39,430 tokens of Punjabi in Shahmukhi script.
Task: OTH
Format: TXT
License: CC-BY-NC-4.0
Size: 1.83 MB
Created: 11/7/2025
Locale: pa-PK
Sujaak Adbi Sangat
Saraiki Quarterly Magazine Wasson Wehray Corpus
This corpus contains 11,79,200 tokens from the Saraiki quarterly magazine "Wasson Wehray".
Task: OTH
Format: TXT
License: CC-BY-NC-4.0
Size: 2.09 MB
Created: 11/7/2025
Locale: skr
Rana Printers Multan
Urdu Literature Corpus
This corpus contains 16,82,700 tokens from multiple Urdu-language books.
Task: OTH
Format: TXT
License: CC-BY-NC-4.0
Size: 3.00 MB
Created: 11/7/2025
Locale: ur
Bismillah Graphics Publishers
Urdu Literature Corpus
This corpus contains 16,17,074 tokens from multiple Urdu literature books published by Bismillah Graphics Publishers.
Task: OTH
Format: TXT
License: CC-BY-NC-4.0
Size: 2.86 MB
Created: 11/7/2025
Locale: ur
Kaleem Art Press
Urdu Literature Corpus
This corpus contains multiple Urdu-language books: stories, short stories, novels, travelogues, poetry, biography, literature, history and other literary da...
Task: OTH
Format: TXT
License: CC-BY-NC-4.0
Size: 2.85 MB
Created: 11/6/2025
Locale: ur
Kaleem Art Press
Saraiki Literature Corpus
This corpus contains multiple Saraiki-language books: stories, short stories, novels, travelogues, sentences, and collections of articles.
Task: OTH
Format: TXT
License: CC-BY-NC-4.0
Size: 1.84 MB
Created: 11/6/2025
Locale: skr
Unknown Organization
Podcast Hari Minggoean (Indonesia)
This dataset is derived from the "Hari Minggoean" podcast, featuring over ten hours of recorded speech from a single, consistent speaker. The content, tailor...
Task: ASR
Format: MP3
License: CC-BY-SA-4.0
Size: 338.92 MB
Created: 11/5/2025
Locale: id-ID
Kaltepetlahtol
Tetelancingo Nahuatl
Audio of the Nahuatl of the Sierra Oeste de Puebla, transcribed, orthographically normalized, translated, and tagged.
Task: ASR
Format: TSV, WAV
License: CC-BY-NC-4.0
Size: 952.98 MB
Created: 11/4/2025
Locale: nhi
Common Voice
Mozilla Common Voice Spontaneous Speech ASR Shared Task Train/Dev Data
This datasheet is for the bundle of Mozilla Common Voice spontaneous speech datasets to be used in the Shared Task on Spontaneous Speech.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 4.30 GB
Created: 9/25/2025
Locale: mul
Common Voice
Common Voice Spontaneous Speech 1.0 - Papantla Totonac
A collection of spontaneous spoken phrases in Papantla Totonac.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 205.70 MB
Created: 9/15/2025
Locale: top
Common Voice
Common Voice Spontaneous Speech 1.0 - Rutoro
A collection of spontaneous spoken phrases in Rutoro.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 272.80 MB
Created: 9/15/2025
Locale: ttj
Common Voice
Common Voice Spontaneous Speech 1.0 - Kuku
A collection of spontaneous spoken phrases in Kuku.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 237.60 MB
Created: 9/15/2025
Locale: ukv
Common Voice
Common Voice Spontaneous Speech 1.0 - Sena
A collection of spontaneous spoken phrases in Sena.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 4.40 MB
Created: 9/15/2025
Locale: seh
Common Voice
Common Voice Spontaneous Speech 1.0 - Central Melanau
A collection of spontaneous spoken phrases in Central Melanau.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 208.60 MB
Created: 9/15/2025
Locale: mel
Common Voice
Common Voice Spontaneous Speech 1.0 - Kenyah
A collection of spontaneous spoken phrases in Kenyah.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 212.30 MB
Created: 9/15/2025
Locale: xkl
Common Voice
Common Voice Spontaneous Speech 1.0 - Ruuli
A collection of spontaneous spoken phrases in Ruuli.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 365.20 MB
Created: 9/15/2025
Locale: ruc
Common Voice
Common Voice Spontaneous Speech 1.0 - Michoacán Mazahua
A collection of spontaneous spoken phrases in Michoacán Mazahua.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 225.70 MB
Created: 9/15/2025
Locale: mmc
Common Voice
Common Voice Spontaneous Speech 1.0 - Sabah Malay
A collection of spontaneous spoken phrases in Sabah Malay.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 277.20 MB
Created: 9/15/2025
Locale: msi
Common Voice
Common Voice Spontaneous Speech 1.0 - Southwestern Tlaxiaco Mixtec
A collection of spontaneous spoken phrases in Southwestern Tlaxiaco Mixtec.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 201.80 MB
Created: 9/15/2025
Locale: meh
Common Voice
Common Voice Spontaneous Speech 1.0 - Toba Qom
A collection of spontaneous spoken phrases in Toba Qom.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 172.50 MB
Created: 9/15/2025
Locale: tob
Common Voice
Common Voice Spontaneous Speech 1.0 - Amba
A collection of spontaneous spoken phrases in Amba.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 265.80 MB
Created: 9/15/2025
Locale: rwm
Common Voice
Common Voice Spontaneous Speech 1.0 - Turkish
A collection of spontaneous spoken phrases in Turkish.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 3.10 MB
Created: 9/15/2025
Locale: tr
