Datasets
Jazab Publishers
Jazab Sindhi Newspaper Corpus
The corpus contains 1.07 million tokens from Jazab, a Sindhi newspaper, published between 2023 and 2025. The text consists of the complete newspaper cont...
Task: NLP
Format: TXT
License: CC-BY-NC-SA-4.0
Size: 2.33 MB
Created: 11/9/2025
Locale: snd
Tamir News Agency
Tamir Sindhi News Corpus
The corpus contains 1.1 million tokens from the Tamir Sindhi newspaper, published between 2022 and 2025. The text consists of the complete newspaper content...
Task: NLP
Format: TXT
License: CC-BY-NC-SA-4.0
Size: 2.56 MB
Created: 11/9/2025
Locale: snd
MEDIAMEN
Mediamen Punjabi Literature Corpus
This corpus is a collection of one million tokens in the Western Punjabi language. The data was produced under the Mediamen publishing agency over the last ten...
Task: NLP
Format: TXT
License: CC-BY-NC-4.0
Size: 1.82 MB
Created: 11/9/2025
Locale: pnb
Unknown Organization
Speech Corpus of Armenian Question-Answer Dialogues
A collection of question-answer dialogues in Western and Eastern Armenian.
Task: ASR
Format: WAV, TEXTGRID, TXT
License: GPL-3.0
Size: 2.10 GB
Created: 11/8/2025
Locale: hy
Institute of African Digital Humanities
Ewondo-French Parallel Corpus
This dataset is a parallel corpus of Ewondo to French texts. Texts were obtained by transcribing raw audio files. Translations were added to enrich the ori...
Task: MT
Format: TSV
License: NOODL-1.0
Size: 137.84 KB
Created: 11/8/2025
Locale: ewo, fr
Open Home Foundation
Dimitar 1.0
Text-to-speech dataset for Bulgarian: male speaker, approximately 2 hours of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 109.58 MB
Created: 11/7/2025
Locale: bg-BG
Tamahi Suneha Magazine
Punjabi Literature Corpus
This corpus contains 1,039,430 tokens of Punjabi in Shahmukhi script.
Task: OTH
Format: TXT
License: CC-BY-NC-4.0
Size: 1.83 MB
Created: 11/7/2025
Locale: pa-PK
Sujaak Adbi Sangat
Saraiki Quarterly Magazine Wasson Wehray Corpus
This corpus contains 1,179,200 tokens from the Saraiki quarterly magazine "Wasson Wehray".
Task: OTH
Format: TXT
License: CC-BY-NC-4.0
Size: 2.09 MB
Created: 11/7/2025
Locale: skr
Bismillah Graphics Publishers
Urdu Literature Corpus
This corpus contains 1,617,074 tokens from multiple Urdu literature books published by Bismillah Graphics Publishers.
Task: OTH
Format: TXT
License: CC-BY-NC-4.0
Size: 2.86 MB
Created: 11/7/2025
Locale: ur
Kaleem Art Press
Saraiki Literature Corpus
This corpus contains multiple Saraiki-language books of stories, short stories, novels, travelogues, and sentences, plus a collection of articles.
Task: OTH
Format: TXT
License: CC-BY-NC-4.0
Size: 1.84 MB
Created: 11/6/2025
Locale: skr
Unknown Organization
Podcast Hari Minggoean (Indonesia)
This dataset is derived from the "Hari Minggoean" podcast, featuring over ten hours of recorded speech from a single, consistent speaker. The content, tailor...
Task: ASR
Format: MP3
License: CC-BY-SA-4.0
Size: 338.92 MB
Created: 11/5/2025
Locale: id-ID
Kaltepetlahtol
Tetelancingo Nahuatl
Audio of the Nahuatl of the Sierra Oeste de Puebla (western highlands of Puebla), transcribed, orthographically normalized, translated, and tagged.
Task: ASR
Format: TSV, WAV
License: CC-BY-NC-4.0
Size: 952.98 MB
Created: 11/4/2025
Locale: nhi
Common Voice
Mozilla Common Voice Spontaneous Speech ASR Shared Task Train/Dev Data
This datasheet is for the bundle of Mozilla Common Voice spontaneous speech datasets to be used in the Shared Task on Spontaneous Speech.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 4.30 GB
Created: 9/25/2025
Locale: mul
Common Voice
Common Voice Spontaneous Speech 1.0 - Papantla Totonac
A collection of spontaneous spoken phrases in Papantla Totonac.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 205.70 MB
Created: 9/15/2025
Locale: top
Common Voice
Common Voice Spontaneous Speech 1.0 - Rutoro
A collection of spontaneous spoken phrases in Rutoro.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 272.80 MB
Created: 9/15/2025
Locale: ttj
Common Voice
Common Voice Spontaneous Speech 1.0 - Kuku
A collection of spontaneous spoken phrases in Kuku.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 237.60 MB
Created: 9/15/2025
Locale: ukv
Common Voice
Common Voice Spontaneous Speech 1.0 - Sena
A collection of spontaneous spoken phrases in Sena.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 4.40 MB
Created: 9/15/2025
Locale: seh
Common Voice
Common Voice Spontaneous Speech 1.0 - Central Melanau
A collection of spontaneous spoken phrases in Central Melanau.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 208.60 MB
Created: 9/15/2025
Locale: mel
Common Voice
Common Voice Spontaneous Speech 1.0 - Kenyah
A collection of spontaneous spoken phrases in Kenyah.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 212.30 MB
Created: 9/15/2025
Locale: xkl
Common Voice
Common Voice Spontaneous Speech 1.0 - Ruuli
A collection of spontaneous spoken phrases in Ruuli.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 365.20 MB
Created: 9/15/2025
Locale: ruc
Common Voice
Common Voice Spontaneous Speech 1.0 - Michoacán Mazahua
A collection of spontaneous spoken phrases in Michoacán Mazahua.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 225.70 MB
Created: 9/15/2025
Locale: mmc
Common Voice
Common Voice Spontaneous Speech 1.0 - Sabah Malay
A collection of spontaneous spoken phrases in Sabah Malay.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 277.20 MB
Created: 9/15/2025
Locale: msi
Common Voice
Common Voice Spontaneous Speech 1.0 - Southwestern Tlaxiaco Mixtec
A collection of spontaneous spoken phrases in Southwestern Tlaxiaco Mixtec.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 201.80 MB
Created: 9/15/2025
Locale: meh
Common Voice
Common Voice Spontaneous Speech 1.0 - Toba Qom
A collection of spontaneous spoken phrases in Toba Qom.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 172.50 MB
Created: 9/15/2025
Locale: tob
