Datasets

Jazab Publishers

Jazab Sindhi Newspaper Corpus

The corpus contains 1.07 million tokens from Jazab, a Sindhi newspaper, published between 2023 and 2025. The text consists of the complete newspaper cont...

Task: NLP

Format: TXT

License: CC-BY-NC-SA-4.0

Size: 2.33 MB

Created: 11/9/2025

Locale: snd

Tamir News Agency

Tamir Sindhi News Corpus

The corpus contains 1.1 million tokens from the Tamir Sindhi newspaper, published between 2022 and 2025. The text consists of the complete newspaper content...

Task: NLP

Format: TXT

License: CC-BY-NC-SA-4.0

Size: 2.56 MB

Created: 11/9/2025

Locale: snd

MEDIAMEN

Mediamen Punjabi Literature Corpus

This corpus is a collection of one million tokens of the Western Punjabi language. The data was produced under the Mediamen publishing agency over the last ten...

Task: NLP

Format: TXT

License: CC-BY-NC-4.0

Size: 1.82 MB

Created: 11/9/2025

Locale: pnb

Unknown Organization

Speech Corpus of Armenian Question-Answer Dialogues

A collection of question-answer dialogues in Western and Eastern Armenian.

Task: ASR

Format: WAV, TEXTGRID, TXT

License: GPL-3.0

Size: 2.10 GB

Created: 11/8/2025

Locale: hy

Institute of African Digital Humanities

Ewondo-French Parallel Corpus

This dataset is a parallel corpus of Ewondo-to-French texts. Texts were obtained by transcription of raw audio files. Translations were added to enrich the ori...

Task: MT

Format: TSV

License: NOODL-1.0

Size: 137.84 KB

Created: 11/8/2025

Locale: ewo, fr

Open Home Foundation

Dimitar 1.0

Text-to-speech dataset for Bulgarian with a male speaker and approximately 2 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 109.58 MB

Created: 11/7/2025

Locale: bg-BG

Tamahi Suneha Magazine

Punjabi Literature Corpus

This corpus contains 1,039,430 tokens of Punjabi in Shahmukhi script.

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 1.83 MB

Created: 11/7/2025

Locale: pa-PK

Sujaak Adbi Sangat

Saraiki Quarterly Magazine Wasson Wehray Corpus

This corpus contains 1,179,200 tokens from the Saraiki quarterly magazine "Wasson Wehray".

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 2.09 MB

Created: 11/7/2025

Locale: skr

Bismillah Graphics Publishers

Urdu Literature Corpus

This corpus contains 1,617,074 tokens from multiple Urdu literature books published by Bismillah Graphics Publishers.

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 2.86 MB

Created: 11/7/2025

Locale: ur

Kaleem Art Press

Saraiki Literature Corpus

This corpus contains multiple Saraiki-language books: stories, short stories, a novel, a travelogue, sentences, and a collection of articles.

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 1.84 MB

Created: 11/6/2025

Locale: skr

Unknown Organization

Podcast Hari Minggoean (Indonesia)

This dataset is derived from the "Hari Minggoean" podcast, featuring over ten hours of recorded speech from a single, consistent speaker. The content, tailor...

Task: ASR

Format: MP3

License: CC-BY-SA-4.0

Size: 338.92 MB

Created: 11/5/2025

Locale: id-ID

Kaltepetlahtol

Tetelancingo Nahuatl

Audio of Nahuatl from the Sierra Oeste de Puebla, transcribed, orthographically normalized, translated, and tagged.

Task: ASR

Format: TSV, WAV

License: CC-BY-NC-4.0

Size: 952.98 MB

Created: 11/4/2025

Locale: nhi

Common Voice

Mozilla Common Voice Spontaneous Speech ASR Shared Task Train/Dev Data

This datasheet is for the bundle of Mozilla Common Voice spontaneous speech datasets to be used in the Shared Task on Spontaneous Speech.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 4.30 GB

Created: 9/25/2025

Locale: mul

Common Voice

Common Voice Spontaneous Speech 1.0 - Papantla Totonac

A collection of spontaneous spoken phrases in Papantla Totonac.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 205.70 MB

Created: 9/15/2025

Locale: top

Common Voice

Common Voice Spontaneous Speech 1.0 - Rutoro

A collection of spontaneous spoken phrases in Rutoro.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 272.80 MB

Created: 9/15/2025

Locale: ttj

Common Voice

Common Voice Spontaneous Speech 1.0 - Kuku

A collection of spontaneous spoken phrases in Kuku.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 237.60 MB

Created: 9/15/2025

Locale: ukv

Common Voice

Common Voice Spontaneous Speech 1.0 - Sena

A collection of spontaneous spoken phrases in Sena.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 4.40 MB

Created: 9/15/2025

Locale: seh

Common Voice

Common Voice Spontaneous Speech 1.0 - Central Melanau

A collection of spontaneous spoken phrases in Central Melanau.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 208.60 MB

Created: 9/15/2025

Locale: mel

Common Voice

Common Voice Spontaneous Speech 1.0 - Kenyah

A collection of spontaneous spoken phrases in Kenyah.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 212.30 MB

Created: 9/15/2025

Locale: xkl

Common Voice

Common Voice Spontaneous Speech 1.0 - Ruuli

A collection of spontaneous spoken phrases in Ruuli.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 365.20 MB

Created: 9/15/2025

Locale: ruc

Common Voice

Common Voice Spontaneous Speech 1.0 - Michoacán Mazahua

A collection of spontaneous spoken phrases in Michoacán Mazahua.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 225.70 MB

Created: 9/15/2025

Locale: mmc

Common Voice

Common Voice Spontaneous Speech 1.0 - Sabah Malay

A collection of spontaneous spoken phrases in Sabah Malay.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 277.20 MB

Created: 9/15/2025

Locale: msi

Common Voice

Common Voice Spontaneous Speech 1.0 - Southwestern Tlaxiaco Mixtec

A collection of spontaneous spoken phrases in Southwestern Tlaxiaco Mixtec.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 201.80 MB

Created: 9/15/2025

Locale: meh

Common Voice

Common Voice Spontaneous Speech 1.0 - Toba Qom

A collection of spontaneous spoken phrases in Toba Qom.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 172.50 MB

Created: 9/15/2025

Locale: tob