Datasets

Institute of African Digital Humanities

Ewondo-French Parallel Corpus

This dataset is a parallel corpus of Ewondo-to-French texts. The texts were obtained by transcribing raw audio files. Translations were added to enrich the ori...

Task: MT

Format: TSV

License: NOODL-1.0

Size: 137.84 KB

Created: 11/8/2025

Locale: ewo, fr

Open Home Foundation

Dimitar 1.0

Text to speech dataset for Bulgarian, male speaker, approximately 2 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 109.58 MB

Created: 11/7/2025

Locale: bg-BG

Tamahi Suneha Magazine

Punjabi Literature Corpus

This corpus contains 10,39,430 tokens of Punjabi text in Shahmukhi script.

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 1.83 MB

Created: 11/7/2025

Locale: pa-PK

Sujaak Adbi Sangat

Saraiki Quarterly Magazine Wasson Wehray Corpus

This corpus contains 11,79,200 tokens from the Saraiki quarterly magazine "Wasson Wehray".

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 2.09 MB

Created: 11/7/2025

Locale: skr

Rana Printers Multan

Urdu Literature Corpus

This corpus contains 16,82,700 tokens from multiple Urdu-language books.

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 3.00 MB

Created: 11/7/2025

Locale: ur

Bismillah Graphics Publishers

Urdu Literature Corpus

This corpus contains 16,17,074 tokens from multiple Urdu literature books published by Bismillah Graphics Publishers.

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 2.86 MB

Created: 11/7/2025

Locale: ur

Kaleem Art Press

Urdu Literature Corpus

This corpus contains multiple Urdu-language books: stories, short stories, novels, travelogues, poetry, biography, literature, history and other literary da...

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 2.85 MB

Created: 11/6/2025

Locale: ur

Kaleem Art Press

Saraiki Literature Corpus

This corpus contains multiple Saraiki-language books: stories, short stories, novels, travelogues, sentences, and collections of articles.

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 1.84 MB

Created: 11/6/2025

Locale: skr

Unknown Organization

Podcast Hari Minggoean (Indonesia)

This dataset is derived from the "Hari Minggoean" podcast, featuring over ten hours of recorded speech from a single, consistent speaker. The content, tailor...

Task: ASR

Format: mp3

License: CC-BY-SA-4.0

Size: 338.92 MB

Created: 11/5/2025

Locale: id-ID

Kaltepetlahtol

Tetelancingo Nahuatl

Audio of the Nahuatl of the Sierra Oeste de Puebla, transcribed, orthographically normalized, translated, and tagged.

Task: ASR

Format: .tsv, .wav

License: CC-BY-NC-4.0

Size: 952.98 MB

Created: 11/4/2025

Locale: nhi

Common Voice

Mozilla Common Voice Spontaneous Speech ASR Shared Task Train/Dev Data

This datasheet is for the bundle of Mozilla Common Voice spontaneous speech datasets to be used in the Shared Task on Spontaneous Speech.

Task: ASR

Format: mp3

License: CC0-1.0

Size: 4.30 GB

Created: 9/25/2025

Locale: mul

Common Voice

Common Voice Spontaneous Speech 1.0 - Papantla Totonac

A collection of spontaneous spoken phrases in Papantla Totonac.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 205.70 MB

Created: 9/15/2025

Locale: top

Common Voice

Common Voice Spontaneous Speech 1.0 - Rutoro

A collection of spontaneous spoken phrases in Rutoro.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 272.80 MB

Created: 9/15/2025

Locale: ttj

Common Voice

Common Voice Spontaneous Speech 1.0 - Kuku

A collection of spontaneous spoken phrases in Kuku.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 237.60 MB

Created: 9/15/2025

Locale: ukv

Common Voice

Common Voice Spontaneous Speech 1.0 - Sena

A collection of spontaneous spoken phrases in Sena.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 4.40 MB

Created: 9/15/2025

Locale: seh

Common Voice

Common Voice Spontaneous Speech 1.0 - Central Melanau

A collection of spontaneous spoken phrases in Central Melanau.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 208.60 MB

Created: 9/15/2025

Locale: mel

Common Voice

Common Voice Spontaneous Speech 1.0 - Kenyah

A collection of spontaneous spoken phrases in Kenyah.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 212.30 MB

Created: 9/15/2025

Locale: xkl

Common Voice

Common Voice Spontaneous Speech 1.0 - Ruuli

A collection of spontaneous spoken phrases in Ruuli.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 365.20 MB

Created: 9/15/2025

Locale: ruc

Common Voice

Common Voice Spontaneous Speech 1.0 - Michoacán Mazahua

A collection of spontaneous spoken phrases in Michoacán Mazahua.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 225.70 MB

Created: 9/15/2025

Locale: mmc

Common Voice

Common Voice Spontaneous Speech 1.0 - Sabah Malay

A collection of spontaneous spoken phrases in Sabah Malay.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 277.20 MB

Created: 9/15/2025

Locale: msi

Common Voice

Common Voice Spontaneous Speech 1.0 - Southwestern Tlaxiaco Mixtec

A collection of spontaneous spoken phrases in Southwestern Tlaxiaco Mixtec.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 201.80 MB

Created: 9/15/2025

Locale: meh

Common Voice

Common Voice Spontaneous Speech 1.0 - Toba Qom

A collection of spontaneous spoken phrases in Toba Qom.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 172.50 MB

Created: 9/15/2025

Locale: tob

Common Voice

Common Voice Spontaneous Speech 1.0 - Amba

A collection of spontaneous spoken phrases in Amba.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 265.80 MB

Created: 9/15/2025

Locale: rwm

Common Voice

Common Voice Spontaneous Speech 1.0 - Turkish

A collection of spontaneous spoken phrases in Turkish.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 3.10 MB

Created: 9/15/2025

Locale: tr