Datasets

Institute of African Digital Humanities

Ewondo-French Parallel Corpus

This dataset is a parallel corpus of Ewondo-to-French texts. The texts were obtained by transcribing raw audio files, and translations were added to enrich the original corpus. The Ewondo and French texts were aligned in the process of creating this dataset.
License: NOODL-1.0

Locale: ewo, fr

Task: MT

Format: TSV

Size: 137.84 KB

Open Home Foundation

Dimitar 1.0

Text-to-speech dataset for Bulgarian with a male speaker and approximately 2 hours of read speech.
License: CC0-1.0

Locale: bg-BG

Task: TTS

Format: WEBM

Size: 109.58 MB

Tamahi Suneha Magazine

Punjabi Literature Corpus

This corpus contains 1,039,430 tokens of Punjabi in Shahmukhi script.
License: CC-BY-NC-4.0

Locale: pa-PK

Task: OTH

Format: TXT

Size: 1.83 MB

Sujaak Adbi Sangat

Saraiki Quarterly Magazine Wasson Wehray Corpus

This corpus contains 1,179,200 tokens from the Saraiki quarterly magazine "Wasson Wehray".
License: CC-BY-NC-4.0

Locale: skr

Task: OTH

Format: TXT

Size: 2.09 MB

Bismillah Graphics Publishers

Urdu Literature Corpus

This corpus contains 1,617,074 tokens from multiple Urdu literature books published by Bismillah Graphics Publishers.
License: CC-BY-NC-4.0

Locale: ur

Task: OTH

Format: TXT

Size: 2.86 MB

Kaleem Art Press

Saraiki Literature Corpus

This corpus contains multiple Saraiki-language books, including stories, short stories, a novel, a travelogue, sentences, and a collection of articles.
License: CC-BY-NC-4.0

Locale: skr

Task: OTH

Format: TXT

Size: 1.84 MB

Community

Podcast Hari Minggoean (Indonesia)

This dataset is derived from the "Hari Minggoean" podcast and features over ten hours of recorded speech from a single, consistent speaker. The content, tailored for a young Indonesian audience, is in Indonesian (Bahasa Indonesia) with code-switching into English and a discernible Javanese accent. The collection comprises 42 individual audio files.

Sample: "Tapi dari pelafalan, dari intonasi, dari jedanya dia bicara. That's really good. Dan aku sebenarnya suka banget ketika dia ngomong. Yang btw, soal ekspresif tadi. Aku jadi kepikiran deh."
License: CC-BY-SA-4.0

Locale: id-ID

Task: ASR

Format: mp3

Size: 338.92 MB

Kaltepetlahtol

Tetelancingo Nahuatl

Audio of Nahuatl from the Sierra Oeste de Puebla, transcribed, orthographically normalized, translated, and labeled.
License: CC-BY-NC-4.0

Locale: nhi

Task: ASR

Format: .tsv, .wav

Size: 952.98 MB

Common Voice

Mozilla Common Voice Spontaneous Speech ASR Shared Task Train/Dev Data

This datasheet is for the bundle of Mozilla Common Voice spontaneous speech datasets to be used in the Shared Task on Spontaneous Speech.
License: CC0-1.0

Locale: mul

Task: ASR

Format: mp3

Size: 4.30 GB