Datasets

Search results for “english”

Bojonegoro Javanese TTS

This dataset contains synthetic speech data covering a variety of everyday-life topics in the Bojonegoro dialect of Javanese, spoken in East Java, Indonesia.

License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: .tar.gz, WEBM

Size: 469.50 MB

Common Voice

Common Voice Spontaneous Speech 2.0 - Thur

A collection of spontaneous spoken phrases in Thur.

License: CC0-1.0

Locale: lth

Task: ASR

Format: MP3

Size: 292.98 MB

Common Voice

Common Voice Scripted Speech 24.0 - Greek

A collection of scripted spoken phrases in Greek.

License: CC0-1.0

Locale: el

Task: ASR

Format: MP3

Size: 741.82 MB

Open Home Foundation

Tugão 1.0

Text-to-speech dataset for Portuguese, male speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: pt-PT

Task: TTS

Format: WEBM

Size: 61.84 MB

Amara Hub

DataTrust Africa: Speech Corpus of Public Radio Recordings from Northern Uganda

This is an open-access corpus of short clips of public radio content from Mega 100 FM, Q FM, Radio Pacis and Radio Rupiny in Northern Uganda. The online corpus currently has over 350 clips of recordings in English, and we hope to add finely annotated transcripts to them. The dataset is intended for NLP research and non-commercial use. Upcoming datasets from Amara Hub include public radio recordings in other languages spoken in the region, such as Acholi, Lango, Lugbara and Akaramajong.

License: NOODL-1.0

Locale: en-US

Task: NLP

Format: MP3

Size: 179.82 MB

NaijaVoices (Lanfrica Labs)

Documenting Ekpeye Folktales and Preserving Cultural Heritage

This dataset presents 21 video-recorded Ekpeye folktales (1h28m) narrated by two community elders, each paired with transcripts and English translations that include narrative summaries. It offers a rich multimodal resource for speech, video, storytelling, and cultural heritage research, as well as for training multilingual and multimodal AI systems.

License: CC-BY-NC-SA-4.0

Locale: ekp

Task: OTH

Format: MP4, TXT, DOCX

Size: 5.97 GB

Community

Podcast Hari Minggoean (Indonesia)

This dataset is derived from the "Hari Minggoean" podcast and features over ten hours of recorded speech from a single, consistent speaker across 42 individual audio files. The content, tailored for a young Indonesian audience, is in Indonesian (Bahasa Indonesia) with code-switching into English and a discernible Javanese accent. Sample: "Tapi dari pelafalan, dari intonasi, dari jedanya dia bicara. That's really good. Dan aku sebenarnya suka banget ketika dia ngomong. Yang btw, soal ekspresif tadi. Aku jadi kepikiran deh."

License: CC-BY-SA-4.0

Locale: id-ID

Task: ASR

Format: MP3

Size: 338.92 MB

Taraaz

Multilingual Humanitarian Response Eval (MHRE)

This multilingual humanitarian dataset contains 655 annotated datapoints evaluating AI chatbot safety and quality in migration and asylum scenarios across four language pairs: English paired with Farsi (Iranian Persian), Arabic, Kurdish (Sorani), and Pashto. Built from 120 expert prompts, it includes outputs from GPT-4o, Gemini 2.5 Flash, and Mistral Small. The dataset provides both human evaluations from Respond Crisis Translation native-speaker evaluators and LLM-as-judge assessments (Gemini 2.5 Flash).

License: CC-BY-NC-SA-4.0

Locale: mul

Task: LLM

Format: CSV

Size: 2.15 MB

Open Home Foundation

Anna 1.0

Text-to-speech dataset for Hungarian, female speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: hu-HU

Task: TTS

Format: WEBM

Size: 95.27 MB

Institute of African Digital Humanities

Akoose-ALCAM-MultimodalDataset

This dataset comprises a datasheet of Akoose (bss) lexical entries collected from the 'Western-Bakossi' subgroup. Each entry is accompanied by illustrative sentences, French translations and a word-by-word breakdown of the Akoose sentences, as well as an equivalent breakdown in English. The resource is enriched with aligned audio recordings, making it ideal for linguistic analysis and the development of speech technology.

License: NOODL-1.0

Locale: bss

Task: NLP

Format: MP3, TSV

Size: 16.05 MB

OpenCSG

smoltalk-chinese

SmolTalk-Chinese: A multi-task Chinese conversational dataset covering 19 typical dialogue task scenarios.

License: Apache-2.0

Locale: zh

Task: LLM

Format: Parquet

Size: 879.81 MB

Open Home Foundation

Lili 1.0

Text-to-speech dataset for Slovak, female speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: sk-SK

Task: TTS

Format: WEBM

Size: 72.38 MB

Chishti Sons

Chishti Sons Punjabi Literature Corpus

This corpus is a collection of more than one million tokens of the Western Punjabi language, produced under the Chishti Sons publishing agency. It contains works of literature, including short stories, stories, fiction, non-fiction, and other literary books. The data is shared with the approval of the authors. The corpus aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: pnb

Task: NLP

Format: TXT

Size: 1.65 MB

Common Voice

Common Voice Spontaneous Speech 2.0 - Eastern Min

A collection of spontaneous spoken phrases in Eastern Min.

License: CC0-1.0

Locale: cdo

Task: ASR

Format: MP3

Size: 190.61 MB

Common Voice

Common Voice Scripted Speech 24.0 - Mokpwe

A collection of scripted spoken phrases in Mokpwe.

License: CC0-1.0

Locale: bri

Task: ASR

Format: MP3

Size: 188.52 MB

Common Voice

Common Voice Scripted Speech 24.0 - Ngomba

A collection of scripted spoken phrases in Ngomba.

License: CC0-1.0

Locale: jgo

Task: ASR

Format: MP3

Size: 216.91 MB

Common Voice

Common Voice Scripted Speech 24.0 - Fang

A collection of scripted spoken phrases in Fang.

License: CC0-1.0

Locale: fan

Task: ASR

Format: MP3

Size: 235.94 MB

Common Voice

Common Voice Scripted Speech 24.0 - Ebrie

A collection of scripted spoken phrases in Ebrie.

License: CC0-1.0

Locale: ebr

Task: ASR

Format: MP3

Size: 61.92 MB

Open Home Foundation

Darkman 1.0

Text-to-speech dataset for Polish, male speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: pl-PL

Task: TTS

Format: WEBM

Size: 40.42 MB

Open Home Foundation

Chitwan 1.0

Text-to-speech dataset for Nepali, male speaker, approximately 1 hour of read speech.

License: CC0-1.0

Locale: ne-NE

Task: TTS

Format: WEBM

Size: 61.68 MB

Common Voice

Common Voice Scripted Speech 24.0 - Slovak

A collection of scripted spoken phrases in Slovak.

License: CC0-1.0

Locale: sk

Task: ASR

Format: MP3

Size: 1.08 GB

Common Voice

Common Voice Scripted Speech 24.0 - Czech

A collection of scripted spoken phrases in Czech.

License: CC0-1.0

Locale: cs

Task: ASR

Format: MP3

Size: 5.54 GB

Balochistan Educational and Cultural Organization

Talar (تلار) Barahui Magazine Corpus

A ~150,000-word Brahui corpus from the monthly magazine Talar, covering editorials, essays, fiction, poetry, and socio-cultural commentary.

License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 317.22 KB

Common Voice

Common Voice Scripted Speech 24.0 - Basaa

A collection of scripted spoken phrases in Basaa.

License: CC0-1.0

Locale: bas

Task: ASR

Format: MP3

Size: 242.77 MB