Datasets

Search results for “english”
Community

Bojonegoro Javanese TTS

This dataset contains synthetic speech data covering a variety of everyday-life topics in the Bojonegoro dialect of Javanese, spoken in East Java, Indonesia.
License: CC-BY-SA-4.0
Locale: jav
Task: TTS
Format: .tar.gz, WEBM
Size: 469.50 MB

Common Voice

Common Voice Spontaneous Speech 2.0 - Thur

A collection of spontaneous spoken phrases in Thur.
License: CC0-1.0
Locale: lth
Task: ASR
Format: MP3
Size: 292.98 MB

Common Voice

Common Voice Scripted Speech 24.0 - Greek

A collection of scripted spoken phrases in Greek.
License: CC0-1.0
Locale: el
Task: ASR
Format: MP3
Size: 741.82 MB

Open Home Foundation

Tugão 1.0

Text to speech dataset for Portuguese, male speaker, approximately 1.5 hours of read speech.
License: CC0-1.0
Locale: pt-PT
Task: TTS
Format: WEBM
Size: 61.84 MB

Amara Hub

DataTrust Africa: Speech Corpus of Public Radio Recordings from Northern Uganda

This is an open-access corpus of short clips of public radio content from Mega 100 FM, Q FM, Radio Pacis and Radio Rupiny in Northern Uganda. The online corpus currently has over 350 clips of recordings in English, and we hope to add finely annotated transcripts to them. The dataset is intended for NLP research and other non-commercial use. Upcoming datasets to look out for from Amara Hub include public radio recordings in other languages spoken in the region, such as Acholi, Lango, Lugbara and Akaramajong.
License: NOODL-1.0
Locale: en-US
Task: NLP
Format: MP3
Size: 179.82 MB

NaijaVoices (Lanfrica Labs)

Documenting Ekpeye Folktales and Preserving Cultural Heritage

This dataset presents 21 video-recorded Ekpeye folktales (1h28m) narrated by two community elders, each paired with transcripts and English translations that include narrative summaries. It offers a rich multimodal resource for speech, video, storytelling, and cultural heritage research, as well as for training multilingual and multimodal AI systems.
License: CC-BY-NC-SA-4.0
Locale: ekp
Task: OTH
Format: MP4, TXT, DOCX
Size: 5.97 GB

Community

Podcast Hari Minggoean (Indonesia)

This dataset is derived from the "Hari Minggoean" podcast and features over ten hours of recorded speech from a single, consistent speaker across 42 individual audio files. The content, tailored for a young Indonesian audience, is in Indonesian (Bahasa Indonesia) with English code-switching and a discernible Javanese accent. Sample: "Tapi dari pelafalan, dari intonasi, dari jedanya dia bicara. That's really good. Dan aku sebenarnya suka banget ketika dia ngomong. Yang btw, soal ekspresif tadi. Aku jadi kepikiran deh."
License: CC-BY-SA-4.0
Locale: id-ID
Task: ASR
Format: mp3
Size: 338.92 MB

Taraaz

Multilingual Humanitarian Response Eval (MHRE)

This multilingual humanitarian dataset contains 655 annotated datapoints evaluating AI chatbot safety and quality in migration and asylum scenarios across four language pairs: English paired with Farsi (Iranian Persian), Arabic, Kurdish (Sorani), and Pashto. Built from 120 expert prompts, it includes outputs from GPT-4o, Gemini 2.5 Flash, and Mistral Small. The dataset provides both human evaluations from Respond Crisis Translation native-speaker evaluators and LLM-as-judge assessments (Gemini 2.5 Flash).
License: CC-BY-NC-SA-4.0
Locale: mul
Task: LLM
Format: csv
Size: 2.15 MB

Open Home Foundation

Anna 1.0

Text to speech dataset for Hungarian, female speaker, approximately 1.5 hours of read speech.
License: CC0-1.0
Locale: hu-HU
Task: TTS
Format: WEBM
Size: 95.27 MB

Institute of African Digital Humanities

Akoose-ALCAM-MultimodalDataset

This dataset comprises a datasheet of Akoose (bss) lexical entries collected from the 'Western-Bakossi' subgroup. Each entry is accompanied by illustrative sentences, French translations and a word-by-word breakdown of the Akoose sentences, as well as an equivalent breakdown in English. The resource is enriched with aligned audio recordings, making it ideal for linguistic analysis and the development of speech technology.
License: NOODL-1.0
Locale: bss
Task: NLP
Format: MP3, TSV
Size: 16.05 MB

OpenCSG

smoltalk-chinese

SmolTalk-Chinese: A multi-task Chinese conversational dataset covering 19 typical dialogue task scenarios.
License: Apache-2.0
Locale: zh
Task: LLM
Format: parquet
Size: 879.81 MB

Open Home Foundation

Lili 1.0

Text to speech dataset for Slovak, female speaker, approximately 2 hours of read speech.
License: CC0-1.0
Locale: sk-SK
Task: TTS
Format: WEBM
Size: 72.38 MB

Chishti Sons

Chishti Sons Punjabi Literature Corpus

This corpus is a collection of more than one million tokens of Western Punjabi. The data was produced under the Chishti Sons publishing agency. The corpus contains works of literature, including short stories, fiction, non-fiction, and other literary books. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.
License: CC-BY-NC-4.0
Locale: pnb
Task: NLP
Format: TXT
Size: 1.65 MB

Common Voice

Common Voice Spontaneous Speech 2.0 - Eastern Min

A collection of spontaneous spoken phrases in Eastern Min.
License: CC0-1.0
Locale: cdo
Task: ASR
Format: MP3
Size: 190.61 MB

Common Voice

Common Voice Scripted Speech 24.0 - Mokpwe

A collection of scripted spoken phrases in Mokpwe.
License: CC0-1.0
Locale: bri
Task: ASR
Format: MP3
Size: 188.52 MB

Common Voice

Common Voice Scripted Speech 24.0 - Ngomba

A collection of scripted spoken phrases in Ngomba.
License: CC0-1.0
Locale: jgo
Task: ASR
Format: MP3
Size: 216.91 MB

Common Voice

Common Voice Scripted Speech 24.0 - Fang

A collection of scripted spoken phrases in Fang.
License: CC0-1.0
Locale: fan
Task: ASR
Format: MP3
Size: 235.94 MB

Common Voice

Common Voice Scripted Speech 24.0 - Ebrie

A collection of scripted spoken phrases in Ebrie.
License: CC0-1.0
Locale: ebr
Task: ASR
Format: MP3
Size: 61.92 MB

Open Home Foundation

Darkman 1.0

Text to speech dataset for Polish, male speaker, approximately 2 hours of read speech.
License: CC0-1.0
Locale: pl-PL
Task: TTS
Format: WEBM
Size: 40.42 MB

Open Home Foundation

Chitwan 1.0

Text to speech dataset for Nepali, male speaker, approximately 1 hour of read speech.
License: CC0-1.0
Locale: ne-NE
Task: TTS
Format: WEBM
Size: 61.68 MB

Common Voice

Common Voice Scripted Speech 24.0 - Slovak

A collection of scripted spoken phrases in Slovak.
License: CC0-1.0
Locale: sk
Task: ASR
Format: MP3
Size: 1.08 GB

Common Voice

Common Voice Scripted Speech 24.0 - Czech

A collection of scripted spoken phrases in Czech.
License: CC0-1.0
Locale: cs
Task: ASR
Format: MP3
Size: 5.54 GB

Balochistan Educational and Cultural Organization

Talar (تلار) Barahui Magazine Corpus

A ~150,000-word Brahui corpus from the monthly magazine Talar, covering editorials, essays, fiction, poetry, and socio-cultural commentary.
License: CC-BY-NC-SA-4.0
Locale: brh
Task: NLP
Format: TXT
Size: 317.22 KB

Common Voice

Common Voice Scripted Speech 24.0 - Basaa

A collection of scripted spoken phrases in Basaa.
License: CC0-1.0
Locale: bas
Task: ASR
Format: MP3
Size: 242.77 MB