Datasets

Open Home Foundation

Anna 1.0

Text-to-speech dataset for Hungarian, female speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: hu-HU

Task: TTS

Format: WEBM

Size: 95.27 MB

Open Home Foundation

Dave 1.0

Text-to-speech dataset for Spanish, male speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: es-ES

Task: TTS

Format: WEBM

Size: 85.24 MB

Open Home Foundation

Kathleen 1.0

Text-to-speech dataset for English, female speaker, approximately 1 hour of read speech.

License: CC0-1.0

Locale: en-US

Task: TTS

Format: FLAC

Size: 211.96 MB

Open Home Foundation

Joe 1.0

Text-to-speech dataset for English, male speaker, approximately 1 hour of read speech.

License: CC0-1.0

Locale: en-US

Task: TTS

Format: WEBM

Size: 75.78 MB

Open Home Foundation

Kerstin 1.0

Text-to-speech dataset for German, female speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WEBM

Size: 132.05 MB

Rerooted Archive

ReRooted: Speech Corpus of Testimonials from Armenian Refugees and Immigrants

ReRooted is an open-access online YouTube corpus of interviews with Armenian refugees and immigrants. The online corpus currently has over 80 hours of recordings on YouTube, alongside Armenian subtitles from Amara. The current dataset is 10 hours from the corpus, curated with corrected and finely annotated transcripts. The dataset is intended for NLP research, and we hope to continuously update it with more curated transcripts.

License: GPL-3.0

Locale: hy

Task: ASR

Format: WAV, TEXTGRID

Size: 3.25 GB

Weekly Kaleem Magazine Multan

Kaleem Magazine Urdu Corpus

This corpus is a collection of around 1.4 million tokens of the Urdu language. The data was extracted from the archives of "Kaleem", a famous Urdu magazine published weekly for the last 30 years. The corpus contains works of literature including stories, short stories, news, poetry, literary reports, fiction, non-fiction, and travelogues. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: urd

Task: NLP

Format: TXT

Size: 2.74 MB

Baloch Publishers Multan

Baloch Publishers Saraiki Literature Corpus

This corpus is a collection of one million tokens of the Saraiki language. The data was produced by Baloch Publishers over the last ten years. The corpus contains works of literature including dictionaries, short stories, novels, fiction, non-fiction, and travelogues. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: skr

Task: NLP

Format: TXT

Size: 2.04 MB

Chishti Sons

Chishti Sons Punjabi Literature Corpus

This corpus is a collection of more than one million tokens of the Western Punjabi language. The data was produced by the Chishti Sons publishing agency. The corpus contains works of literature including short stories, stories, fiction, non-fiction, and other literary books. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: pnb

Task: NLP

Format: TXT

Size: 1.65 MB

Institute of African Digital Humanities

FUB-Narratives

This dataset contains literary texts derived from oral Adamawa Fulfulde (fub) performances. The texts are of various genres, including narratives, hymns, riddles, and poems. The performances were recorded on tape and then transcribed with the help of local Adamawa Fulfulde language experts.

License: NOODL-1.0

Locale: fub

Task: NLP

Format: TXT

Size: 168.34 KB

Jazab Publishers

Jazab Sindhi Newspaper Corpus

The corpus contains 1.07 million tokens from Jazab, a Sindhi newspaper, covering issues published from 2023 to 2025. The text consists of the complete newspaper content, including headlines, editorials, finance news, and advertisements. The newspaper is published daily in Karachi, Pakistan.

License: CC-BY-NC-SA-4.0

Locale: snd

Task: NLP

Format: TXT

Size: 2.33 MB

Tamir News Agency

Tamir Sindhi News Corpus

The corpus contains 1.1 million tokens from the Tamir Sindhi newspaper, covering issues published from 2022 to 2025. The text consists of the complete newspaper content, including headlines, editorials, finance news, and advertisements. The newspaper is published daily in Karachi, Pakistan.

License: CC-BY-NC-SA-4.0

Locale: snd

Task: NLP

Format: TXT

Size: 2.56 MB

MEDIAMEN

Mediamen Punjabi Literature Corpus

This corpus is a collection of one million tokens of the Western Punjabi language. The data was produced by the Mediamen publishing agency over the last ten years. The corpus contains works of literature including short stories, novels, fiction, non-fiction, and dramas. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: pnb

Task: NLP

Format: TXT

Size: 1.82 MB

Community

Speech Corpus of Armenian Question-Answer Dialogues

A collection of question-answer dialogues in Western and Eastern Armenian.

License: GPL-3.0

Locale: hy

Task: ASR

Format: WAV, TEXTGRID, TXT

Size: 2.10 GB

Institute of African Digital Humanities

Ewondo-French Parallel Corpus

This dataset is a parallel corpus of Ewondo and French texts. The Ewondo texts were obtained by transcribing raw audio files. French translations were added to enrich the original corpus, and the Ewondo and French texts were aligned in the process of creating this dataset.

License: NOODL-1.0

Locale: ewo, fr

Task: MT

Format: TSV

Size: 137.84 KB

Open Home Foundation

Dimitar 1.0

Text-to-speech dataset for Bulgarian, male speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: bg-BG

Task: TTS

Format: WEBM

Size: 109.58 MB

Tamahi Suneha Magazine

Punjabi Literature Corpus

This corpus contains 1,039,430 tokens of Punjabi in Shahmukhi script.

License: CC-BY-NC-4.0

Locale: pa-PK

Task: OTH

Format: TXT

Size: 1.83 MB

Sujaak Adbi Sangat

Saraiki Quarterly Magazine Wasson Wehray Corpus

This corpus contains 1,179,200 tokens from the Saraiki quarterly magazine "Wasson Wehray".

License: CC-BY-NC-4.0

Locale: skr

Task: OTH

Format: TXT

Size: 2.09 MB

Bismillah Graphics Publishers

Urdu Literature Corpus

This corpus contains 1,617,074 tokens from multiple Urdu literature books published by Bismillah Graphics Publishers.

License: CC-BY-NC-4.0

Locale: ur

Task: OTH

Format: TXT

Size: 2.86 MB

Kaleem Art Press

Saraiki Literature Corpus

This corpus contains multiple Saraiki-language books, including stories, short stories, novels, travelogues, sentences, and collections of articles.

License: CC-BY-NC-4.0

Locale: skr

Task: OTH

Format: TXT

Size: 1.84 MB

Community

Podcast Hari Minggoean (Indonesia)

This dataset is derived from the "Hari Minggoean" podcast, featuring over ten hours of recorded speech from a single, consistent speaker. The content, tailored for a young Indonesian audience, is presented in Indonesian (Bahasa Indonesia) characterized by code-switching with English and a discernible Javanese accent. The collection comprises 42 individual audio files (10+ hours). Sample: "Tapi dari pelafalan, dari intonasi, dari jedanya dia bicara. That's really good. Dan aku sebenarnya suka banget ketika dia ngomong. Yang btw, soal ekspresif tadi. Aku jadi kepikiran deh."

License: CC-BY-SA-4.0

Locale: id-ID

Task: ASR

Format: MP3

Size: 338.92 MB

Kaltepetlahtol

Tetelancingo Nahuatl

Audio of the Nahuatl of the Sierra Oeste de Puebla, transcribed, orthographically normalized, translated, and tagged.

License: CC-BY-NC-4.0

Locale: nhi

Task: ASR

Format: TSV, WAV

Size: 952.98 MB

Common Voice

Mozilla Common Voice Spontaneous Speech ASR Shared Task Train/Dev Data

This datasheet is for the bundle of Mozilla Common Voice spontaneous speech datasets to be used in the Shared Task on Spontaneous Speech.

License: CC0-1.0

Locale: mul

Task: ASR

Format: MP3

Size: 4.30 GB