Datasets

NaijaVoices (Lanfrica Labs)

Ehugbo TTS: biblical text to speech dataset in Ehugbo Language

This dataset contains audio recordings of Bible verses in Ehugbo, a dialect of Igbo (a Niger-Congo language spoken in Nigeria). It comprises 312 recordings for text-to-speech, totaling 1 hour and 30 seconds of speech.

License: CC-BY-NC-SA-4.0

Locale: ig-ehugbo

Task: TTS

Format: WAV

Size: 437.69 MB

NaijaVoices (Lanfrica Labs)

Speech Data Collection for The Nupe Language

This dataset contains audio recordings of the Nupe language. It comprises 1,583 recordings totaling 2 hours, 40 minutes, and 32 seconds of speech, with paired transcripts. The recordings feature 8 unique speakers representing three distinct Nupe accent varieties: Bida, Kutigi, and Lapai.

License: CC-BY-NC-SA-4.0

Locale: nup

Task: NLP

Format: WAV, TXT

Size: 1.58 GB

Akylai

KyrgyzNER: Human-Annotated NER Dataset for Kyrgyz

KyrgyzNER is the first manually annotated Named-Entity Recognition (NER) dataset for the Kyrgyz language. It consists of 1,499 news articles (10,900 sentences, 140k tokens) from the 24.kg news portal, annotated with 39,075 entity mentions across 27 classes.

License: CC-BY-NC-SA-4.0

Locale: ky

Task: NLP

Format: CONLL-2003

Size: 585.87 KB

Pro Svizra Rumantscha

Putèr Newspaper Corpus

1.3 million tokens in the Putèr variety of Romansh from the daily newspaper “La Quotidiana”.

License: CC0-1.0

Locale: rm-puter

Task: OTH

Format: TSV

Size: 8.94 MB

Pro Svizra Rumantscha

Sutsilvan Newspaper Corpus

1.3 million tokens in the Sutsilvan variety of Romansh from the daily newspaper “La Quotidiana”.

License: CC0-1.0

Locale: rm-sutsilv

Task: OTH

Format: TSV

Size: 8.87 MB

Pro Svizra Rumantscha

Sursilvan Newspaper Corpus

14.6 million tokens in the Sursilvan variety of Romansh from the daily newspaper “La Quotidiana”.

License: CC0-1.0

Locale: rm-sursilv

Task: OTH

Format: TSV

Size: 37.80 MB

Pro Svizra Rumantscha

Rumantsch Grischun Newspaper Corpus

6.1 million tokens in the Rumantsch Grischun variety of Romansh from the daily newspaper “La Quotidiana”.

License: CC0-1.0

Locale: rm-rumgr

Task: OTH

Format: TSV

Size: 19.03 MB

Community

Podcast Homostoria (Indonesia)

This dataset features discussions on modern media, including film, podcasts, and social media, and its connection to local customs and traditions. The conversations are primarily in Indonesian, with frequent code-switching into English and Javanese.

License: CC-BY-SA-4.0

Locale: id

Task: ASR

Format: mp3

Size: 302.97 MB

Open Home Foundation

Imre 1.0

Text to speech dataset for Hungarian, male speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: hu-HU

Task: TTS

Format: WEBM

Size: 99.60 MB

Open Home Foundation

Berta 1.0

Text to speech dataset for Hungarian, female speaker, approximately 1 hour of read speech.

License: CC0-1.0

Locale: hu-HU

Task: TTS

Format: FLAC

Size: 209.52 MB

Open Home Foundation

Anna 1.0

Text to speech dataset for Hungarian, female speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: hu-HU

Task: TTS

Format: WEBM

Size: 95.27 MB

Open Home Foundation

Dave 1.0

Text to speech dataset for Spanish, male speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: es-ES

Task: TTS

Format: WEBM

Size: 85.24 MB

Open Home Foundation

Kathleen 1.0

Text to speech dataset for English, female speaker, approximately 1 hour of read speech.

License: CC0-1.0

Locale: en-US

Task: TTS

Format: FLAC

Size: 211.96 MB

Open Home Foundation

Joe 1.0

Text to speech dataset for English, male speaker, approximately 1 hour of read speech.

License: CC0-1.0

Locale: en-US

Task: TTS

Format: WEBM

Size: 75.78 MB

Open Home Foundation

Kerstin 1.0

Text to speech dataset for German, female speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WEBM

Size: 132.05 MB

Rerooted Archive

ReRooted: Speech Corpus of Testimonials from Armenian Refugees and Immigrants

ReRooted is an open-access online corpus of YouTube interviews with Armenian refugees and immigrants. The online corpus currently has over 80 hours of recordings on YouTube, alongside Armenian subtitles from Amara. This dataset is a 10-hour subset of the corpus that we have curated with corrected and finely annotated transcripts. It is intended for use in NLP research, and we hope to keep updating it with more curated transcripts.

License: GPL-3.0

Locale: hy

Task: ASR

Format: WAV, TEXTGRID

Size: 3.25 GB

Weekly Kaleem Magazine Multan

Kaleem Magazine Urdu Corpus

This corpus is a collection of around 1.4 million tokens of Urdu. The data was extracted from the archives of "Kaleem", a well-known Urdu magazine published weekly for the last 30 years. The corpus contains works of literature including stories, short stories, news, poetry, literary reports, fiction, non-fiction, and travelogues. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: urd

Task: NLP

Format: TXT

Size: 2.74 MB

Baloch Publishers Multan

Baloch Publishers Saraiki Literature Corpus

This corpus is a collection of one million tokens of the Saraiki language. The data was produced under Baloch Publishers over the last ten years. The corpus contains works of literature including dictionaries, short stories, novels, fiction, non-fiction, and travelogues. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: skr

Task: NLP

Format: TXT

Size: 2.04 MB

Chishti Sons

Chishti Sons Punjabi Literature Corpus

This corpus is a collection of more than one million tokens of the Western Punjabi language. The data was produced under the Chishti Sons publishing agency. The corpus contains works of literature including short stories, stories, fiction, non-fiction, and other literary books. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: pnb

Task: NLP

Format: TXT

Size: 1.65 MB

Institute of African Digital Humanities

FUB-Narratives

This dataset contains literary texts derived from oral Fulfulde Adamawa (fub) performances. The texts are of various genres, including narratives, hymns, riddles and poems. The performances were recorded on tape and then transcribed with the help of local Adamawa Fulfulde language experts.

License: NOODL-1.0

Locale: fub

Task: NLP

Format: TXT

Size: 168.34 KB

Jazab Publishers

Jazab Sindhi Newspaper Corpus

The corpus contains 1.07 million tokens from Jazab, a Sindhi newspaper, covering issues published from 2023 to 2025. The text consists of the complete newspaper content, including headlines, editorials, finance news, and advertisements. The newspaper is published daily in Karachi, Pakistan.

License: CC-BY-NC-SA-4.0

Locale: snd

Task: NLP

Format: TXT

Size: 2.33 MB

Tamir News Agency

Tamir Sindhi News Corpus

The corpus contains 1.1 million tokens from the Tamir Sindhi newspaper, covering issues published from 2022 to 2025. The text consists of the complete newspaper content, including headlines, editorials, finance news, and advertisements. The newspaper is published daily in Karachi, Pakistan.

License: CC-BY-NC-SA-4.0

Locale: snd

Task: NLP

Format: TXT

Size: 2.56 MB

MEDIAMEN

Mediamen Punjabi Literature Corpus

This corpus is a collection of one million tokens of the Western Punjabi language. The data was produced under the Mediamen publishing agency over the last ten years. The corpus contains works of literature including short stories, novels, fiction, non-fiction, and dramas. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: pnb

Task: NLP

Format: TXT

Size: 1.82 MB

Community

Speech Corpus of Armenian Question-Answer Dialogues

A collection of question-answer dialogues in Western and Eastern Armenian.

License: GPL-3.0

Locale: hy

Task: ASR

Format: WAV, TEXTGRID, TXT

Size: 2.10 GB