Datasets

NaijaVoices (Lanfrica Labs)

Future-proofing Gbagyi: A community centered approach

This dataset comprises 360 audio recordings of the Gbagyi language, totaling approximately 7 hours 52 minutes of speech, with paired transcripts.

License: CC-BY-NC-SA-4.0

Locale: gbr

Task: NLP

Format: WAV

Size: 18.88 GB

Sindh Sujag Newspaper Agency

Sindh Sujag Newspaper Corpus

The corpus contains approximately 1.2 million tokens of text published by the Sindh Sujag Newspaper Agency between 2024 and 2025. It includes complete newspaper content such as headlines, editorials, finance news, and advertisements. The newspaper is published daily in Karachi, Pakistan, and the corpus represents a comprehensive collection of contemporary Sindhi journalistic writing.

License: CC-BY-4.0

Locale: snd

Task: NLP

Format: TXT

Size: 2.63 MB

Aim Foundation

Aim Foundation Dari Literature Corpus

This corpus is a collection of more than seven hundred thousand tokens of Dari text. It contains works of literature, including poems, stories, novels, and other fiction and non-fiction, as well as various articles. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: prs

Task: NLP

Format: TXT

Size: 1.74 MB

Rana Printers Multan

Rana Printers Urdu Literature Corpus

This corpus comprises 1.68 million tokens of high-quality Urdu text collected over the past decade through Rana Printers. It includes a diverse range of literary genres such as stories, short stories, novels, fiction, non-fiction, poetry, and historical works. All content is shared with the authors’ approval. The dataset is intended to support linguistic research, Urdu language technology development, and the preservation of literary and cultural heritage.

License: CC-BY-NC-4.0

Locale: ur

Task: OTH

Format: TXT

Size: 3.00 MB

Anjuman e Katib

Anjuman-e-Katib Farsi/Persian Literature Corpus

This corpus is a collection of more than one million tokens of Farsi/Persian text. It contains works of literature, including novels, other fiction and non-fiction, poems, and more. The data is shared with the approval of the authors and aims to support linguistic research, language technology development, and the preservation of the language.

License: CC-BY-NC-4.0

Locale: fas

Task: NLP

Format: TXT

Size: 2.82 MB

Kaleem Art Press

Kaleem Art Press Urdu Literature Corpus

This corpus is a collection of 1.44 million tokens of Urdu text. The data was produced by Kaleem Art Press over the last fifteen years. The corpus contains works of literature, including stories, short stories, novels, fiction, non-fiction, travelogues, poetry, biography, and history. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: ur

Task: OTH

Format: TXT

Size: 2.85 MB

Kaleem Art Press

Kaleem Art Press Saraiki Literature Corpus

This corpus contains approximately one million tokens of Saraiki text curated over the past ten years by Kaleem Art Press. It features a wide range of literary genres, including stories, short stories, novels, fiction, non-fiction, travelogues, poetry, biographies, and historical writings. All content is shared with full author approval. The dataset is intended to support linguistic research, Saraiki language technology development, and the preservation of cultural and literary heritage.

License: CC-BY-NC-4.0

Locale: skr

Task: OTH

Format: TXT

Size: 1.84 MB

Keblagh e Azergi

Keblagh-e-Azergi Hazargi Literature Corpus

This corpus is a collection of more than one hundred thousand tokens of Hazargi text. It contains works of literature, including poems, folk stories, short stories, and dramas. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: haz

Task: NLP

Format: TXT

Size: 193.28 KB

NaijaVoices (Lanfrica Labs)

Documenting Ekpeye Folktales and Preserving Cultural Heritage

This dataset presents 21 video-recorded Ekpeye folktales (1h28m) narrated by two community elders, each paired with transcripts and English translations that include narrative summaries. It offers a rich multimodal resource for speech, video, storytelling, and cultural heritage research, as well as training multilingual and multimodal AI systems.

License: CC-BY-NC-SA-4.0

Locale: ekp

Task: OTH

Format: MP4, TXT, DOCX

Size: 5.97 GB

NaijaVoices (Lanfrica Labs)

Atyap Afwan_: Preserving Tyap Through Community-Driven Speech Data

This dataset contains 98 recordings (≈1.16 hours) of everyday Tyap speech from 10 community speakers, each paired with detailed transcripts and English translations.

License: CC-BY-NC-SA-4.0

Locale: kcg

Task: NLP

Format: WAV, TXT

Size: 251.51 MB

Institute of African Digital Humanities

Basaa-ALCAM-MultimodalDataset

This dataset comprises a datasheet of lexical entries in Basaa, accompanied by illustrative sentences, word-by-word glosses, and corresponding translations in French. Each entry is enriched with aligned audio recordings, making the resource suitable for linguistic analysis and speech technology development.

License: NOODL-1.0

Locale: bas

Task: NLP

Format: MP3, TSV

Size: 14.66 MB

Common Voice

Mozilla Common Voice Spontaneous Speech ASR Shared Task Test Data

A bundle of the held-out test data for the Mozilla Common Voice Spontaneous Speech ASR shared task.

License: CC0-1.0

Locale: mul

Task: ASR

Format: MP3, TSV

Size: 784.80 MB

NaijaVoices (Lanfrica Labs)

Everyday Interactions in Ibọnọ and Obolo Languages

This dataset offers 11.3 hours of natural everyday speech in Ibọnọ and Obolo, captured from 20 adult speakers across 120 recordings, each paired with a clean transcript and metadata.

License: CC-BY-NC-SA-4.0

Locale: ibn, ann

Task: NLP

Format: WAV, TXT

Size: 2.43 GB

TidyVoice2026 Challenge

TidyVoiceX_ASV

This dataset is designed for speaker verification using the Mozilla Common Voice corpus across 40 languages. It includes approximately 5,000 speakers who each have recordings in more than one language. Leveraging this multilingual overlap, we construct the trial pairs to explore cross-lingual variation in the speaker verification task.

License: CC0-1.0

Locale: mul

Task: OTH

Format: WAV

Size: 36.72 GB

NaijaVoices (Lanfrica Labs)

Ehugbo TTS: biblical text to speech dataset in Ehugbo Language

This dataset contains audio recordings of Bible verses in Ehugbo, a dialect of Igbo (a Niger-Congo language spoken in Nigeria). It comprises 312 recordings for text-to-speech, totaling 1 hour and 30 seconds of speech.

License: CC-BY-NC-SA-4.0

Locale: ig-ehugbo

Task: TTS

Format: WAV

Size: 437.69 MB

NaijaVoices (Lanfrica Labs)

Speech Data Collection for The Nupe Language

This dataset contains audio recordings of the Nupe language. It comprises 1,583 recordings totaling 2 hours, 40 minutes, and 32 seconds of speech, with paired transcripts. The recordings feature 8 unique speakers representing three distinct Nupe accent varieties: Bida, Kutigi, and Lapai.

License: CC-BY-NC-SA-4.0

Locale: nup

Task: NLP

Format: WAV, TXT

Size: 1.58 GB

Akylai

KyrgyzNER: Human-Annotated NER Dataset for Kyrgyz

KyrgyzNER is the first manually annotated Named-Entity Recognition (NER) dataset for the Kyrgyz language. It consists of 1,499 news articles (10,900 sentences, 140k tokens) from the 24.kg news portal, annotated with 39,075 entity mentions across 27 classes.

License: CC-BY-NC-SA-4.0

Locale: ky

Task: NLP

Format: CONLL-2003

Size: 585.87 KB

Pro Svizra Rumantscha

Putèr Newspaper Corpus

1.3 million tokens in the Putèr variety of Romansh from the daily newspaper “La Quotidiana”.

License: CC0-1.0

Locale: rm-puter

Task: OTH

Format: TSV

Size: 8.94 MB

Pro Svizra Rumantscha

Sutsilvan Newspaper Corpus

1.3 million tokens in the Sutsilvan variety of Romansh from the daily newspaper “La Quotidiana”.

License: CC0-1.0

Locale: rm-sutsilv

Task: OTH

Format: TSV

Size: 8.87 MB

Pro Svizra Rumantscha

Sursilvan Newspaper Corpus

14.6 million tokens in the Sursilvan variety of Romansh from the daily newspaper “La Quotidiana”.

License: CC0-1.0

Locale: rm-sursilv

Task: OTH

Format: TSV

Size: 37.80 MB

Pro Svizra Rumantscha

Rumantsch Grischun Newspaper Corpus

6.1 million tokens in the Rumantsch Grischun variety of Romansh from the daily newspaper “La Quotidiana”.

License: CC0-1.0

Locale: rm-rumgr

Task: OTH

Format: TSV

Size: 19.03 MB

Community

Podcast Homostoria (Indonesia)

This dataset features discussions of modern media (including film, podcasts, and social media) and their connection to local customs and traditions. The conversations are primarily in Indonesian, with frequent code-switching into English and Javanese.

License: CC-BY-SA-4.0

Locale: id

Task: ASR

Format: MP3

Size: 302.97 MB

Open Home Foundation

Imre 1.0

Text-to-speech dataset for Hungarian, male speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: hu-HU

Task: TTS

Format: WEBM

Size: 99.60 MB

Open Home Foundation

Berta 1.0

Text-to-speech dataset for Hungarian, female speaker, approximately 1 hour of read speech.

License: CC0-1.0

Locale: hu-HU

Task: TTS

Format: FLAC

Size: 209.52 MB