Datasets

Institute of African Digital Humanities

Ewondo_Fong_ALCAM-MultimodalDataset

A multimodal linguistic resource comprising a curated datasheet of example sentences in Ewondo (Fong variety) and their French equivalents, along with their corresponding audio recordings and a sentence–audio alignment file. It is designed to support research, documentation and pedagogy in the field of speech and language technology for under-resourced African languages.
License: NOODL-1.0

Locale: ewo

Task: NLP

Format: MP3, TSV

Size: 16.80 MB

Digital Divide Data

Luhya ASR data subset 70 hours

A 70-hour subset of Luhya speech data collected by Digital Divide Data in Kenya. The dataset includes recorded sentences from native speakers and is intended to support research and development in Automatic Speech Recognition for low-resource African languages.
License: CC-BY-4.0

Locale: luy

Task: ASR

Format: WAV, XLSX

Size: 13.90 GB

MEDIAMEN

English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)

This parallel sentence corpus contains 30,405 aligned sentence pairs, totalling approximately 0.62 million tokens, curated from the archival materials of Mediamen (an advertising agency). The sentences were professionally translated from English into Punjabi (Shahmukhi) and are intended to support machine translation, linguistic research, and Punjabi language technology development, particularly for real-world and contemporary language use.
License: CC-BY-NC-4.0

Locale: en-PK, pnb

Task: MT

Format: CSV

Size: 1.08 MB

ComparIA

Compar:IA conversations

A French-language dataset of conversations with conversational AI models and associated user preferences, comprising 396K conversations from 50+ LLMs.
License: Etalab 2.0

Locale: fr

Task: NLG

Format: PARQUET

Size: 1.81 GB

Institute of African Digital Humanities

Adamawa Fulfulde - French Parallel Corpus of Narratives 1.0

Version 1.0 of the Adamawa Fulfulde–French Parallel Corpus of Narratives comprises 1,977 lines of Adamawa Fulfulde narratives and their French translations.
License: NOODL-1.0

Locale: fub

Task: MT

Format: TSV

Size: 112.50 KB

OpenCSG

smoltalk-chinese

SmolTalk-Chinese: A multi-task Chinese conversational dataset covering 19 typical dialogue task scenarios.
License: Apache-2.0

Locale: zh

Task: LLM

Format: PARQUET

Size: 879.81 MB

MEDIAMEN

Mediamen Punjabi Literature Corpus

This corpus is a collection of one million tokens of Western Punjabi text. The data was produced under the Mediamen publishing agency over the last ten years. The corpus contains works of literature including short stories, novels, fiction, non-fiction, and dramas. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.
License: CC-BY-NC-4.0

Locale: pnb

Task: NLP

Format: TXT

Size: 1.82 MB

Chishti Sons

Chishti Sons Punjabi Literature Corpus

This corpus is a collection of more than one million tokens of Western Punjabi text. The data was produced under the Chishti Sons publishing agency. The corpus contains works of literature including stories, short stories, fiction, non-fiction, and other literary books. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.
License: CC-BY-NC-4.0

Locale: pnb

Task: NLP

Format: TXT

Size: 1.65 MB

Baloch Publishers Multan

Baloch Publishers Saraiki Literature Corpus

This corpus is a collection of one million tokens of Saraiki text. The data was produced by Baloch Publishers over the last ten years. The corpus contains works of literature including dictionaries, short stories, novels, fiction, non-fiction, and travelogues. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.
License: CC-BY-NC-4.0

Locale: skr

Task: NLP

Format: TXT

Size: 2.04 MB

Weekly Kaleem Magazine Multan

Kaleem Magazine Urdu Corpus

This corpus is a collection of around 1.4 million tokens of Urdu text. The data was extracted from the archives of the well-known Urdu magazine "Kaleem", which has been published weekly for the last 30 years. The corpus contains works of literature including stories, short stories, news, poetry, literary reports, fiction, non-fiction, and travelogues. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.
License: CC-BY-NC-4.0

Locale: urd

Task: NLP

Format: TXT

Size: 2.74 MB

Kaleem Art Press

Kaleem Art Press Urdu Literature Corpus

This corpus is a collection of 1.44 million tokens of Urdu text. The data was produced by Kaleem Art Press over the last fifteen years. The corpus contains works of literature including stories, short stories, novels, fiction, non-fiction, travelogues, poetry, biography, and history. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.
License: CC-BY-NC-4.0

Locale: ur

Task: OTH

Format: TXT

Size: 2.85 MB

Rana Printers Multan

Rana Printers Urdu Literature Corpus

This corpus comprises 1.68 million tokens of high-quality Urdu text collected over the past decade through Rana Printers. It includes a diverse range of literary genres such as stories, short stories, novels, fiction, non-fiction, poetry, and historical works. All content is shared with the authors’ approval. The dataset is intended to support linguistic research, Urdu language technology development, and the preservation of literary and cultural heritage.
License: CC-BY-NC-4.0

Locale: ur

Task: OTH

Format: TXT

Size: 3.00 MB

Kaleem Art Press

Multilingual Religious Parallel Corpus (Kaleem Art Press)

This dataset is a multilingual parallel sentences corpus containing 6,465 aligned sentence units with approximately 0.98 million words, curated from Kaleem Art Press archives. It includes parallel religious text data in Arabic, Urdu, Saraiki (standard and dialectal), Punjabi (Shahmukhi), and English, supporting research in machine translation, comparative linguistics, digital humanities, and low-resource language studies.
License: CC-BY-SA-4.0

Locale: mul

Task: MT

Format: CSV

Size: 2.27 MB

Keblagh e Azergi

Keblagh-e-Azergi Hazargi literature corpus

This corpus is a collection of more than one hundred thousand tokens of Hazargi text. The corpus contains works of literature including poems, folk stories, short stories, and dramas. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.
License: CC-BY-NC-4.0

Locale: haz

Task: NLP

Format: TXT

Size: 193.28 KB

Kaleem Art Press

Kaleem Art Press Saraiki Literature Corpus

This corpus contains approximately one million tokens of Saraiki text curated over the past ten years by Kaleem Art Press. It features a wide range of literary genres, including stories, short stories, novels, fiction, non-fiction, travelogues, poetry, biographies, and historical writings. All content is shared with full author approval. The dataset is intended to support linguistic research, Saraiki language technology development, and the preservation of cultural and literary heritage.
License: CC-BY-NC-4.0

Locale: skr

Task: OTH

Format: TXT

Size: 1.84 MB

Anjuman e Katib

Anjuman-e-Katib Farsi/Persian Literature Corpus

This corpus is a collection of more than one million tokens of Farsi/Persian text. The corpus contains works of literature including novels, fiction, non-fiction, poems, and more. The data is shared with the approval of the authors and aims to support linguistic research, language technology development, and the preservation of the language.
License: CC-BY-NC-4.0

Locale: fas

Task: NLP

Format: TXT

Size: 2.82 MB

Aim Foundation

Aim Foundation Dari Literature Corpus

This corpus is a collection of more than seven hundred thousand tokens of Dari text. The corpus contains works of literature including poems, stories, novels, fiction, non-fiction, and various articles. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.
License: CC-BY-NC-4.0

Locale: prs

Task: NLP

Format: TXT

Size: 1.74 MB