Datasets

Search results for “support”
Institute of African Digital Humanities

Ewondo_Fong_ALCAM-MultimodalDataset

A multimodal linguistic resource comprising a curated datasheet of example sentences in Ewondo (Fong variety) and their French equivalents, along with their corresponding audio recordings and a sentence–audio alignment file. It is designed to support research, documentation and pedagogy in the field of speech and language technology for under-resourced African languages.

License: NOODL-1.0

Locale: ewo

Task: NLP

Format: MP3, TSV

Size: 16.80 MB

Collaborative Action For Research & Development (CARD)

Gawri (گاؤری) Magazine Corpus

The Gawri (گاؤری) Magazine Corpus is a curated collection of monthly Gawri magazine text (~67,724 tokens) that supports research and language technology.

License: CC-BY-NC-4.0

Locale: gwc

Task: NLP

Format: TXT

Size: 146.71 KB

Institute of African Digital Humanities

Adamawa Fulfulde-French Parallel Corpus of Narratives 1.2

This dataset is an updated version of the 'Adamawa Fulfulde–French Parallel Corpus of Narratives 1.1'.

License: NOODL-1.0

Locale: fub

Task: MT

Format: TSV

Size: 112.17 KB

ComparIA

Compar:IA conversations

French-language conversation and preference dataset for conversational AI, with 396K conversations from 50+ LLMs

License: Etalab 2.0

Locale: fr

Task: NLG

Format: PARQUET

Size: 1.81 GB

Universitas Gadjah Mada

Jember Javanese Spontaneous Speech Corpus

A 10-hour spoken dataset of native Javanese speakers from Jember, East Java, representing the Jember dialect and Pandhalungan variety in natural speech.

License: CC-BY-NC-SA-4.0

Locale: jav

Task: ASR

Format: MP3, TSV

Size: 271.65 MB

Balochistan Educational and Cultural Organization

Western Balochi Literature Corpus

A cleaned literary corpus of about 1.1 million tokens in Western Balochi (Rakhshani), including articles, research, translations, and creative literature.

License: CC-BY-NC-4.0

Locale: bgn

Task: NLP

Format: TXT

Size: 2.26 MB

OpenCSG

smoltalk-chinese

SmolTalk-Chinese: A multi-task Chinese conversational dataset covering 19 typical dialogue task scenarios.

License: Apache-2.0

Locale: zh

Task: LLM

Format: PARQUET

Size: 879.81 MB

Balochistan Educational and Cultural Organization

Brahui Research Work Corpus

This corpus consists of research works, academic theses, and scholarly papers, comprising approximately 185,000 tokens. It covers a range of academic topics and formal registers, reflecting standardized writing practices and disciplinary conventions. The corpus is intended to support linguistic analysis, discourse studies, and computational research on academic and research-oriented text.

License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 1.13 MB

Forum for Language Initiatives

Kohistani Shina Word List

A UTF-8 encoded Kohistani Shina dictionary and word list corpus with lexical entries, meanings, and annotations for linguistic research and NLP use.

License: CC-BY-NC-4.0

Locale: plk

Task: NLP

Format: TXT

Size: 394.05 KB

Forum for Language Initiatives

Khowar Word List

A UTF-8 encoded Khowar lexical corpus containing alphabet definitions and a structured word list for linguistic research and language technology development.

License: CC-BY-NC-4.0

Locale: khw

Task: NLP

Format: TXT

Size: 64.22 KB

Forum for Language Initiatives

Khowar Literature Corpus by FLI

A multi-genre Khowar language corpus designed for linguistic research, NLP applications, and cultural documentation.

License: CC-BY-NC-4.0

Locale: khw

Task: NLP

Format: TXT

Size: 244.85 KB

Forum for Language Initiatives

Gojri Literature Corpus

A curated Gojri (Gujari) text corpus of approximately 60K tokens covering poetry, stories, short stories, and literary prose.

License: CC-BY-NC-4.0

Locale: gju

Task: NLP

Format: TXT

Size: 117.97 KB

Chishti Sons

Chishti Sons Punjabi Literature Corpus

This corpus is a collection of more than one million tokens of Western Punjabi text. The data was produced under the Chishti Sons publishing agency. The corpus contains works of literature including short stories, stories, fiction, non-fiction, and other literary books. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: pnb

Task: NLP

Format: TXT

Size: 1.65 MB

Baloch Publishers Multan

Baloch Publishers Saraiki Literature Corpus

This corpus is a collection of one million tokens of Saraiki text. The data was produced under Baloch Publishers over the last ten years. The corpus contains works of literature including dictionaries, short stories, novels, fiction, non-fiction, and travelogues. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: skr

Task: NLP

Format: TXT

Size: 2.04 MB

Weekly Kaleem Magazine Multan

Kaleem Magazine Urdu Corpus

This corpus is a collection of around 1.4 million tokens of Urdu text. The data was extracted from the archives of "Kaleem", a famous Urdu magazine published weekly for the last 30 years. The corpus contains works of literature including stories, short stories, news, poetry, literary reports, fiction, non-fiction, and travelogues. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: urd

Task: NLP

Format: TXT

Size: 2.74 MB

Kaleem Art Press

Kaleem Art Press Urdu Literature Corpus

This corpus is a collection of 1.44 million tokens of Urdu text. The data was produced under Kaleem Art Press over the last fifteen years. The corpus contains works of literature including stories, short stories, novels, fiction, non-fiction, travelogues, poetry, biography, and history. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: ur

Task: OTH

Format: TXT

Size: 2.85 MB

Rana Printers Multan

Rana Printers Urdu Literature Corpus

This corpus comprises 1.68 million tokens of high-quality Urdu text collected over the past decade through Rana Printers. It includes a diverse range of literary genres such as stories, short stories, novels, fiction, non-fiction, poetry, and historical works. All content is shared with the authors’ approval. The dataset is intended to support linguistic research, Urdu language technology development, and the preservation of literary and cultural heritage.

License: CC-BY-NC-4.0

Locale: ur

Task: OTH

Format: TXT

Size: 3.00 MB

Kaleem Art Press

Kaleem Art Press Saraiki Literature Corpus

This corpus contains approximately one million tokens of Saraiki text curated over the past ten years by Kaleem Art Press. It features a wide range of literary genres, including stories, short stories, novels, fiction, non-fiction, travelogues, poetry, biographies, and historical writings. All content is shared with full author approval. The dataset is intended to support linguistic research, Saraiki language technology development, and the preservation of cultural and literary heritage.

License: CC-BY-NC-4.0

Locale: skr

Task: OTH

Format: TXT

Size: 1.84 MB

Anjuman e Katib

Anjuman-e-Katib Farsi/Persian Literature Corpus

This corpus is a collection of more than one million tokens of Farsi/Persian text. The corpus contains works of literature including novels, fiction and non-fiction, poems, and more. The data is shared with the approval of the authors and aims to support linguistic research, language technology development, and language preservation.

License: CC-BY-NC-4.0

Locale: fas

Task: NLP

Format: TXT

Size: 2.82 MB

MEDIAMEN

Mediamen Punjabi Literature Corpus

This corpus is a collection of one million tokens of Western Punjabi text. The data was produced under the Mediamen publishing agency over the last ten years. The corpus contains works of literature including short stories, novels, fiction, non-fiction, and dramas. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: pnb

Task: NLP

Format: TXT

Size: 1.82 MB

Kaleem Art Press

Multilingual Religious Parallel Corpus (Kaleem Art Press)

This dataset is a multilingual parallel sentence corpus containing 6,465 aligned sentence units with approximately 0.98 million words, curated from the Kaleem Art Press archives. It includes parallel religious text in Arabic, Urdu, Saraiki (standard and dialectal), Punjabi (Shahmukhi), and English, supporting research in machine translation, comparative linguistics, digital humanities, and low-resource language studies.

License: CC-BY-SA-4.0

Locale: mul

Task: MT

Format: CSV

Size: 2.27 MB

Keblagh e Azergi

Keblagh-e-Azergi Hazargi literature corpus

This corpus is a collection of more than one hundred thousand tokens of Hazargi text. The corpus contains works of literature, including poems, folk and short stories, and dramas. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: haz

Task: NLP

Format: TXT

Size: 193.28 KB

Digital Divide Data

Luhya ASR data subset 70 hours

A 70-hour subset of Luhya speech data collected by Digital Divide Data in Kenya. The dataset includes recorded sentences from native speakers and is intended to support research and development in Automatic Speech Recognition for low-resource African languages.

License: CC-BY-4.0

Locale: luy

Task: ASR

Format: WAV, XLSX

Size: 13.90 GB

Aim Foundation

Aim Foundation Dari Literature Corpus

This corpus is a collection of more than seven hundred thousand tokens of Dari text. The corpus contains works of literature including poems, stories, novels, fiction and non-fiction, and various articles. The data is being shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: prs

Task: NLP

Format: TXT

Size: 1.74 MB