Gojri Literature Corpus

License:

CC-BY-NC-4.0

Steward:

Forum for Language Initiatives

Task: NLP

Release Date: 2/10/2026

Format: TXT

Size: 117.97 KB

Description

The Gojri Literature Corpus contains approximately 60,821 tokens of Gojri (Gujari) text drawn from poetry, short stories, narrative prose, and question–answer literary books. It reflects creative writing and traditional cultural expression, including social themes, folklore, and community knowledge, and supports linguistic research, NLP tasks (e.g., text analysis and language modeling), and the documentation and preservation of Gojri language and literary heritage.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is intended solely for research, educational, and non-commercial use. Users must comply with the applicable license terms and provide appropriate attribution to the dataset creators and source institution. Commercial use or redistribution without prior permission is not allowed.

Forbidden Usage

- Any attempt to identify individuals, authors, or contributors beyond what is explicitly stated in the dataset metadata
- Use of the dataset to infer personal, sensitive, or private information
- Generating, promoting, or distributing hateful, misleading, or culturally offensive content
- Use of the dataset in commercial or for-profit applications without explicit authorization

Processes

Ethical Review

The corpus was curated from literary and educational Gojri texts under an internal ethical review process by FLI. The dataset contains no sensitive or private personal data. All materials were reviewed to ensure they are suitable for open, non-commercial research use and do not violate copyright or cultural norms. The dataset is shared with respect for linguistic diversity, cultural integrity, and academic transparency.

Intended Use

This dataset is intended for linguistic research, corpus linguistics, low-resource language studies, NLP experimentation (including text analysis and language modeling), digital humanities research, and the preservation and study of Gojri language and literature.

Metadata

Language

Gojri (Gujari) is an Indo-Aryan language spoken by Gujjar communities across northern Pakistan and India, including parts of Kashmir, Khyber Pakhtunkhwa, Punjab, and the Himalayan region. It is closely related to Rajasthani varieties and is used in everyday conversation as well as in oral traditions such as folk songs, storytelling, and poetry. While widely spoken, Gojri has historically had limited published and standardized written resources, which makes curated text corpora important for language documentation, literacy support, and language technology development.

Dataset Structure

Total files: 11 UTF-8 text files
Each file represents a distinct genre or literary domain
File names correspond directly to the content
A cleaned, normalized version of the same 11 files is also included
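
As a minimal sketch of how the files described above might be read, the helper below loads every .txt file in a directory as UTF-8 and keys each text by its file name. The function name load_corpus and the directory layout are assumptions for illustration; the card does not specify folder names or how the raw and cleaned versions are separated.

```python
from pathlib import Path


def load_corpus(root):
    """Read every .txt file directly under `root` as UTF-8.

    Returns a dict mapping the file name (without extension) to its text.
    `root` is a hypothetical directory; the card does not name one.
    """
    texts = {}
    for path in sorted(Path(root).glob("*.txt")):
        texts[path.stem] = path.read_text(encoding="utf-8")
    return texts
```

Calling load_corpus on the folder holding the raw files and again on the folder holding the cleaned files would give two dictionaries that can be compared file by file.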

File-Level Metadata

Seerat Nabi Arbi (Book) – 28,256 tokens
Aqalmand Dhiyani (Story) – 2,046 tokens
Dukh Ki Chaan (Story) – 1,376 tokens
Sacha Qissa Haji Sb (Story) – 11,699 tokens
Sacha Qissa Mumtaz (Story) – 2,808 tokens
Teachers Story – 871 tokens
Maharo Deen (Book – Q&A) – 11,635 tokens
Shaal Kaka Ko Poot (Story) – 330 tokens
Rasheed Ko Ghoro (Story, pictorial text) – 368 tokens
Three Stories with Urdu Translations – 1,306 tokens
Saeen Kaka Ki Bakri (Poem) – 126 tokens
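
The per-file counts above can be checked against the corpus total with a few lines of Python. The counts are copied from the list; the whitespace-based counter is only an assumption, since the card does not state which tokenization scheme produced these figures.

```python
# Token counts copied from the File-Level Metadata list.
FILE_TOKENS = {
    "Seerat Nabi Arbi": 28256,
    "Aqalmand Dhiyani": 2046,
    "Dukh Ki Chaan": 1376,
    "Sacha Qissa Haji Sb": 11699,
    "Sacha Qissa Mumtaz": 2808,
    "Teachers Story": 871,
    "Maharo Deen": 11635,
    "Shaal Kaka Ko Poot": 330,
    "Rasheed Ko Ghoro": 368,
    "Three Stories with Urdu Translations": 1306,
    "Saeen Kaka Ki Bakri": 126,
}


def whitespace_token_count(text):
    """Count tokens by splitting on whitespace (an assumed scheme,
    not necessarily the one used to produce the figures above)."""
    return len(text.split())


total = sum(FILE_TOKENS.values())
print(total)  # 60821, matching the total given in the Description
```

Counts recomputed with whitespace_token_count may differ from the listed figures if the original tokenizer treated punctuation or diacritics differently.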

Cleaning and Preprocessing

UTF-8 encoding
Unicode normalization
Whitespace and punctuation cleanup
Removal of stray symbols and markup
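
The steps listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline actually used by the dataset creators: the card does not say which Unicode normalization form was applied (NFC is assumed here), nor which symbols or markup were removed (HTML-like tags and control characters are assumed).

```python
import re
import unicodedata

_TAG_RE = re.compile(r"<[^>]+>")                  # HTML/XML-like markup (assumed)
_SPACE_RE = re.compile(r"[ \t\u00a0]+")           # runs of spaces, tabs, no-break spaces
_SPACE_BEFORE_PUNCT_RE = re.compile(r" +([،۔؟!])")  # space before Arabic-script punctuation


def clean_text(text):
    """Apply an assumed version of the listed cleaning steps to one string."""
    # Unicode normalization (NFC is an assumption; the card names no form).
    text = unicodedata.normalize("NFC", text)
    # Remove markup remnants, leaving a space so adjacent words stay separated.
    text = _TAG_RE.sub(" ", text)
    # Drop control characters, keeping newlines and tabs for the next step.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Whitespace cleanup: collapse runs of horizontal whitespace.
    text = _SPACE_RE.sub(" ", text)
    # Punctuation cleanup: no space before a comma, full stop, or question mark.
    text = _SPACE_BEFORE_PUNCT_RE.sub(r"\1", text)
    # Trim each line and the text as a whole.
    return "\n".join(line.strip() for line in text.split("\n")).strip()
```

Format characters such as the zero-width non-joiner are deliberately left untouched, since they can affect how Arabic-script words are rendered.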

Sample Text

ساری دنیا، زمین، آسمان تے سارا انسان ایک اللہ نے بنایا۔ اُس نامُچ دُکھ لگو بارے مندا کماں کو مندو نتیجو وھے۔ خانہ بدوش بکروالاں نا سفر ما کئی قسم کا واقعہ پیش آوے۔ رشید کو پوت بیساکھ ما جمیو تے رشید نے مچ بڑی خوشی کری۔