Gojri Literature Corpus

License: CC-BY-NC-4.0

Steward: Forum for Language Initiatives

Task: NLP

Release Date: 2/10/2026

Format: TXT

Size: 117.97 KB

Description

The Gojri Literature Corpus contains approximately 60,821 tokens of Gojri (Gujari) text drawn from poetry, short stories, narrative prose, and question–answer literary books. It reflects creative writing and traditional cultural expression, including social themes, folklore, and community knowledge. The corpus supports linguistic research, NLP tasks (e.g., text analysis and language modeling), and the documentation and preservation of the Gojri language and its literary heritage.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is intended solely for research, educational, and non-commercial use. Users must comply with the applicable license terms and provide appropriate attribution to the dataset creators and source institution. Commercial use or redistribution without prior permission is not allowed.

Forbidden Usage

  • Any attempt to identify individuals, authors, or contributors beyond what is explicitly stated in the dataset metadata

  • Use of the dataset to infer personal, sensitive, or private information

  • Generating, promoting, or distributing hateful, misleading, or culturally offensive content

  • Use of the dataset in commercial or for-profit applications without explicit authorization

Processes

Ethical Review

The corpus was curated from literary and educational Gojri texts under an internal ethical review process conducted by the Forum for Language Initiatives (FLI). The dataset contains no sensitive or private personal data. All materials were reviewed to ensure they are suitable for open, non-commercial research use and do not violate copyright or cultural norms. The dataset is shared with respect for linguistic diversity, cultural integrity, and academic transparency.

Intended Use

This dataset is intended for linguistic research, corpus linguistics, low-resource language studies, NLP experimentation (including text analysis and language modeling), digital humanities research, and the preservation and study of Gojri language and literature.

Metadata

Language

Gojri (Gujari) is an Indo-Aryan language spoken by Gujjar communities across northern Pakistan and India, including parts of Kashmir, Khyber Pakhtunkhwa, Punjab, and the Himalayan region. It is closely related to Rajasthani varieties and is used in everyday conversation as well as in oral traditions such as folk songs, storytelling, and poetry. While widely spoken, Gojri has historically had limited published and standardized written resources, which makes curated text corpora important for language documentation, literacy support, and language technology development.

Dataset Structure

  • Total files: 11 UTF-8 text files

  • Each file represents a distinct genre or literary domain

  • File names correspond directly to the content

  • A cleaned, normalized version of each of the 11 files is also included
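
The structure above suggests a simple loading routine. The following Python sketch reads every UTF-8 text file in a folder and counts whitespace-separated tokens. The function names and the folder layout are assumptions for illustration, and the card does not state how its token counts were produced, so a whitespace split may not reproduce the published figures exactly.

```python
from pathlib import Path


def load_corpus(corpus_dir: str) -> dict[str, str]:
    """Read every .txt file in corpus_dir as UTF-8, keyed by file name without extension."""
    texts = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        texts[path.stem] = path.read_text(encoding="utf-8")
    return texts


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens (an assumed tokenization, not the card's)."""
    return len(text.split())
```

Calling load_corpus on the folder holding the extracted files (the path depends on where the download was unpacked) returns one entry per file, and count_tokens can then be applied to each value to compare against the per-file counts listed in this card.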

File-Level Metadata

  1. Seerat Nabi Arbi (Book) – 28,256 tokens

  2. Aqalmand Dhiyani (Story) – 2,046 tokens

  3. Dukh Ki Chaan (Story) – 1,376 tokens

  4. Sacha Qissa Haji Sb (Story) – 11,699 tokens

  5. Sacha Qissa Mumtaz (Story) – 2,808 tokens

  6. Teachers Story – 871 tokens

  7. Maharo Deen (Book – Q&A) – 11,635 tokens

  8. Shaal Kaka Ko Poot (Story) – 330 tokens

  9. Rasheed Ko Ghoro (Story, pictorial text) – 368 tokens

  10. Three Stories with Urdu Translations – 1,306 tokens

  11. Saeen Kaka Ki Bakri (Poem) – 126 tokens
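
As a consistency check, the per-file counts listed above can be summed and compared with the corpus total of 60,821 tokens given in the Description. The short Python sketch below does this and also prints each file's share of the corpus. The dictionary simply restates the figures from this card; the variable names are illustrative.

```python
# Per-file token counts as listed in the File-Level Metadata section.
FILE_TOKENS = {
    "Seerat Nabi Arbi": 28_256,
    "Aqalmand Dhiyani": 2_046,
    "Dukh Ki Chaan": 1_376,
    "Sacha Qissa Haji Sb": 11_699,
    "Sacha Qissa Mumtaz": 2_808,
    "Teachers Story": 871,
    "Maharo Deen": 11_635,
    "Shaal Kaka Ko Poot": 330,
    "Rasheed Ko Ghoro": 368,
    "Three Stories with Urdu Translations": 1_306,
    "Saeen Kaka Ki Bakri": 126,
}

total = sum(FILE_TOKENS.values())

# Files ordered from largest to smallest.
ranked = sorted(FILE_TOKENS.items(), key=lambda item: item[1], reverse=True)
top3_share = sum(count for _, count in ranked[:3]) / total

for name, count in ranked:
    print(f"{name:<40} {count:>7,} {count / total:>6.1%}")
print(f"{'Total':<40} {total:>7,}")
```

The sum matches the stated total of 60,821 tokens. The three largest files (Seerat Nabi Arbi, Sacha Qissa Haji Sb, and Maharo Deen) together hold roughly 85% of the tokens, which is worth keeping in mind when drawing genre-level conclusions or building train/test splits.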

Cleaning and Preprocessing

  • UTF-8 encoding

  • Unicode normalization

  • Whitespace and punctuation cleanup

  • Removal of stray symbols and markup
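
The steps above can be approximated with the Python standard library. The sketch below is not the pipeline used to produce the cleaned files: the card does not specify the normalization form or the exact symbol set removed, so the choice of NFC, the tag pattern, and the punctuation list are all assumptions made for illustration.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>\n]+>")                   # leftover HTML/XML-style tags (assumed markup)
SPACE_RE = re.compile(r"[ \t\u00a0]+")              # runs of spaces, tabs, no-break spaces
SPACE_BEFORE_PUNCT_RE = re.compile(r"[ ]+([،۔؟!])")  # space before Urdu-script punctuation


def clean_text(raw: str) -> str:
    """Normalize and tidy one document. NFC is an assumed normalization form."""
    text = unicodedata.normalize("NFC", raw)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = TAG_RE.sub(" ", text)
    text = SPACE_RE.sub(" ", text)
    # Drop control characters but keep newlines; format characters such as
    # the zero-width non-joiner are left alone because they can be meaningful.
    text = "".join(
        ch for ch in text if ch == "\n" or unicodedata.category(ch) != "Cc"
    )
    text = SPACE_BEFORE_PUNCT_RE.sub(r"\1", text)
    return "\n".join(line.strip() for line in text.split("\n"))
```

Whitespace is collapsed before control characters are dropped so that a tab between two words becomes a space rather than silently joining them.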

Sample Text

ساری دنیا، زمین، آسمان تے سارا انسان ایک اللہ نے بنایا۔ اُس نامُچ دُکھ لگو بارے مندا کماں کو مندو نتیجو وھے۔ خانہ بدوش بکروالاں نا سفر ما کئی قسم کا واقعہ پیش آوے۔ رشید کو پوت بیساکھ ما جمیو تے رشید نے مچ بڑی خوشی کری۔