Chishti Sons Punjabi Literature Corpus

License:

CC-BY-NC-4.0

Steward:

Chishti Sons

Task: NLP

Release Date: 11/17/2025

Format: TXT

Size: 1.65 MB


Description

This corpus contains more than one million tokens of Western Punjabi text. The data was produced by the Chishti Sons publishing agency. The corpus consists of literary works, including short stories and other fiction, non-fiction, and other literary books. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

The data may not be used by any organization with annual revenue of more than USD 1 million.

Forbidden Usage

• Generating, promoting, or distributing hate speech, misinformation, or culturally offensive content.
• Commercial or for-profit projects without explicit permission from the dataset creator.
• Any use that misrepresents or distorts Punjabi literature or the works contained in the corpus.

Processes

Ethical Review

The dataset was curated from publicly available or author-shared Punjabi literary sources under ethical self-review by Chishti Sons Publishers. There is no sensitive or copyrighted material in this corpus. The collection aligns with CC-BY-NC-4.0 and principles of cultural respect, transparency, and non-commercial research use.

Intended Use

This dataset is intended for research, education, and non-commercial use in NLP, computational linguistics, and digital humanities. It may be used for training, fine-tuning, or evaluating models for Punjabi (Shahmukhi) language processing, and for linguistic and literary analysis supporting cultural preservation.

Metadata

Language

Punjabi is an Indo-Aryan language spoken in the Punjab region of Pakistan and India. It is one of the most widely spoken native languages in the world, with approximately 150 million native speakers. The language is written in two scripts: Shāhmukhī, based on the Perso-Arabic script and used mostly in Pakistan, and Gurmukhī, derived from Indic scripts and used mostly in India.

Content of the Corpus

The corpus contains the following books, written in Western Punjabi by multiple authors:
• Sunan Ibn-e-Maja
• Tafheem-ul-Quran
• Mukhtasar Muslim
• Adhi Kahani Tay Hor Kahaniyan
• Rab di Akh Naal
• Manto Punjabi Wich (Afsany)
• Main Tay Tun Day Afsaany
• Islam da Taaruf
• Punjabi Afsaany Collection

Variants

Punjabi has two main varieties, Eastern and Western. Eastern Punjabi is spoken in Indian Punjab and is written in an Indic script (Gurmukhī), whereas Western Punjabi is spoken in Punjab, Pakistan, and is written in a Perso-Arabic script (Shāhmukhī).

List of Alphabets

ا آ ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن ݨ ں و ہ ھ ی ے ئ ء
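A letter list like the one above can be used to screen text for characters that fall outside the expected alphabet, for example to spot stray Latin or Gurmukhī characters. The following is a minimal sketch, not part of the corpus tooling: the function names are hypothetical, and the three-letter alphabet in the demo is only an illustration, not the full list.

```python
import unicodedata


def parse_alphabet(line):
    """Collect the distinct letter characters from a space-separated alphabet line.

    Non-letter characters (spaces, asterisks, punctuation) are ignored.
    """
    return {ch for ch in line if unicodedata.category(ch).startswith("L")}


def letters_outside(text, alphabet):
    """Return the letters that occur in `text` but not in `alphabet`."""
    return {
        ch
        for ch in text
        if unicodedata.category(ch).startswith("L") and ch not in alphabet
    }


# Illustrative three-letter alphabet (an assumption for the demo only).
demo_alphabet = parse_alphabet("ا ب پ")

print(letters_outside("باپ", demo_alphabet))  # no out-of-alphabet letters
print(letters_outside("بات", demo_alphabet))  # reports the letter ت
```

In practice the alphabet line would be the full letter list, and any letters reported for a corpus file would point either to foreign-script material or to a gap in the list itself.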

Sample Text

گرمیاں دا موسم چل رہیا سی۔ ایس ورھے امباں دی فصل بڑی چنگی ہوئی سی۔ بازاراں وچ، گلیاں وچ، دکانداراں کول، ریڑھیاں والیاں کول، بندے دا دھیان جدر وی پیندا سی، ہر پاسے امب ای امب وکھالی دیندے سن۔ رب دی اکھ نال ویکھو تے انسان اوہدی بہترین تخلیق اے۔ رب دی اکھ نال ویکھو تے دنیا اینج دی مکمل تربیت گاہ اے جس دی اک مچھر ورگی حقیر شے وی فضول نئیں بنائی گئی۔ بہتر ہووے گا کہ ایس مسئلے تے اسیں گھر اپڑ کے گل کرئیے۔ فی الحال توں تیاری پھڑ، میں آ رہیا واں۔ مسلمان بھرا نال چنگی تے فائدے مند گل کیتی جاوے، تے مسلماناں دیاں لوڑاں نوں پورا کیتا جاوے، تے غریب قرض دار نوں مہلت دتی جاوے، اک دوجے تے قربانی دتی جاوے، تے غم خواری تے تعزیت کیتی جاوے، لوکاں نال ہسدے ہوئے چہرے نال ملیا جاوے۔ یونان دی بت پرستاں والی دھرتی دی گود وچ، اک شاعر دی گیت ورگی سوہنی ہستی پل رہی سی۔۔۔۔۔
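Because the corpus is distributed as plain text and its size is stated in tokens, a rough sentence and token count can be sketched as below. This is an assumed, simplistic scheme (sentences end at the Arabic full stop U+06D4, the Arabic question mark, or "!"; tokens are whitespace-separated after punctuation is removed) and is not necessarily how the stated token count was computed. The function names are hypothetical.

```python
import re

# Sentence terminators: Arabic full stop (U+06D4), Arabic question mark (U+061F), "!".
SENTENCE_END = re.compile("[\u06D4\u061F!]+")

# Punctuation replaced by spaces before tokenizing: Arabic comma (U+060C),
# Arabic semicolon (U+061B), and the ASCII comma and period.
PUNCTUATION = re.compile("[\u060C\u061B,.]")


def split_sentences(text):
    """Split text on sentence terminators and drop empty pieces."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def tokenize(sentence):
    """Whitespace tokenization after punctuation removal (an assumption)."""
    return PUNCTUATION.sub(" ", sentence).split()


# First two sentences of the sample text above.
sample = "گرمیاں دا موسم چل رہیا سی۔ ایس ورھے امباں دی فصل بڑی چنگی ہوئی سی۔"

sentences = split_sentences(sample)
print(len(sentences))                         # 2
print([len(tokenize(s)) for s in sentences])  # [6, 9]
```

A different tokenizer, for instance one that splits clitics or normalizes spacing, would give different counts, so figures produced this way are only indicative.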