Chishti Sons Punjabi Literature Corpus
License: CC-BY-NC-4.0
Steward: Chishti Sons
Task: NLP
Release Date: 11/17/2025
Format: TXT
Size: 1.65 MB
Description
This corpus is a collection of more than one million tokens of Western Punjabi text. The data was produced by the Chishti Sons publishing agency. It contains works of literature, including short stories, fiction, non-fiction, and other literary books. The data is shared with the approval of the authors and aims to support linguistic research, language technology development, and cultural preservation.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
The data may not be used by any organization with annual revenue of more than one million US dollars.
Forbidden Usage
The following uses are forbidden:
• Generating, promoting, or distributing hate speech, misinformation, or culturally offensive content.
• Commercial or for-profit projects without explicit permission from the dataset creator.
• Any use that misrepresents or distorts Punjabi literature or the works contained within.
Processes
Ethical Review
The dataset was curated from publicly available or author-shared Punjabi literary sources under ethical self-review by Chishti Sons Publishers. There is no sensitive or copyrighted material in this corpus. The collection aligns with CC-BY-NC-4.0 and principles of cultural respect, transparency, and non-commercial research use.
Intended Use
This dataset is intended for research, education, and non-commercial use in NLP, computational linguistics, and digital humanities. It may be used for training, fine-tuning, or evaluating models for Punjabi (Shahmukhi) language processing, and for linguistic and literary analysis supporting cultural preservation.
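As a starting point for the linguistic analysis mentioned above, the following is a minimal sketch of counting tokens and token types in Shahmukhi text. The card does not define how its tokens were counted, so whitespace tokenization with punctuation stripping is an assumption here, as are the separator set and the file name shown in the comment.

```python
import re
from collections import Counter

# Assumption: punctuation treated as token separators. The escapes cover the
# Arabic-script full stop, comma, semicolon, and question mark.
SEPARATORS = re.compile(r"[\u06D4\u060C\u061B\u061F.,:;!?()\[\]\"'\-]+")


def tokenize(text: str) -> list[str]:
    """Replace punctuation with spaces, then split on whitespace."""
    return SEPARATORS.sub(" ", text).split()


def token_stats(text: str) -> tuple[int, Counter]:
    """Return the total token count and a frequency table of token types."""
    tokens = tokenize(text)
    return len(tokens), Counter(tokens)


# To run on the corpus itself, read the file first, for example:
#   text = open("corpus.txt", encoding="utf-8").read()  # hypothetical file name
sample = "گرمیاں دا موسم چل رہیا سی۔"  # first sentence of the card's sample text
total, freqs = token_stats(sample)
print(total, len(freqs))
```

A different tokenizer (for example, one that splits clitics or handles zero-width joiners) would give different counts, so figures from this sketch are not directly comparable to the token count stated in the description.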
Metadata
Language
Punjabi is an Indo-Aryan language spoken in the Punjab region of Pakistan and India. It is one of the most widely spoken native languages in the world, with approximately 150 million native speakers. The language is written in two scripts: Shāhmukhī, which is based on the Perso-Arabic script and is used mostly in Pakistan, and Gurmukhī, an Indic script used mostly in India.
Content of the Corpus
The corpus contains the following books by multiple authors, written in Western Punjabi:
• Sunan Ibn-e-Maja
• Tafheem-ul-Quran
• Mukhtasar Muslim
• Adhi Kahani Tay Hor Kahaniyan
• Rab di Akh Naal
• Manto Punjabi Wich (Afsany)
• Main Tay Tun Day Afsaany
• Islam da Taaruf
• Punjabi Afsaany Collection
Variants
Punjabi has two main varieties, Eastern and Western. Eastern Punjabi is spoken in Indian Punjab and is written in an Indic script (Gurmukhī), whereas Western Punjabi is spoken in Punjab, Pakistan, and is written in a Perso-Arabic script (Shāhmukhī).
List of Alphabets
ا آ ب پ ٹ ث چ ح د ڈ ذ ڑ ژ س ش ص ط ظ ع غ ف ق ک گ ل م ن ݨ ں و ہ ھ ی ے ئ ء
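One way to use the letter list above is as a character inventory for a quick coverage check. The sketch below counts letters in a text that do not appear in the list as printed in this card. The function name and the choice to test only Unicode letter categories are assumptions made for illustration, not part of the corpus tooling.

```python
import unicodedata
from collections import Counter

# Letter inventory as printed in this card; whitespace is dropped below.
ALPHABET = "ا آ ب پ ٹ ث چ ح د ڈ ذ ڑ ژ س ش ص ط ظ ع غ ف ق ک گ ل م ن ݨ ں و ہ ھ ی ے ئ ء"
INVENTORY = {ch for ch in ALPHABET if not ch.isspace()}


def letters_outside_inventory(text: str) -> Counter:
    """Count letters in `text` that are absent from the card's inventory.

    Only characters in a Unicode letter category (L*) are considered, so
    digits, punctuation, whitespace, and combining marks are ignored.
    """
    return Counter(
        ch
        for ch in text
        if unicodedata.category(ch).startswith("L") and ch not in INVENTORY
    )


sample = "رب دی اکھ نال ویکھو"  # phrase taken from the card's sample text
for ch, n in letters_outside_inventory(sample).most_common():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')} x{n}")
```

Letters reported by this check are not necessarily errors in the text. They may simply indicate that the printed list is not exhaustive, so the output is best read as a prompt for manual review.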
Sample Text
گرمیاں دا موسم چل رہیا سی۔ ایس ورھے امباں دی فصل بڑی چنگی ہوئی سی۔ بازاراں وچ، گلیاں وچ، دکانداراں کول، ریڑھیاں والیاں کول، بندے دا دھیان جدر وی پیندا سی، ہر پاسے امب ای امب وکھالی دیندے سن۔ رب دی اکھ نال ویکھو تے انسان اوہدی بہترین تخلیق اے۔ رب دی اکھ نال ویکھو تے دنیا اینج دی مکمل تربیت گاہ اے جس دی اک مچھر ورگی حقیر شے وی فضول نئیں بنائی گئی۔ بہتر ہووے گا کہ ایس مسئلے تے اسیں گھر اپڑ کے گل کرئیے۔ فی الحال توں تیاری پھڑ، میں آ رہیا واں۔ مسلمان بھرا نال چنگی تے فائدے مند گل کیتی جاوے، تے مسلماناں دیاں لوڑاں نوں پورا کیتا جاوے، تے غریب قرض دار نوں مہلت دتی جاوے، اک دوجے تے قربانی دتی جاوے، تے غم خواری تے تعزیت کیتی جاوے، لوکاں نال ہسدے ہوئے چہرے نال ملیا جاوے۔ یونان دی بت پرستاں والی دھرتی دی گود وچ، اک شاعر دی گیت ورگی سوہنی ہستی پل رہی سی۔۔۔۔۔
