Mediamen Punjabi Literature Corpus
License:
CC-BY-NC-4.0
Steward:
MEDIAMEN
Task: NLP
Release Date: 11/9/2025
Format: TXT
Size: 1.82 MB
Description
This corpus is a collection of one million tokens of Western Punjabi text. The data was produced by the Mediamen publishing agency over the last ten years. The corpus contains works of literature, including short stories, novels, other fiction, non-fiction, and dramas. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.
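As a rough way to check the token count against a downloaded copy, the corpus can be split on whitespace with surrounding punctuation stripped. The following is a minimal sketch under that assumption: the dataset card does not say how the one million tokens were counted, and the file name `corpus.txt`, the UTF-8 encoding, and the helper names `load_corpus`, `tokenize`, and `corpus_stats` are all hypothetical.

```python
from collections import Counter

# Punctuation stripped from token edges: Urdu/Shahmukhi full stop, Arabic
# question mark, comma and semicolon, plus common ASCII and curly-quote marks.
PUNCT = "۔؟،؛!\"'’‘“”().,:;?-"


def load_corpus(path):
    """Read a text file as UTF-8 (the encoding is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def tokenize(text):
    """Split on whitespace and strip punctuation from both ends of each piece."""
    tokens = []
    for piece in text.split():
        token = piece.strip(PUNCT)
        if token:
            tokens.append(token)
    return tokens


def corpus_stats(text):
    """Return the token count, the distinct-type count, and the five most frequent tokens."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "top": counts.most_common(5),
    }


# Example usage with a hypothetical file name:
# stats = corpus_stats(load_corpus("corpus.txt"))
# print(stats["tokens"], stats["types"])
```

Whitespace splitting is only an approximation for Shahmukhi text, where word boundaries are not always marked by spaces, so the resulting figure may differ from the stated one.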
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
The data may not be used by any organization with annual revenue of more than USD 1 million.
Forbidden Usage
The following uses are forbidden:
Generating, promoting, or distributing hate speech, misinformation, or culturally offensive content.
Commercial or for-profit projects without explicit permission from the dataset creator.
Any use that misrepresents or distorts Punjabi literature or the works contained within.
Processes
Ethical Review
The dataset was curated from publicly available or author-shared Punjabi literary sources under ethical self-review by MEDIAMEN. There is no sensitive or copyrighted material in this corpus. The collection aligns with CC-BY-NC-4.0 and principles of cultural respect, transparency, and non-commercial research use.
Intended Use
This dataset is intended for research, education, and non-commercial use in NLP, computational linguistics, and digital humanities. It may be used for training, fine-tuning, or evaluating models for Punjabi (Shahmukhi) language processing, and for linguistic and literary analysis supporting cultural preservation.
Metadata
Language
Punjabi is an Indo-Aryan language spoken in the Punjab region of Pakistan and India. It is one of the most widely spoken native languages in the world, with approximately 150 million native speakers. The language is written in two scripts: Shāhmukhī, which is based on the Perso-Arabic script and mostly used in Pakistan, and Gurmukhī, an Indic script mostly used in India.
Content of the Corpus
The corpus contains the following books by multiple authors, all written in Western Punjabi:
Adyan wichkar gallan bataan
Ahmed Nadeem Qasmi Stories - Punjabi Trans.
Alghizu ul Fikri
Alaao Novel Trans.
Allah dy Ik hon da matlab
Anjany Khauf Afsany Trans.
Apny dukh menu dy deo Trans.
Deen diyan gallan
Urdu Chunvy Afsaniyan da Punjabi Tarjama
Magharbi Tamadun di Ik Jhalak
Mumtaz Mufti dy afsaniyan da Punjabi Tarjama
Noori Afsana
Haneriyan diyan kahaniyan
Bacheyan Lai Kahaniyan
Punjabi Stories
Punjabi Stories for Children
Asli Shehad tey hoor kahaniyan
Apny Hamsafar Punjabi Trans.
Punjabi Dramy
Punjabi Dramy-1
Putar Sun Kahani
Ag e Ag Punjabi Drama
Aan Zuban ty jan ty hor afsaniyan da Punjabi tarjama
Variants
Punjabi has two main varieties, Eastern and Western. Eastern Punjabi is spoken in Punjab, India, and is written in an Indic script (Gurmukhī), whereas Western Punjabi is spoken in Punjab, Pakistan, and is written in a Perso-Arabic script (Shāhmukhī).
List of Alphabets
ا آ ب پ ٹ ث چ ح د ڈ ذ ڑ ژ س ش ص ط ظ ع غ ف ق ک گ ل لؕ م ن ݨ ں و ہ ھ ی ے ئ ء
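One way to use this list is to check which letters in a text fall outside it. The sketch below copies the list as printed above and counts every letter or combining mark that is not in it. The names `ALPHABET_LINE`, `ALPHABET`, and `out_of_alphabet` are hypothetical, and the approach assumes that a per-code-point comparison is adequate, so diacritics such as zer, zabar, and pesh are reported as outside the list.

```python
import unicodedata
from collections import Counter

# The letter list as printed in this dataset card (copied verbatim).
ALPHABET_LINE = "اآ ب پ ٹ ث چ ح د ڈ ذ ڑ ژ س ش ص ط ظ ع غ ف ق ک گ ل لؕ م ن ݨ ں و ہ ھ ی ے ئ ء"

# Every code point in the list, ignoring the separating spaces. This also
# picks up the combining mark that is part of the "لؕ" entry.
ALPHABET = set(ALPHABET_LINE.replace(" ", ""))


def out_of_alphabet(text):
    """Count letters (category L*) and marks (category M*) not found in ALPHABET."""
    counts = Counter()
    for ch in text:
        is_letter_or_mark = unicodedata.category(ch)[0] in ("L", "M")
        if is_letter_or_mark and ch not in ALPHABET:
            counts[ch] += 1
    return counts


# Example usage: list the most frequent out-of-list characters in some text.
# for ch, n in out_of_alphabet(some_text).most_common(10):
#     print(repr(ch), unicodedata.name(ch, "UNKNOWN"), n)
```

Characters reported by such a check are not necessarily errors: they may be letters missing from the printed list, diacritics, or borrowed characters, so the output is best treated as a starting point for manual review.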
Sample Text
اک روز آپا نوں چھیڑن دی خاطر میں بدو توں پُچھیا۔ ’’بدو بھلا بُجھو تے اوہ کشتی جیہڑی آپا دے پچھے پئی اے، اوہدے وچ کیہ اے؟ ایتھوں دور میداناں وچ، یا فیر پہاڑاں وچ جتھے ایہہ کشتی تہانوں پھڑ نہ سکے۔ اوہ اک مدت توں بعد فیر واپس آ گیا سی۔ ویلا لنگھدا گیا، پتہ نئیں کِنے ورھے لنگھ گئے۔ میری سوچ ایہہ اے کہ ایس چمک توں شاید کوئی راز ہووے، جیہڑا مینوں دریافت کرنا اے۔
