Mediamen Punjabi Literature Corpus

License:

CC-BY-NC-4.0

Steward:

MEDIAMEN

Task: NLP

Release Date: 11/9/2025

Format: TXT

Size: 1.82 MB


Description

This corpus is a collection of one million tokens of the Western Punjabi language. The data was produced under the Mediamen publishing agency over the last ten years. The corpus contains works of literature, including short stories, novels, fiction, non-fiction, and dramas. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

The data cannot be used by any organization with an annual revenue of more than one million USD.

Forbidden Usage

Generating, promoting, or distributing hate speech, misinformation, or culturally offensive content. Commercial or for-profit projects without explicit permission from the dataset creator. Any use that misrepresents or distorts Punjabi literature or the works contained within.

Processes

Ethical Review

The dataset was curated from publicly available or author-shared Punjabi literary sources under ethical self-review by MEDIAMEN. There is no sensitive or copyrighted material in this corpus. The collection aligns with CC-BY-NC-4.0 and principles of cultural respect, transparency, and non-commercial research use.

Intended Use

This dataset is intended for research, education, and non-commercial use in NLP, computational linguistics, and digital humanities. It may be used for training, fine-tuning, or evaluating models for Punjabi (Shahmukhi) language processing, and for linguistic and literary analysis supporting cultural preservation.
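
As a starting point for such processing, the sketch below reads a plain-text file and counts whitespace-delimited tokens. The file name `corpus.txt`, the punctuation set, and the function names are assumptions for illustration. The corpus does not document its file layout or the tokenization behind the one-million-token figure, so counts from this sketch may differ.

```python
import unicodedata
from collections import Counter
from pathlib import Path

# Punctuation to strip from token edges: Urdu/Arabic marks plus common
# ASCII and curly quotes. This set is an assumption, not part of the corpus.
EDGE_PUNCT = "۔،؛؟!\"'’‘“”().,:;?-"


def tokenize(text: str) -> list[str]:
    """Split on whitespace after NFC normalization; drop edge punctuation."""
    normalized = unicodedata.normalize("NFC", text)
    stripped = (token.strip(EDGE_PUNCT) for token in normalized.split())
    return [token for token in stripped if token]


def token_counts(text: str) -> Counter:
    """Return a frequency table of the tokens in the text."""
    return Counter(tokenize(text))


def count_tokens_in_file(path: str) -> int:
    """Count tokens in a UTF-8 text file (the path is hypothetical)."""
    return len(tokenize(Path(path).read_text(encoding="utf-8")))
```

For example, `count_tokens_in_file("corpus.txt")` would return the token count under this simple whitespace scheme, and `token_counts(text).most_common(20)` would list the most frequent tokens.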

Metadata

Language

Punjabi is an Indo-Aryan language spoken in the Punjab region of Pakistan and India. It is one of the most widely spoken native languages in the world, with approximately 150 million native speakers. The language is written in two scripts: Shāhmukhī, based on the Perso-Arabic script and mostly used in Pakistan, and Gurmukhī, inspired by Indic scripts and mostly used in India.

Content of the Corpus

The corpus contains the following books by multiple authors, written in Western Punjabi:

  • Adyan wichkar gallan bataan

  • Ahmed Nadeem Qasmi Stories - Punjabi Trans.

  • Alghizu ul Fikri

  • Alaao Novel Trans.

  • Allah dy Ik hon da matlab

  • Anjany Khauf Afsany Trans.

  • Apny dukh menu dy deo Trans.

  • Deen diyan gallan

  • Urdu Chunvy Afsaniyan da Punjabi Tarjama

  • Magharbi Tamadun di Ik Jhalak

  • Mumtaz Mufti dy afsaniyan da Punjabi Tarjama

  • Noori Afsana

  • Haneriyan diyan kahaniyan

  • Bacheyan Lai Kahaniyan

  • Punjabi Stories

  • Punjabi Stories for Children

  • Asli Shehad tey hoor kahaniyan

  • Apny Hamsafar Punjabi Trans.

  • Punjabi Dramy

  • Punjabi Dramy-1

  • Putar Sun Kahani

  • Ag e Ag Punjabi Drama

  • Aan Zuban ty jan ty hor afsaniyan da Punjabi tarjama

Variants

Punjabi has two main varieties, Eastern and Western. Eastern Punjabi is spoken in Indian Punjab and uses an Indic script, whereas Western Punjabi is spoken in Punjab, Pakistan, and uses the Perso-Arabic script.

List of Alphabets

اآ ب پ ٹ ث چ ح د ڈ ذ ڑ ژ س ش ص ط ظ ع غ ف ق ک گ ل لؕ م ن ݨ ں و ہ ھ ی ے ئ ء
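
A letter list like this can be used to audit text for characters that fall outside it. The following is a minimal sketch under stated assumptions: the list is copied as printed above and treated as a set of code points, only characters in Unicode letter categories are checked, and the function name `unlisted_letters` is hypothetical. It is not a validation tool shipped with the corpus.

```python
import unicodedata
from collections import Counter

# Letter list copied as printed in this card; spaces are separators only.
LISTED = "اآ ب پ ٹ ث چ ح د ڈ ذ ڑ ژ س ش ص ط ظ ع غ ف ق ک گ ل لؕ م ن ݨ ں و ہ ھ ی ے ئ ء"
LISTED_LETTERS = set("".join(LISTED.split()))


def unlisted_letters(text: str) -> Counter:
    """Count letters (Unicode category L*) that are absent from the list.

    Digits, punctuation, whitespace, and combining marks are ignored.
    """
    counts = Counter()
    for ch in unicodedata.normalize("NFC", text):
        if unicodedata.category(ch).startswith("L") and ch not in LISTED_LETTERS:
            counts[ch] += 1
    return counts
```

Running `unlisted_letters` over a text file's contents would report, with counts, every letter that does not appear in the list as printed here, which can help spot mixed-script passages or letters the list omits.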

Sample Text

اک روز آپا نوں چھیڑن دی خاطر میں بدو توں پُچھیا۔ ’’بدو بھلا بُجھو تے اوہ کشتی جیہڑی آپا دے پچھے پئی اے، اوہدے وچ کیہ اے؟ ایتھوں دور میداناں وچ، یا فیر پہاڑاں وچ جتھے ایہہ کشتی تہانوں پھڑ نہ سکے۔ اوہ اک مدت توں بعد فیر واپس آ گیا سی۔ ویلا لنگھدا گیا، پتہ نئیں کِنے ورھے لنگھ گئے۔ میری سوچ ایہہ اے کہ ایس چمک توں شاید کوئی راز ہووے، جیہڑا مینوں دریافت کرنا اے۔