English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)
License: CC-BY-NC-4.0
Steward: MEDIAMEN
Task: MT
Release Date: 1/16/2026
Format: CSV
Size: 1.08 MB
Description
This parallel sentence corpus contains 30,405 aligned sentence pairs, approximately 0.62 million tokens in total, curated from the archival materials of Mediamen (Advertising Agency). The sentences were professionally translated from English into Punjabi (Shahmukhi) and are intended to support machine translation, linguistic research, and Punjabi language technology development, particularly for real-world and contemporary language use.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Forbidden Usage
* Generating or promoting hate speech, misinformation, or culturally offensive content
* Any commercial or for-profit use without explicit permission
* Any use that misrepresents the Punjabi language or cultural context
Processes
Ethical Review
The dataset was curated and translated under an ethical self-review process. It contains no sensitive, personal, or unauthorized copyrighted material and is released under the CC-BY-NC-4.0 license, upholding principles of cultural respect, transparency, and responsible research use.
Metadata
Dataset Overview
This corpus consists of sentence-level content originally produced in professional and public-facing contexts, reflecting modern, practical language usage. All translations were curated and reviewed under an ethical self-review process. The dataset is released exclusively for research and non-commercial use and provides valuable material for studying cross-lingual alignment, translation strategies, stylistic adaptation, and applied NLP for low-resource languages.
Languages Included
English
English serves as the source language in this dataset and represents contemporary usage commonly found in media, advertising, and public communication. Its inclusion supports international accessibility and cross-lingual research.
Punjabi
Punjabi is a major Indo-Aryan language spoken by millions in Pakistan and India. The language has a strong oral and written literary tradition and plays a central role in the cultural life of the Punjab region.
Variant
Punjabi has two main varieties: Eastern Punjabi and Western Punjabi. Eastern Punjabi is primarily spoken in the Indian Punjab and is written in the Gurmukhi script. Western Punjabi is spoken in Punjab, Pakistan, and is written in the Perso-Arabic (Shahmukhi) script. This dataset represents the Western Punjabi variety in Shahmukhi.
List of Alphabets (Shahmukhi)
ا آ ب پ ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن ں و ہ ھ ی ے ئ ء
Dataset Statistics
Sentence Pairs: 30,405
Total Token Count: ~0.62 million
Translation Direction: English → Punjabi (Shahmukhi)
Source: Mediamen (Advertising, Printing & Publishing Agency) archives
Content Type: Parallel sentences
Type of Sentences: Public-facing and professional text (advertising copy, product/service descriptions, promotional lines, short informational statements, and corporate/brand messaging)
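The pair and token counts above can be re-derived from the CSV with a few lines of Python. The sketch below is illustrative only: the column names `english` and `punjabi` are assumptions (the actual CSV header is not documented here), and it counts whitespace-delimited tokens, which may differ from the tokenization behind the reported ~0.62 million figure.

```python
import csv
import io

def corpus_stats(csv_text, src_col="english", tgt_col="punjabi"):
    """Count sentence pairs and whitespace-delimited tokens on both sides.

    src_col and tgt_col are hypothetical column names; adjust them to the
    header actually present in the released CSV.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    pairs = src_tokens = tgt_tokens = 0
    for row in reader:
        pairs += 1
        src_tokens += len(row[src_col].split())
        tgt_tokens += len(row[tgt_col].split())
    return {
        "pairs": pairs,
        "src_tokens": src_tokens,
        "tgt_tokens": tgt_tokens,
        "total_tokens": src_tokens + tgt_tokens,
    }

# Tiny in-memory sample standing in for the real file.
sample = (
    "english,punjabi\n"
    "His experience qualifies him to do the job.,"
    "اوہدا تجربہ اوہنوں کم کرن دا اہل بناندا اے۔\n"
)
stats = corpus_stats(sample)
print(stats)
```

For the real file, read it with `open(path, encoding="utf-8", newline="")` and pass the text (or adapt the function to take a file object) so that Shahmukhi characters are decoded correctly.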
Processing (recommended)
Unicode + formatting cleanup: apply NFC normalization, standardize whitespace, and fix punctuation spacing (especially around commas, quotes, and brackets).
Shahmukhi normalization: keep spellings consistent (e.g., ی/ے, ں vs ن, ہ/ھ usage) and remove accidental Latin digits/letters if they appear.
Alignment and quality checks: confirm each English sentence matches its Punjabi pair; flag pairs with missing text, severe length mismatch, or obvious mistranslation.
De-duplication + boilerplate removal: remove repeated agency taglines, templates, and near-duplicate lines so models don’t overfit to copy-paste patterns.
PII sweep (if any): detect phone numbers, emails, addresses, or person names in ads and mask/redact before release.
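Several of the recommended steps above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the steward's own pipeline: the function names, the `<EMAIL>`/`<PHONE>` placeholders, the regular expressions, and the length-ratio threshold of 3.0 are all hypothetical choices. It covers NFC normalization, whitespace and punctuation spacing, exact de-duplication, simple alignment flags, and masking of emails and phone numbers. It does not attempt Shahmukhi spelling normalization, near-duplicate detection, mistranslation detection, or detection of person names and addresses.

```python
import re
import unicodedata

# Illustrative patterns only; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def clean(text):
    """NFC-normalize, collapse whitespace, drop spaces before punctuation."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Covers Latin punctuation plus the Arabic-script comma, full stop, and question mark.
    return re.sub(r"\s+([,.;:!?،۔؟])", r"\1", text)

def mask_pii(text):
    """Replace emails and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)

def flag_pair(en, pa, max_ratio=3.0):
    """Return reasons a pair needs manual review (empty list if none)."""
    if not en or not pa:
        return ["missing_text"]
    flags = []
    n_en, n_pa = len(en.split()), len(pa.split())
    if max(n_en, n_pa) / min(n_en, n_pa) > max_ratio:
        flags.append("length_mismatch")
    if re.search(r"[A-Za-z0-9]", pa):
        flags.append("latin_in_target")
    return flags

def process(pairs):
    """Clean, de-duplicate, flag, and mask (English, Punjabi) pairs."""
    seen, kept, flagged = set(), [], []
    for en, pa in pairs:
        en, pa = clean(en), clean(pa)
        key = (en.casefold(), pa)
        if key in seen:  # exact duplicate after cleanup
            continue
        seen.add(key)
        reasons = flag_pair(en, pa)  # checked before masking adds placeholders
        en, pa = mask_pii(en), mask_pii(pa)
        if reasons:
            flagged.append((en, pa, reasons))
        else:
            kept.append((en, pa))
    return kept, flagged

pairs = [
    ("His experience qualifies him to do the job.",
     "اوہدا تجربہ اوہنوں کم کرن دا اہل بناندا اے۔"),
    # Same pair with stray spacing; collapses to a duplicate after cleanup.
    ("His  experience qualifies him to do the job .",
     "اوہدا تجربہ اوہنوں کم کرن دا اہل بناندا اے ۔"),
    # Missing target text; also contains a phone number to mask.
    ("Call us at +92 300 1234567.", ""),
]
kept, flagged = process(pairs)
```

Flagged pairs are returned separately rather than dropped, so a reviewer can decide whether each one is a genuine misalignment or a legitimate case such as a brand name kept in Latin script.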
Sentence Pairs (English → Punjabi, Shahmukhi)
| English | Punjabi (Shahmukhi) |
|---|---|
| His mobile phone produced radio emissions that interfered with other phones. | اوہدے موبائل فون توں نکلن والیاں ریڈیو لہراں نے دوجے فوناں لئی مسئلہ پیدا کیتا۔ |
| Some board members questioned his ability to run the corporation. | کجھ بورڈ ممبراں نے اوہدی کارپوریشن چلان دی قابلیت تے سوال چکیا۔ |
| His experience qualifies him to do the job. | اوہدا تجربہ اوہنوں کم کرن دا اہل بناندا اے۔ |
| You must make allowance for his inexperience. | تہانوں اوہدے ناتجربہ کار ہون دا خیال رکھنا چاہیدا اے۔ |
| Admitting his lack of experience, I still think that he ought to do better. | اوہدے ناتجربہ کار ہون دے باوجود، مینوں اجے وی لگدا اے کہ اوہنوں بہتر کرنا چاہیدا اے۔ |
| His background parallels that of his predecessor. | اوہدا پس منظر اوہدے توں پہلاں والے ورگا ہی اے۔ |