English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)

License:

CC-BY-NC-4.0

Steward:

MEDIAMEN

Task: MT

Release Date: 1/16/2026

Format: CSV

Size: 1.08 MB


Description

This parallel corpus contains 30,405 aligned sentence pairs, approximately 0.62 million tokens in total, curated from the archival materials of Mediamen (Advertising Agency). The sentences were professionally translated from English into Punjabi (Shahmukhi) and are intended to support machine translation, linguistic research, and Punjabi language technology development, particularly for real-world, contemporary language use.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Forbidden Usage

  • Generating or promoting hate speech, misinformation, or culturally offensive content

  • Any commercial or for-profit use without explicit permission

  • Any use that misrepresents the Punjabi language or cultural context

Processes

Ethical Review

The dataset was curated and translated under an ethical self-review process. It contains no sensitive, personal, or unauthorized copyrighted material and is released under the CC-BY-NC-4.0 license, upholding principles of cultural respect, transparency, and responsible research use.

Metadata

Dataset Overview

This corpus consists of sentence-level content originally produced in professional and public-facing contexts, reflecting modern, practical language usage. All translations were curated and reviewed under an ethical self-review process. The dataset is released exclusively for research and non-commercial use and provides valuable material for studying cross-lingual alignment, translation strategies, stylistic adaptation, and applied NLP for low-resource languages.

Languages Included

English

English serves as the source language in this dataset and represents contemporary usage commonly found in media, advertising, and public communication. Its inclusion supports international accessibility and cross-lingual research.

Punjabi

Punjabi is a major Indo-Aryan language spoken by millions in Pakistan and India. The language has a strong oral and written literary tradition and plays a central role in the cultural life of the Punjab region.

Variant

Punjabi has two main varieties: Eastern Punjabi and Western Punjabi. Eastern Punjabi is spoken primarily in Indian Punjab and is written in the Gurmukhi script, while Western Punjabi is spoken in Punjab, Pakistan, and is written in the Perso-Arabic (Shahmukhi) script. This dataset represents the Western variety, written in Shahmukhi.

List of Alphabets (Shahmukhi)

ا آ ب پ ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن ں و ہ ھ ی ے ئ ء

Dataset Statistics

  • Sentence Pairs: 30,405

  • Total Token Count: ~0.62 million

  • Translation Direction: English → Punjabi (Shahmukhi)

  • Source: Mediamen (Advertising, Printing & Publishing Agency) archives

  • Content Type: Parallel sentences

  • Type of Sentences: Public-facing and professional text (advertising copy, product/service descriptions, promotional lines, short informational statements, and corporate/brand messaging)
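The figures above can be rechecked against the released CSV with a short script. The following is a minimal sketch, not the steward's counting method: the column names `English` and `Punjabi (Shahmukhi)` are assumptions taken from the sample table heading, tokens are counted by splitting on whitespace, and it is not stated whether the ~0.62 million figure covers one side or both.

```python
import csv
import io


def corpus_stats(csv_file, src_col="English", tgt_col="Punjabi (Shahmukhi)"):
    """Count sentence pairs and whitespace-delimited tokens on each side.

    The column names are assumptions; adjust them to the real CSV header.
    """
    reader = csv.DictReader(csv_file)
    pairs = src_tokens = tgt_tokens = 0
    for row in reader:
        pairs += 1
        src_tokens += len(row[src_col].split())
        tgt_tokens += len(row[tgt_col].split())
    return {
        "pairs": pairs,
        "src_tokens": src_tokens,
        "tgt_tokens": tgt_tokens,
        "total_tokens": src_tokens + tgt_tokens,
    }


# Tiny in-memory sample standing in for the real file.
sample = io.StringIO(
    "English,Punjabi (Shahmukhi)\n"
    "His experience qualifies him to do the job.,"
    "اوہدا تجربہ اوہنوں کم کرن دا اہل بناندا اے۔\n"
)
print(corpus_stats(sample))
```

For the actual file, pass a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the CSV was saved. A different tokenizer will give different totals, so small deviations from the published count are expected.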

Processing (recommended)

  • Unicode + formatting cleanup: apply NFC normalization, standardize whitespace, and fix punctuation spacing (especially around commas, quotes, and brackets).

  • Shahmukhi normalization: keep spellings consistent (e.g., ی/ے, ں vs ن, ہ/ھ usage) and remove accidental Latin digits/letters if they appear.

  • Alignment and quality checks: confirm each English sentence matches its Punjabi pair; flag pairs with missing text, severe length mismatch, or obvious mistranslation.

  • De-duplication + boilerplate removal: remove repeated agency taglines, templates, and near-duplicate lines so models don’t overfit to copy-paste patterns.

  • PII sweep (if any): detect phone numbers, emails, addresses, or person names in ads and mask/redact before release.
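Several of the recommended steps can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the pipeline used to build the corpus: the function names, the length-ratio threshold of 3.0, the `<EMAIL>`/`<PHONE>` placeholders, and the regular expressions are all hypothetical choices. It covers NFC normalization, whitespace cleanup, removal of spaces before punctuation, exact de-duplication, simple flagging, and email/phone masking. It does not attempt Shahmukhi spelling normalization, near-duplicate detection, mistranslation detection, or name and address redaction, which need editorial rules or dedicated tools.

```python
import re
import unicodedata

# Heuristic patterns (assumptions); they will miss cases and can over-match.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
LATIN_RE = re.compile(r"[A-Za-z]")


def clean(text):
    """NFC-normalize, collapse whitespace, drop spaces before punctuation."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([,.;:!?،۔؟])", r"\1", text)


def mask_pii(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


def flag_pair(en, pa, max_ratio=3.0):
    """Return issue labels for one cleaned pair (empty list if none)."""
    if not en or not pa:
        return ["missing_text"]
    issues = []
    n_en, n_pa = len(en.split()), len(pa.split())
    if max(n_en, n_pa) / min(n_en, n_pa) > max_ratio:
        issues.append("length_mismatch")
    if LATIN_RE.search(pa):
        issues.append("latin_in_punjabi")
    return issues


def process(pairs):
    """Clean, de-duplicate, flag, and mask an iterable of (en, pa) pairs."""
    seen = set()
    kept, flagged = [], []
    for en, pa in pairs:
        en, pa = clean(en), clean(pa)
        key = (en.casefold(), pa)  # exact duplicates after normalization
        if key in seen:
            continue
        seen.add(key)
        issues = flag_pair(en, pa)  # flag before masking adds Latin tags
        record = (mask_pii(en), mask_pii(pa), issues)
        (flagged if issues else kept).append(record)
    return kept, flagged
```

Flagged pairs are returned separately rather than dropped, so a reviewer can decide whether each one is a real alignment problem. The phone pattern also matches other long digit runs such as date ranges, so masked output should be spot-checked.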

Sentence Pairs (English → Punjabi, Shahmukhi)

English | Punjabi (Shahmukhi)
His mobile phone produced radio emissions that interfered with other phones. | اوہدے موبائل فون توں نکلن والیاں ریڈیو لہراں نے دوجے فوناں لئی مسئلہ پیدا کیتا۔
Some board members questioned his ability to run the corporation. | کجھ بورڈ ممبراں نے اوہدی کارپوریشن چلان دی قابلیت تے سوال چکیا۔
His experience qualifies him to do the job. | اوہدا تجربہ اوہنوں کم کرن دا اہل بناندا اے۔
You must make allowance for his inexperience. | تہانوں اوہدے ناتجربہ کار ہون دا خیال رکھنا چاہیدا اے۔
Admitting his lack of experience, I still think that he ought to do better. | اوہدے ناتجربہ کار ہون دے باوجود، مینوں اجے وی لگدا اے کہ اوہنوں بہتر کرنا چاہیدا اے۔
His background parallels that of his predecessor. | اوہدا پس منظر اوہدے توں پہلاں والے ورگا ہی اے۔