Saraiki-English Parallel Corpus
License:
CC-BY-NC-4.0
Steward:
Kaleem Art Press
Task: MT
Release Date: 3/3/2026
Format: CSV
Size: 1.92 MB
Description
This English–Saraiki Parallel Corpus is a curated bilingual dataset of 51,447 aligned sentence pairs (about 0.89 million words in total). The sentences were translated from English into Saraiki by Kaleem Art Press and cleaned into a consistent sentence-level format for reliable alignment. The corpus is designed to support machine translation training and evaluation, bilingual lexicon and terminology work, and broader linguistic and NLP research for Saraiki, including data-driven language technology development.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
The dataset may not be used by organizations with annual revenue exceeding USD 1 million.
Forbidden Usage
• Generating, promoting, or distributing hate speech, misinformation, or culturally offensive content.
• Commercial or for-profit projects without explicit permission from the dataset creator.
• Any use that misrepresents or distorts Saraiki literature or the works contained within.
Metadata
Languages
Saraiki
Saraiki (سرائیکی) is an Indo-Aryan language spoken by millions across southern Punjab and parts of Sindh, Khyber Pakhtunkhwa, and Balochistan in Pakistan. It has a distinct linguistic and cultural identity, supported by a rich literary tradition in poetry and prose. While it shares features with both Punjabi and Sindhi, Saraiki remains unique in its phonology and vocabulary. The language is written using a Perso-Arabic script.
English
English serves as the source language for this dataset. It represents contemporary, general-purpose written usage and supports cross-lingual research, international accessibility, and English → Saraiki translation modeling.
Saraiki Alphabet
آ ا ب ٻ پ ت ٹ ث ج ڄ چ ح خ د ڈ ݙ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ڳ ل م ن ں ݨ و ہ ھ ی ے
Content of the Corpus
The dataset consists of general-purpose sentences from Kaleem Art Press archives, professionally translated from English into Saraiki. It is designed to support research, language technology development, and the preservation of Saraiki linguistic resources.
Details of the Dataset
This corpus is a bilingual English–Saraiki parallel dataset containing 51,447 professionally translated sentence pairs (English → Saraiki) drawn from general-purpose sentences in the Kaleem Art Press archival texts, totaling approximately 0.89 million words across both languages. It is written in English and Saraiki (Perso-Arabic script) and is intended to support research, language technology development, and the preservation of Saraiki linguistic resources.
Dataset Statistics
Sentence Pairs: 51,447
Total Words (English + Saraiki): ~0.89 million
Translation Direction: English → Saraiki
Source: Kaleem Art Press archives
Content Type: Parallel, general-purpose sentences
Script: English (Latin), Saraiki (Perso-Arabic)
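The figures above can be rechecked on a local copy of the data. The sketch below counts sentence pairs and words for a list of (English, Saraiki) tuples. It assumes words are split on whitespace; this card does not state how the ~0.89 million figure was computed, so totals may differ slightly. The function name `corpus_stats` is illustrative, not part of the dataset.

```python
def corpus_stats(pairs):
    """Count sentence pairs and whitespace-separated words on each side.

    Assumption: a "word" is any whitespace-delimited token. The dataset card
    does not document its own counting method.
    """
    english_words = sum(len(english.split()) for english, _ in pairs)
    saraiki_words = sum(len(saraiki.split()) for _, saraiki in pairs)
    return {
        "sentence_pairs": len(pairs),
        "english_words": english_words,
        "saraiki_words": saraiki_words,
        "total_words": english_words + saraiki_words,
    }


# Two pairs taken from the sample table in this card.
sample_pairs = [
    ("What's your question?", "تُہاݙا سوال کیا ہِے؟"),
    ("Appointment as a member of Honour", "عِزّت دے ممبر دے طور تے تقرری"),
]
stats = corpus_stats(sample_pairs)
print(stats)
```

Run over the full corpus, `sentence_pairs` should come to 51,447 if the file matches the statistics listed here.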
Processing (recommended)
Unicode normalization: apply consistent Unicode normalization (e.g., NFC) and standardize whitespace and punctuation.
Saraiki script consistency: normalize common spelling variants and ensure consistent use of Saraiki letters (e.g., ݙ, ڄ, ڳ, ں).
Alignment checks: verify that each English sentence matches its Saraiki translation; flag missing text, severe length mismatch, or misaligned pairs.
De-duplication: remove exact and near-duplicate sentence pairs to reduce repetition and improve training quality.
Basic filtering: optionally remove corrupted lines, mixed-script noise, or non-linguistic artifacts introduced during compilation.
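The recommended steps above can be sketched as a small cleaning pass using only the Python standard library. This is a minimal illustration, not the procedure used to build the corpus. The function names, the Unicode ranges used to detect Perso-Arabic text, and the length-ratio threshold of 3.0 are all assumptions chosen for the example. It covers NFC normalization, whitespace cleanup, basic alignment checks, and exact de-duplication; spelling-variant normalization and near-duplicate detection are not attempted here.

```python
import re
import unicodedata

# Assumption: Saraiki text falls in the Arabic (U+0600-U+06FF) and
# Arabic Supplement (U+0750-U+077F) blocks, which include letters such as ݙ and ݨ.
ARABIC_SCRIPT = re.compile(r"[\u0600-\u06FF\u0750-\u077F]")
LATIN = re.compile(r"[A-Za-z]")


def normalize(text):
    """Apply NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def is_plausible_pair(english, saraiki, max_ratio=3.0):
    """Reject empty sides, wrong-script sides, and severe length mismatch.

    max_ratio is an illustrative threshold on character lengths, not a
    value taken from the dataset card.
    """
    if not english or not saraiki:
        return False
    if not LATIN.search(english) or not ARABIC_SCRIPT.search(saraiki):
        return False
    longer = max(len(english), len(saraiki))
    shorter = min(len(english), len(saraiki))
    return longer / shorter <= max_ratio


def clean_pairs(pairs):
    """Normalize, filter, and drop exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for english, saraiki in pairs:
        english, saraiki = normalize(english), normalize(saraiki)
        if not is_plausible_pair(english, saraiki):
            continue
        key = (english, saraiki)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned


raw_pairs = [
    ("What's your  question?", "تُہاݙا سوال کیا ہِے؟"),  # extra space
    ("What's your question?", "تُہاݙا سوال کیا ہِے؟"),   # duplicate after cleanup
    ("Appointment as a member of Honour", ""),           # missing translation
]
cleaned = clean_pairs(raw_pairs)
print(len(cleaned))
```

Pairs that fail the checks are dropped silently here; in practice it may be more useful to write them to a separate file for manual review, since a character-length ratio is only a rough signal of misalignment.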
English–Saraiki Sentence Pairs
| English | Saraiki |
|---|---|
| What's your question? | تُہاݙا سوال کیا ہِے؟ |
| Even though it's not an exact copy of the painting, it contains the same elements. | بھان٘ویں جو اِیہ فن پارے دی سان٘ویں نقل کائے نِھیں، اِین٘دے وِچ اُوہے عنصر رَلّے ہوئے ہِن۔ |
| Pick-Up from a Post Parcel Point in your City | آپݨے شہر وِچ ڈاک پارسل جاء کنوں گِھنّو |
| Appointment as a member of Honour | عِزّت دے ممبر دے طور تے تقرری |
| so is two minutes to listen? | تاں سُݨن کِیتے ݙُو مِنٹ ہِن؟ |
| can you spell that for me please? | بَھلا تُساں میݙے کِیتے اِین٘دے ہِجّے کر سڳدے ہِیوے؟ |
| In the Replenishment Type Drop down, display all the Replenishment Types Configured. | وَلا بھرݨ دی ونکی ڈراپ ڈاؤن وِچ، وَلا بھرݨ دیاں ساریاں ونکیاں کوں کنفیگرڈ ݙِکھاؤ۔ |
| The latin root mater means mother, find a word in paragraph 2 with the root mater | لاطنی روٹ میٹر دا مطلب امّاں ہِے، روٹ میٹر دے نال پیراگراف 2 وِچ ہِک لوّظ لبّھو۔ |
| you do have a lunch nap | تُساں ݙوپہراں دے کھاݨے دے بعد قیلولہ کِیتا |
| recognize the differences between prokaryotic and eukaryotic cells.2. | پروکاریوٹک اَتے یوکاریوٹک خلیاں دے وِچالے وِتّھی کوں پِچھاݨو۔2۔ |
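
Since the corpus is distributed as CSV, pairs like those above can be read with Python's standard `csv` module. The sketch below assumes the file has a header row with columns named `English` and `Saraiki`, matching the sample table; the actual column names in the released file are not stated in this card and may differ. The function name `read_pairs` is illustrative.

```python
import csv
import io


def read_pairs(handle, english_col="English", saraiki_col="Saraiki"):
    """Read (English, Saraiki) tuples from an open CSV file handle.

    The column names are assumptions based on the sample table headers.
    """
    reader = csv.DictReader(handle)
    return [(row[english_col], row[saraiki_col]) for row in reader]


# In-memory stand-in for the real file, using one pair from the sample table.
sample_csv = io.StringIO(
    "English,Saraiki\n"
    "What's your question?,تُہاݙا سوال کیا ہِے؟\n"
)
pairs = read_pairs(sample_csv)
print(len(pairs))
```

For a file on disk, open it with `open(path, encoding="utf-8", newline="")` so that the Perso-Arabic text and any quoted line breaks are handled correctly, then pass the handle to `read_pairs`.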