Saraiki-English Parallel Corpus

License: CC-BY-NC-4.0

Steward: Kaleem Art Press

Task: Machine Translation (MT)

Release Date: 3/3/2026

Format: CSV

Size: 1.92 MB


Description

This English–Saraiki parallel corpus is a curated bilingual dataset of 51,447 aligned sentence pairs (about 0.89 million words in total). Kaleem Art Press translated the sentences from English into Saraiki and cleaned them into a consistent sentence-level format for reliable alignment. The corpus is designed to support machine translation training and evaluation, bilingual lexicon and terminology work, and broader linguistic and NLP research for Saraiki, including data-driven language technology development.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

The dataset may not be used by organizations with annual revenue exceeding USD 1 million.

Forbidden Usage

  • Generating, promoting, or distributing hate speech, misinformation, or culturally offensive content.

  • Commercial or for-profit projects without explicit permission from the dataset creator.

  • Any use that misrepresents or distorts Saraiki literature or the works contained within.

Metadata

Languages

Saraiki

Saraiki (سرائیکی) is an Indo-Aryan language spoken by millions across southern Punjab and parts of Sindh, Khyber Pakhtunkhwa, and Balochistan in Pakistan. It has a distinct linguistic and cultural identity, supported by a rich literary tradition in poetry and prose. While it shares features with both Punjabi and Sindhi, Saraiki remains unique in its phonology and vocabulary. The language is written using a Perso-Arabic script.

English

English serves as the source language for this dataset. It represents contemporary, general-purpose written usage and supports cross-lingual research, international accessibility, and English → Saraiki translation modeling.

Saraiki Alphabet

آ ا ب ٻ پ ت ٹ ث ج ڄ چ ح خ د ڈ ݙ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ڳ ل م ن ں ݨ و ہ ھ ی ے

Content of the Corpus

The dataset consists of general-purpose sentences from Kaleem Art Press archives, professionally translated from English into Saraiki. It is designed to support research, language technology development, and the preservation of Saraiki linguistic resources.

Details of the Dataset

The corpus contains 51,447 professionally translated sentence pairs (English → Saraiki) drawn from general-purpose sentences in the Kaleem Art Press archives, totaling approximately 0.89 million words across both languages. The English text is in Latin script and the Saraiki text is in Perso-Arabic script.

Dataset Statistics

  • Sentence Pairs: 51,447

  • Total Words (English + Saraiki): ~0.89 million

  • Translation Direction: English → Saraiki

  • Source: Kaleem Art Press archives

  • Content Type: Parallel, general-purpose sentences

  • Script: English (Latin), Saraiki (Perso-Arabic)
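
The pair and word counts above can be re-derived from the CSV file. The following is a minimal sketch, assuming the file has a header row with columns named English and Saraiki. Those column names, the helper name corpus_stats, and the whitespace-based word count are assumptions for illustration and are not documented properties of the release, so the totals it produces may differ slightly from the figures reported here.

```python
import csv
import io


def corpus_stats(csv_file, en_col="English", sk_col="Saraiki"):
    """Count sentence pairs and whitespace-delimited words in a parallel CSV.

    csv_file is any open text file object; en_col and sk_col are
    assumed column names, not confirmed by the dataset card.
    """
    reader = csv.DictReader(csv_file)
    pairs = en_words = sk_words = 0
    for row in reader:
        pairs += 1
        en_words += len(row[en_col].split())
        sk_words += len(row[sk_col].split())
    return {
        "pairs": pairs,
        "english_words": en_words,
        "saraiki_words": sk_words,
        "total_words": en_words + sk_words,
    }


# Tiny in-memory sample standing in for the real file.
sample = io.StringIO(
    "English,Saraiki\n"
    "What's your question?,تُہاݙا سوال کیا ہِے؟\n"
)
stats = corpus_stats(sample)
print(stats)
```

For the real file, open it with encoding="utf-8" and newline="" and pass the file object to corpus_stats in place of the in-memory sample.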

Processing (recommended)

  • Unicode normalization: apply consistent Unicode normalization (e.g., NFC) and standardize whitespace and punctuation.

  • Saraiki script consistency: normalize common spelling variants and ensure consistent use of Saraiki-specific letters (e.g., ݙ, ڄ, ڳ, ں).

  • Alignment checks: verify that each English sentence matches its Saraiki translation; flag missing text, severe length mismatch, or misaligned pairs.

  • De-duplication: remove exact and near-duplicate sentence pairs to reduce repetition and improve training quality.

  • Basic filtering: optionally remove corrupted lines, mixed-script noise, or non-linguistic artifacts introduced during compilation.
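
Several of the steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the steward's own pipeline: the function names and the max_len_ratio threshold of 3.0 are hypothetical choices. It covers NFC normalization, whitespace cleanup, a simple script check, a word-count ratio check, and exact de-duplication. Spelling-variant normalization and near-duplicate detection are not covered, because they need language-specific rules that the dataset card does not specify.

```python
import re
import unicodedata


def normalize(text):
    """Apply NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def has_arabic_script(text):
    """True if any character is in the Arabic or Arabic Supplement block.

    Saraiki letters such as ݙ and ݨ live in the Arabic Supplement
    block (U+0750 to U+077F), so both ranges are checked.
    """
    return any(
        "\u0600" <= ch <= "\u06FF" or "\u0750" <= ch <= "\u077F"
        for ch in text
    )


def clean_pairs(pairs, max_len_ratio=3.0):
    """Yield normalized, de-duplicated (english, saraiki) pairs.

    max_len_ratio is an assumed threshold; tune it on real data.
    """
    seen = set()
    for en, sk in pairs:
        en, sk = normalize(en), normalize(sk)
        if not en or not sk:
            continue  # missing text on one side
        if has_arabic_script(en) or not has_arabic_script(sk):
            continue  # text in the wrong column, or script noise
        n_en, n_sk = len(en.split()), len(sk.split())
        if max(n_en, n_sk) / min(n_en, n_sk) > max_len_ratio:
            continue  # severe length mismatch
        key = (en.casefold(), sk)
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        yield en, sk


raw = [
    ("What's your question?", "تُہاݙا سوال کیا ہِے؟"),
    ("What's  your question? ", "تُہاݙا سوال کیا ہِے؟"),  # duplicate up to whitespace
    ("can you spell that for me please?", ""),  # missing translation
]
cleaned = list(clean_pairs(raw))
print(len(cleaned))
```

Dropping pairs silently is convenient for a sketch; in practice it may be preferable to log or collect the rejected pairs so they can be reviewed by a Saraiki speaker before removal.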

English–Saraiki Sentence Pairs

English | Saraiki
What's your question? | تُہاݙا سوال کیا ہِے؟
Even though it's not an exact copy of the painting, it contains the same elements. | بھان٘ویں جو اِیہ فن پارے دی سان٘ویں نقل کائے نِھیں، اِین٘دے وِچ اُوہے عنصر رَلّے ہوئے ہِن۔
Pick-Up from a Post Parcel Point in your City | آپݨے شہر وِچ ڈاک پارسل جاء کنوں گِھنّو
Appointment as a member of Honour | عِزّت دے ممبر دے طور تے تقرری
so is two minutes to listen? | تاں سُݨن کِیتے ݙُو مِنٹ ہِن؟
can you spell that for me please? | بَھلا تُساں میݙے کِیتے اِین٘دے ہِجّے کر سڳدے ہِیوے؟
In the Replenishment Type Drop down, display all the Replenishment Types Configured. | وَلا بھرݨ دی ونکی ڈراپ ڈاؤن وِچ، وَلا بھرݨ دیاں ساریاں ونکیاں کوں کنفیگرڈ ݙِکھاؤ۔
The latin root mater means mother, find a word in paragraph 2 with the root mater | لاطنی روٹ میٹر دا مطلب امّاں ہِے، روٹ میٹر دے نال پیراگراف 2 وِچ ہِک لوّظ لبّھو۔
you do have a lunch nap | تُساں ݙوپہراں دے کھاݨے دے بعد قیلولہ کِیتا
recognize the differences between prokaryotic and eukaryotic cells.2. | پروکاریوٹک اَتے یوکاریوٹک خلیاں دے وِچالے وِتّھی کوں پِچھاݨو۔2۔