Baloch Publishers Saraiki Literature Corpus

License:

CC-BY-NC-4.0

Steward:

Baloch Publishers Multan

Task: NLP

Release Date: 11/17/2025

Format: TXT

Size: 2.04 MB


Description

This corpus is a collection of one million tokens of Saraiki-language text. The data was produced by Baloch Publishers over the last ten years. The corpus contains literary works, including dictionaries, short stories, novels, fiction, non-fiction, and travelogues. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

The data may not be used by any organization with annual revenue of more than one million USD.

Forbidden Usage

  • Generating, promoting, or distributing hate speech, misinformation, or culturally offensive content.

  • Using the data in commercial or for-profit projects without explicit permission from the dataset creator.

  • Any use that misrepresents or distorts Saraiki literature or the works contained within.

Processes

Ethical Review

The dataset was curated from publicly available or author-shared Saraiki literary sources under ethical self-review by Baloch Publishers. There is no sensitive or copyrighted material in this corpus. The collection aligns with CC-BY-NC-4.0 and principles of cultural respect, transparency, and non-commercial research use.

Intended Use

This dataset is intended for research, education, and non-commercial use in NLP, computational linguistics, and digital humanities. It may be used for training, fine-tuning, or evaluating models for Saraiki language processing, and for linguistic and literary analysis supporting cultural preservation.

Metadata

Language

Saraiki is an Indo-Aryan language spoken by millions in Pakistan's southern Punjab, parts of Sindh, Khyber Pakhtunkhwa, and Balochistan. It has a distinct identity with its own rich literary tradition, including poetry and prose. It shares similarities with both Punjabi and Sindhi, and is written in a Perso-Arabic script.

Content of the Corpus

The corpus contains the following books by multiple authors, written in the Saraiki language:

  • Tohfa-e-Darvaish

  • Multan Kanun Patyalay Taeen

  • Shaukat-ul-Lughat

  • Tafheem Kalaam Baba Farid

  • Saraiki Poems (multiple authors)

  • Qissa Karan Sat Puttha

  • Istalahat-e-Peshawaran

  • Gool

  • Saraiki Afsanchy

  • Koi Asman ten Kalha Hosi

  • Small Saraiki Books Collection

  • Saraiki Sentences (Corpus)

List of Letters

آ ا ب ٻ پ ت ٹ ث ج ڄ چ ح خ د ڈ ݙ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ڳ ل م ن ں ݨ و ہ ھ ی ے
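The letter inventory above can serve as a quick sanity check when loading the TXT files. The following is a minimal sketch, not part of the corpus tooling: the function name `letters_outside_alphabet` is hypothetical, and NFC normalization is an assumption about how the text should be compared. It counts letters in a text that do not appear in the list, which may surface encoding variants or characters from other languages.

```python
import unicodedata
from collections import Counter

# Letters copied from the list above; spaces between them are discarded.
# NFC normalization is an assumption, applied to both the list and the input.
_LETTER_LIST = "آ ا ب ٻ پ ت ٹ ث ج ڄ چ ح خ د ڈ ݙ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ڳ ل م ن ں ݨ و ہ ھ ی ے"
SARAIKI_LETTERS = set(unicodedata.normalize("NFC", _LETTER_LIST)) - {" "}


def letters_outside_alphabet(text):
    """Count letters in `text` that are not in SARAIKI_LETTERS.

    Only characters in a Unicode letter category (L*) are considered, so
    diacritics, digits, punctuation, and whitespace are ignored.
    """
    outside = Counter()
    for ch in unicodedata.normalize("NFC", text):
        if unicodedata.category(ch).startswith("L") and ch not in SARAIKI_LETTERS:
            outside[ch] += 1
    return outside
```

A non-empty result does not by itself indicate an error: running text may legitimately use letters that a basic alphabet list omits, such as hamza-bearing forms.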

Sample Text

ڄیکر ڳالھ ایویں ہے، ول تاں کوئی نہ کوئی مسئلہ ضرور ہے۔ تصوف کیا ہے ۔ خدا کوں ملݨ یا ہونکوں دریافت کرݨ یا ڈیکھݨ دی شدید ترین آرزو دا ڈوجھاناں ہے ۔ اُوندی ٹُردیں کَنڈ کُوں ݙیکھ گھدے توڑے کَنڈ ہئی ساݙا حج تھی پَئے اسلم فقیر دی ڳالھیں سُݨ ٻہوں خوش تھیا ۔ دل وِچ ٻہوں اطمینان کیتا تے ادب نال فقیر کوںسلام کرکے چلا ڳیا۔ میکوں جاء مل ڳئی تے میݙے کھوتے کوں وی جاء مل ڳئی ہُݨ اے عورت توں ٻال ڄما یا نہ ڄما ایندے بعد حضرت نے ٹھیکری (پھکری) انہاں کوں ݙے کے آکھیا جو عورت دے ڈڈھ اُتے اے ٹھیکری رکھ ݙیو۔
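The description gives the corpus size as one million tokens but does not state how tokens were counted. The sketch below assumes simple whitespace tokenization after separating common Urdu-script and ASCII punctuation. The names `count_tokens` and `count_corpus_tokens`, the punctuation set, and the directory layout are all illustrative assumptions, so counts obtained this way may differ from the published figure.

```python
from pathlib import Path

# Punctuation treated as token separators (an illustrative, non-exhaustive set):
# Arabic full stop, Arabic comma, Arabic semicolon, Arabic question mark,
# plus common ASCII marks.
PUNCTUATION = "\u06d4\u060c\u061b\u061f" + "!.,:;()[]\"'"
_TO_SPACE = str.maketrans({ch: " " for ch in PUNCTUATION})


def count_tokens(text):
    """Count whitespace-separated tokens after replacing punctuation with spaces."""
    return len(text.translate(_TO_SPACE).split())


def count_corpus_tokens(directory):
    """Sum token counts over every .txt file in `directory` (assumed UTF-8)."""
    total = 0
    for path in sorted(Path(directory).glob("*.txt")):
        total += count_tokens(path.read_text(encoding="utf-8"))
    return total
```

For example, `count_corpus_tokens("corpus")` would report a total for a hypothetical local folder named `corpus` holding the downloaded files.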