Kaleem Art Press Saraiki Literature Corpus
License: CC-BY-NC-4.0
Steward: Kaleem Art Press
Task: OTH
Release Date: 12/3/2025
Format: TXT
Size: 1.84 MB
Description
This corpus contains approximately one million tokens of Saraiki text curated over the past ten years by Kaleem Art Press. It features a wide range of literary genres, including stories, short stories, novels, fiction, non-fiction, travelogues, poetry, biographies, and historical writings. All content is shared with full author approval. The dataset is intended to support linguistic research, Saraiki language technology development, and the preservation of cultural and literary heritage.
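The card does not say how the "approximately one million tokens" figure was computed. As a rough sanity check, tokens can be counted by splitting on whitespace. The sketch below assumes that convention (punctuation such as ۔ stays attached to the preceding word), and the helper name count_tokens is hypothetical, not part of the release.

```python
def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens in a string.

    Assumption: a "token" is any run of non-whitespace characters.
    The corpus authors may have used a different definition.
    """
    return len(text.split())


if __name__ == "__main__":
    sample = "سرائیکی زُبان کُوں پڑھن کوئی مشکل نِیں۔"
    print(count_tokens(sample))
```

Applied to the full text of each TXT file and summed, this gives a figure that can be compared with the stated size, bearing in mind that a different tokenization will give a different count.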
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
The data may not be used by any organization with annual revenue exceeding USD 1 million.
Forbidden Usage
Generating, promoting, or distributing hate speech, misinformation, or culturally offensive content. Commercial or for-profit projects without explicit permission from the dataset creator. Any use that misrepresents or distorts Saraiki literature or the works contained within.
Processes
Ethical Review
The dataset was curated from publicly available and author-contributed Saraiki literary sources, following an ethical self-review process by Kaleem Art Press. It contains no sensitive content or unauthorized copyrighted material. The collection complies with the CC-BY-NC-4.0 license and upholds principles of cultural respect, transparency, and non-commercial research use.
Intended Use
This dataset is intended for research, education, and non-commercial use in NLP, computational linguistics, and digital humanities. It may be used for training, fine-tuning, or evaluating models for Saraiki language processing, and for linguistic and literary analysis supporting cultural preservation.
Metadata
Language
Saraiki (سرائیکی) is an Indo-Aryan language spoken by millions across southern Punjab and parts of Sindh, Khyber Pakhtunkhwa, and Balochistan in Pakistan. It has a distinct linguistic and cultural identity, supported by a rich literary tradition in poetry and prose. While it shares features with both Punjabi and Sindhi, Saraiki remains unique in its phonology and vocabulary. The language is written using a Perso-Arabic script.
Content of the Corpus
The corpus contains the following books, written in Saraiki by multiple authors:
• Habrras
• Mazameen-e-Quran
• Multan di Riwayati Khattati
• Saukhy Saraiki Tarjamay Ala Quran Majeed
• Al Marjaan Fe Tarjmatil Quran
• Akhaanr (Saraiki Proverbs)
• Saraiki Idioms Collection
• Saraiki Sentences Collection
• Translation of Surah Baqrah
• Tareekh Qutub Shahi Khokhar
• Tareekh Nama (Manzoom)
• Waseem Siddiqui Dharti da Shair
• Ashk-e-Aqeedat
• Majmoa-e-Kalam Mushtaq Sabqat
• Saraiki Adab Tareekh Nawisi
• Wichriyan Koonjan
• Saraiki Chonrrwain Likhtan
List of Letters
آ ا ب ٻ پ ت ٹ ث ج ڄ چ ح خ د ڈ ݙ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ڳ ل م ن ں ݨ و ہ ھ ی ے
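One way to use the list above is to flag letters in a text that fall outside it, for example when cleaning or validating input. The following is a minimal sketch, not a tool shipped with the corpus: the names SARAIKI_LETTERS and unlisted_letters are hypothetical, and the check ignores diacritics, digits, punctuation, and whitespace by looking only at characters whose Unicode general category starts with "L". Letters that occur in written Saraiki but are not in the list (such as ئ or ء) will be reported, so the result is a prompt for inspection rather than a list of errors.

```python
import unicodedata

# Letters copied from the list above; each entry is one space-separated item.
SARAIKI_LETTERS = set(
    "آ ا ب ٻ پ ت ٹ ث ج ڄ چ ح خ د ڈ ݙ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ "
    "ف ق ک گ ڳ ل م ن ں ݨ و ہ ھ ی ے".split()
)


def unlisted_letters(text: str) -> set:
    """Return the letters in `text` that are not in SARAIKI_LETTERS.

    Only characters in a Unicode letter category (L*) are considered,
    so combining marks, digits, punctuation, and spaces are skipped.
    """
    return {
        ch
        for ch in text
        if unicodedata.category(ch).startswith("L") and ch not in SARAIKI_LETTERS
    }


if __name__ == "__main__":
    print(unlisted_letters("سرائیکی زُبان abc"))
```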
Sample Text:
• مغلیہ دَور وی خطاطی دے حوالے نال سنہری دور سݙوِیندا ہِے۔ ایں دور اِچ خطاط صاحبان کوں شاہی سرپرستی حاصل ہئی۔ خطاطی دے ماہراں کوں اعلیٰ منصب اَتے جاگیراں ݙتیاں ڳیا۔ سرائیکی زُبان کُوں پڑھن کوئی مشکل نِیں۔ کئی سرائیکی بِھرا آپݨی ماء ٻولی کُوں اِین٘ویں اوپرا تِھی تِھی تے پڑھدے ہِن ڄیویں انہاں کݙاہیں ٻولی وی نہ ہووےـ
• جسرت کھوکھر تلنبہ دا حاکم ہا۔ جݙاں امیر تیمور نے ملتان توں لاہور آلے پاسے پیش قدمی کیتی، تاں اے رستے وچ مزاحم تھیا اتے شکست کھا تے گرفتار تھیا۔ بعد وچ اینکوں ݙو لکھ روپے دے بدلے رہا کر ݙتا ڳیا۔
• ایہ ہِک خاص کتاب ہے، جِیندے وِچ کوئی شک والی ڳالھ کائے نھیں پرہیز گار لوکاں کِیتے ہدایت ہے
• ڄہڑیلے اَوازار تِھیسِن اڳواݨ اپݨے پیروکاراں توں، اَتے ݙیکھسِن عذاب اَتے کپِیڄ ویسِن اُنّھاں سبھ دیاں ݙوراں۔
