Rana Printers Urdu Literature Corpus
License:
CC-BY-NC-4.0
Steward:
Rana Printers Multan
Task: OTH
Release Date: 12/3/2025
Format: TXT
Size: 3.00 MB
Description
This corpus comprises 1.68 million tokens of high-quality Urdu text collected over the past decade through Rana Printers. It includes a diverse range of literary genres such as stories, short stories, novels, fiction, non-fiction, poetry, and historical works. All content is shared with the authors’ approval. The dataset is intended to support linguistic research, Urdu language technology development, and the preservation of literary and cultural heritage.
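The card does not state which tokenizer produced the 1.68 million figure. As a minimal sketch, assuming the TXT file is UTF-8 encoded and that tokens are separated by whitespace, a rough count could be obtained as shown below. The function name `count_whitespace_tokens` and both assumptions are illustrative and are not part of the corpus release, so the result may differ from the reported figure.

```python
from pathlib import Path


def count_whitespace_tokens(path: str, encoding: str = "utf-8") -> int:
    """Count whitespace-separated tokens in a text file, reading line by line.

    Assumption: whitespace splitting approximates the tokenization used for
    the reported corpus size; the card does not confirm this.
    """
    total = 0
    with Path(path).open(encoding=encoding) as handle:
        for line in handle:
            # str.split() with no arguments splits on any run of whitespace
            total += len(line.split())
    return total
```

Reading line by line keeps memory use low, although at 3.00 MB the whole file would also fit comfortably in memory.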
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
The data may not be used by any organization with annual revenue of more than one million USD.
Forbidden Usage
The following uses are forbidden:
• Generating, promoting, or distributing hate speech, misinformation, or culturally offensive content.
• Commercial or for-profit projects without explicit permission from the dataset creator.
• Any use that misrepresents or distorts Urdu literature or the works contained within.
Processes
Ethical Review
The dataset was curated from publicly available Urdu literary sources and author-contributed materials, following an ethical self-review process by Rana Printers Multan. It contains no sensitive, restricted, or unauthorized copyrighted content. The collection adheres to CC-BY-NC-4.0 licensing and upholds principles of cultural respect, transparency, and non-commercial research use.
Intended Use
This dataset is intended for research, education, and non-commercial use in NLP, computational linguistics, and digital humanities. It may be used for training, fine-tuning, or evaluating models for Urdu language processing, and for linguistic and literary analysis supporting cultural preservation.
Metadata
Language
Urdu (اُردو) is an Indo-Aryan language that serves as the national language of Pakistan and a recognized official language in parts of India. While it shares linguistic roots with Hindi, Urdu is distinguished by its extensive vocabulary influenced by Persian and Arabic and its use of the Nastaliq-style Perso-Arabic script, written from right to left.
Content of the Corpus
The corpus contains the following books, written in Urdu by multiple authors:
• Social Media Usage Instructions
• Anwar-e-Khatam-e-Nabowat
• Riyasat ka Taleemi Nizam
• Iztarab Aur Tangdasti ka Elaaj
• Jannat Ka Rasta
• Safar-e-Aqeedat
• Mukhtasir Sawaneh Hayaat Hazrat Al-Sheikh Behlavi
• Ye Mera Pakistan Hay
• Malfozat-e-Behlavi
• Tazkiya-tul-Aamaal
• Mawaiz-e-Sheikh Behlavi
• Fawaid-e-Quran Almaroof Ba Istalahaat-ul-Quran
• Zaad-ul-Miaad
• Khawateen Kay Liye Anmol Moti
• Quran Ka Faisla
• Khair-ul-Azkaar
• Hayat-un-Nabi
• Ilmi Majalis
• Jawahir-e-Khutbaat
• Barailviyat Kay Baghi Ulama-O-Mashaikh
• Maarfat Kay Phool
• Shaitaan Ki Hikayaat
• Khateeb-e-Azam Ameer-e-Shariat Syed Ata-Ullah Shah Bukhari
List of Alphabets
ا آ ب پ ٹ ث چ ح د ڈ ذ ڑ ژ س ش ص ط ظ ع غ ف ق ک گ ل م ن ں و ہ ھ ی ے ئ ء
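As one possible use of the letter list above, the sketch below separates the characters of a text into those that appear in the list and those that do not. The name `letter_coverage` is hypothetical, and the set is simply copied from the list as given here; it is not an official tool or definition shipped with the corpus.

```python
from collections import Counter

# Letter inventory copied from the list above.
# Assumption: each letter is treated as a single Unicode code point.
URDU_LETTERS = set("اآبپٹثچحدڈذڑژسشصطظعغفقکگلمنںوہھیےئء")


def letter_coverage(text: str) -> tuple[Counter, Counter]:
    """Count non-whitespace characters that are in the list and those that are not."""
    listed, unlisted = Counter(), Counter()
    for ch in text:
        if ch.isspace():
            continue
        # Route each character to the matching counter
        (listed if ch in URDU_LETTERS else unlisted)[ch] += 1
    return listed, unlisted
```

Diacritics, digits, punctuation, and any letter absent from the list end up in the second counter, which makes it a quick way to see which other characters occur in a given file.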
Sample Text
• زندگی ایک سفر ہے یہ سفر اُن ہستیوں کیلئے وسیلہ ظفر ہے جو ہمہ وقت، اپنے خالق کی یاد میں سانسوں کو مہکائے رکھتے ہیں اور حلقۂ ولایت میں شامل یہ ہستیاں اپنی قربتوں کی برکت سے مخلوق خدا کو فیض یاب کرتے ہیں۔
• علی رضا کے ساتھ گنگناتے ہوئے بچوں نے سر شفیق کی طرف دیکھا تو وہ بھی یہ شعر گنگنا رہے تھے۔
• انسانی وجوہ میں دو قوتیں رکھی ہیں۔ اگر ان کی اصلاح ہو جائے تو انسان کو نجات و سعادتِ عظمی مل سکتی ہے۔
• ماں باپ کی ناراضگی دنیا کا خسارہ ہے پیر استاد کی ناراضگی روحانیت کا خسارا ہے ۔
• شریعت احکام زندگی کا نام ہے اور طریقت یا تصوف اخلاق رزیلہ کے دفعیہ واخلاق رزیلہ حمیدہ سے متصف ہونے کا نام ہے۔
