IBT Torwali Literature Corpus
License: CC-BY-NC-4.0
Steward: Collaborative Action For Research & Development (CARD)
Task: NLP
Release Date: 4/7/2026
Format: TXT
Size: 488.12 KB
Description
The IBT Torwali Literature Corpus is a text dataset of about 233,000 tokens drawn from multiple literary and cultural domains, including poetry, folktales, biographies, educational materials, religious translations, and other literary texts. It contains 21 UTF-8 encoded text files and is intended to support language preservation and NLP research.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
This dataset is intended for research, educational, and non-commercial use only.
Forbidden Usage
This dataset may not be used for commercial purposes; it is to be used only for research and educational purposes.
Processes
Ethical Review
This dataset includes no personal data. It must be used responsibly and with respect for Torwali culture and language.
Intended Use
This dataset is intended for Natural Language Processing (NLP) of the Torwali language.
Metadata
Language
Torwali (توروالی), also known as Bahrain Kohistani, is an Indo-Aryan language of the Kohistani group spoken in the Swat Kohistan region of northern Pakistan. It is considered an endangered Dardic language with approximately 100,000–120,000 speakers.
Torwali has deep historical roots and is believed to retain features close to those of ancient Gandhari. Revitalization efforts, including orthography development and community-based education programs, have been ongoing since the early 2000s.
Domains of the Text
Literature (Fiction / non-fiction)
Poetry (Aesthetic / cultural expression)
Folklore & Oral Tradition (Textual form)
Everyday Social Themes
Cultural Knowledge & Heritage
Articles (Aesthetic / cultural expression)
Torwali Script
آ اَ ٲ ب پ ت ٹ ث ج چ ڇ خ د ذ ڑ ر ز ڙ ژ ط ض ص ش ݜ س ظ غ ف ق ک گ ل م ن و ہ ی ء او
Dataset Structure
The dataset contains 21 text files.
Each file represents a specific literary or thematic category.
All files are cleaned, Unicode-normalized, and encoded in UTF-8.
Each file serves as an independent domain/genre container.
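Because each file is an independent UTF-8 text container, the corpus can be read into memory with a few lines of Python. The following is a minimal sketch, not an official loader: the function name load_corpus and the directory argument are hypothetical, and it assumes the 21 files have been downloaded into a single local folder.

```python
from pathlib import Path


def load_corpus(corpus_dir):
    """Read every .txt file in corpus_dir as UTF-8, keyed by file name.

    corpus_dir is a hypothetical local folder holding the downloaded files.
    """
    corpus = {}
    # Sorting keeps the files in their numbered order (01-..., 02-..., ...).
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        corpus[path.name] = path.read_text(encoding="utf-8")
    return corpus
```

Keeping the file name as the key preserves the domain/genre information, so a downstream experiment can select or exclude individual genres.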
File-Level Metadata
01-Torwali Literature (Poems & Articles) by Mahmood & Nasir - 16283 T.txt
02-Torwali Literature (Poetry by Saleem Janbaaz) - 13383 T.txt
03-Torwali Literature (Funny Poetry by Saleem Janbaaz) - 4146 T.txt
04-Kashmala - Torwali Romantic Novel - by Iqbal Khan - Idarah Baraye Taleem-o-Taraqi Bahrain - 9304 T.txt
05-Fatima Jinnah - Biography - by Idarah Baraye Taleem-o-Taraqi Bahrain - 12952 T.txt
06-Allama Muhammad Iqbal - Biography - by Idarah Baraye Taleem-o-Taraqi Bahrain - 6841 T.txt
07-Sawanih Maulana Room - Biography - by Idarah Baraye Taleem-o-Taraqi Bahrain - 21635 T.txt
08-توروالی متل - Torwali Matal - 2354 T.txt
09-Torwali Folktales - by Javid Iqbal - 10290 T.txt
10-Torwali Folktales Collection - 15842 T.txt
11-Maulana Romi - Biography - 18620 T.txt
12-Torwali Teacher Guide - by Idarah Baraye Taleem-o-Taraqi Bahrain - 18019 T.txt
13-Last 10 Parah of Quran - Torwali Translation - by Idarah Baraye Taleem-o-Taraqi Bahrain - 5720 T.txt
14-Muhammad Ali Jinnah - Biography - by Idarah Baraye Taleem-o-Taraqi Bahrain - 11065 T.txt
15-Torwali Misl (Proverbs) - 2354 T.txt
16-Torwali Sentences - 4579 T.txt
17-Torwali Poetry Collection by Javid Iqbal - 2090 T.txt
18-Torwali Folktales - by Javid Iqbal - 10290 T.txt
19-Hazrat Sultan Bahoo - Biography - Translated by Rahim Sabir - 14153 T.txt
20-Torwali Literature (Stories, Folktales, Articles & Mix Text) - 24477 T.txt
21-Torwali Poems - 8712 T.txt
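The file names above appear to follow the pattern "<number>-<title> - <count> T.txt". The sketch below extracts those three parts. It rests on two assumptions not stated in the source: that the pattern holds for every file, and that the trailing number followed by "T" is the file's token count. The second assumption is consistent with the listing, since the 21 counts sum to 233,109, matching the "about 233,000 tokens" given in the description. The names NAME_PATTERN and parse_file_name are illustrative.

```python
import re

# Assumed naming pattern, inferred from the listing:
# "<NN>-<title> - <count> T.txt"
NAME_PATTERN = re.compile(r"^(\d+)-(.+) - (\d+) T\.txt$")


def parse_file_name(name):
    """Split a corpus file name into index, title and assumed token count.

    Returns None when the name does not match the assumed pattern.
    """
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    index, title, count = match.groups()
    return {"index": int(index), "title": title, "tokens": int(count)}


# Usage: sum the assumed token counts over a couple of listed files.
names = [
    "01-Torwali Literature (Poems & Articles) by Mahmood & Nasir - 16283 T.txt",
    "04-Kashmala - Torwali Romantic Novel - by Iqbal Khan - Idarah Baraye Taleem-o-Taraqi Bahrain - 9304 T.txt",
]
total = sum(parse_file_name(n)["tokens"] for n in names)
print(total)
```

The greedy title group means titles that themselves contain " - " are kept whole, and only the final " - <count> T.txt" segment is read as the count.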
Cleaning (Clean Layer)
UTF-8 encoding applied.
Unicode normalization for script consistency.
White-space and punctuation cleanup.
Removal of stray symbols and formatting artifacts.
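The cleaning steps listed above could be approximated as follows. This is only an illustrative sketch, not the pipeline actually used to build the corpus: the source does not say which Unicode normalization form was applied (NFC is assumed here), nor which symbols counted as stray (only control characters are removed here), and the function name clean_text is hypothetical.

```python
import re
import unicodedata


def clean_text(text):
    """Approximate the listed cleaning steps on a single string."""
    # Assumption: NFC, so that base letter + combining mark sequences
    # are composed into one code point where a composed form exists.
    text = unicodedata.normalize("NFC", text)
    # Assumption: "stray symbols" are treated as control characters
    # (category Cc); newlines and tabs are kept for the next steps.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line and drop lines that end up empty.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)
```

Format characters such as the zero-width non-joiner are deliberately left untouched, since they can be meaningful in Arabic-script text.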
Sample Text
ڈاکٹر علامہ محمد اقبال بیِشم صدی سی پیانیل شاعر، لیِکھک، وکیل، سیاست دان، مسلمان صوفی آں تحریک پاکستان سی خاص گیِر خلَگا می شامل آشُو۔ علامہ اقبال اُردو او فارسی جیِب می شاعری کؤدُود۔ علامہ اقبال سی مشہوری سی اصل وجہ تیِسی شاعری أشی۔
تیسما علاوہ تمام پاکستان می پد کے پدے خلگ ئے فاطمہ جناح سی غائبانہ زیِناز گُوزاد آں تیِسی آرواحا قرآن شریف سی ختم ہُم کےکیدے بگوشیی۔
اقبال خان مھی مأشو آشُو۔ عُمُو می مھأما گھن آشُو خو اے غورا دوست ہُم آشُو تے وجہ دے آ تنُو حی سی بأت تیسیت کؤبھؤدُود۔ اقبال خان سی شاعری بُوڑا چیر مأ تنُو آواز می بینی ئی۔ اقبال سی چھلے شاعر توروالی جیِب می چیر کم ہونیِن۔
تعلیم اوتربیت : مولانا روم بُنیادی سبق تنُو بوپ شمس العلماء ما بنُوشُو۔ تیلا پأش مولانا تیِسی بوپ تنُو خاص مُرید آں گھن عالم ولی اللّٰہ سیّد بُرہان الدین ترمذی سی حوالہ کی ۔
کھوؤ اچار شیِدل کھادو گھومار ڈھے می کیمی ہوئی ڇھیک نین