IBT Torwali Literature Corpus

License: CC-BY-NC-4.0

Steward: Collaborative Action For Research & Development (CARD)

Task: NLP

Release Date: 4/7/2026

Format: TXT

Size: 488.12 KB


Description

The IBT Torwali Literature Corpus is a text dataset of about 233,000 tokens drawn from multiple literary and cultural domains, including poetry, folktales, biographies, educational materials, religious translations, and other literary texts. It contains 21 UTF-8 encoded text files and is intended to support language preservation and NLP research.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is intended for research, educational, and non-commercial use only.

Forbidden Usage

This dataset must not be used for commercial purposes; it may be used only for research and educational purposes.

Processes

Ethical Review

The dataset contains no personal data. It must be used responsibly and with respect for Torwali culture and language.

Intended Use

This dataset is intended for natural language processing (NLP) research on the Torwali language.

Metadata

Language

Torwali (توروالی), also known as Bahrain Kohistani, is an Indo-Aryan language of the Kohistani group spoken in the Swat Kohistan region of northern Pakistan. It is considered an endangered Dardic language with approximately 100,000–120,000 speakers.

Torwali has deep historical roots and is believed to retain features close to ancient Gandhari languages. Revitalization efforts, including orthography development and community-based education programs, have been ongoing since the early 2000s.

Domains of the Text

  • Literature (Fiction / non-fiction)

  • Poetry (Aesthetic / cultural expression)

  • Folklore & Oral Tradition (Textual form)

  • Everyday Social Themes

  • Cultural Knowledge & Heritage

  • Articles (Aesthetic / cultural expression)

Torwali Script

آ اَ ٲ ب پ ت ٹ ث ج چ ڇ خ د ذ ڑ ر ز ڙ ژ ط ض ص ش ݜ س ظ غ ف ق ک گ ل م ن و ہ ی ء او

Dataset Structure

  • The dataset contains 21 text files.

  • Each file represents a specific literary or thematic category.

  • All files are cleaned, normalized, and encoded in UTF-8.

  • Each file serves as an independent domain/genre container.
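Because each file is an independent UTF-8 text file, the corpus can be read with the standard library alone. The following is a minimal sketch, not an official loader: the function name load_corpus and the idea that all 21 files sit together in one local directory are assumptions made for illustration.

```python
from pathlib import Path


def load_corpus(root):
    """Read every .txt file directly under `root` as UTF-8.

    Returns a dict mapping file name -> full file text, so each
    domain/genre file stays separate. `root` is a hypothetical local
    directory holding the downloaded files.
    """
    corpus = {}
    for path in sorted(Path(root).glob("*.txt")):
        # Decode explicitly as UTF-8 rather than relying on the
        # platform default encoding.
        corpus[path.name] = path.read_text(encoding="utf-8")
    return corpus
```

Calling load_corpus on the download directory would give one entry per file, which keeps genres apart for per-domain experiments.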

File-Level Metadata

  • 01-Torwali Literature (Poems & Articles) by Mahmood & Nasir - 16283 T.txt

  • 02-Torwali Literature (Poetry by Saleem Janbaaz) - 13383 T.txt

  • 03-Torwali Literature (Funny Poetry by Saleem Janbaaz) - 4146 T.txt

  • 04-Kashmala - Torwali Romantic Novel - by Iqbal Khan - Idarah Baraye Taleem-o-Taraqi Bahrain - 9304 T.txt

  • 05-Fatima Jinnah - Biography - by Idarah Baraye Taleem-o-Taraqi Bahrain - 12952 T.txt

  • 06-Allama Muhammad Iqbal - Biography - by Idarah Baraye Taleem-o-Taraqi Bahrain - 6841 T.txt

  • 07-Sawanih Maulana Room - Biography - by Idarah Baraye Taleem-o-Taraqi Bahrain - 21635 T.txt

  • 08-توروالی متل - Torwali Matal - 2354 T.txt

  • 09-Torwali Folktales - by Javid Iqbal - 10290 T.txt

  • 10-Torwali Folktales Collection - 15842 T.txt

  • 11-Maulana Romi - Biography - 18620 T.txt

  • 12-Torwali Teacher Guide - by Idarah Baraye Taleem-o-Taraqi Bahrain - 18019 T.txt

  • 13-Last 10 Parah of Quran - Torwali Translation - by Idarah Baraye Taleem-o-Taraqi Bahrain - 5720 T.txt

  • 14-Muhammad Ali Jinnah - Biography - by Idarah Baraye Taleem-o-Taraqi Bahrain - 11065 T.txt

  • 15-Torwali Misl (Proverbs) - 2354 T.txt

  • 16-Torwali Sentences - 4579 T.txt

  • 17-Torwali Poetry Collection by Javid Iqbal - 2090 T.txt

  • 18-Torwali Folktales - by Javid Iqbal - 10290 T.txt

  • 19-Hazrat Sultan Bahoo - Biography - Translated by Rahim Sabir - 14153 T.txt

  • 20-Torwali Literature (Stories, Folktales, Articles & Mix Text) - 24477 T.txt

  • 21-Torwali Poems - 8712 T.txt
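Each file name above starts with a two-digit index and ends with a number followed by "T". Assuming that trailing number is the file's token count (the 21 listed numbers sum to 233,109, which agrees with the stated size of about 233,000 tokens), the metadata can be recovered from the names alone. The sketch below is illustrative; parse_file_name is a hypothetical helper, not part of the dataset.

```python
import re

# Leading index, e.g. "03-", and trailing count, e.g. "- 4146 T.txt".
# Reading "T" as "tokens" is an assumption based on the listing.
_INDEX_RE = re.compile(r"^(\d+)-")
_TOKENS_RE = re.compile(r"-\s*(\d+)\s*T\.txt$")


def parse_file_name(name):
    """Return (index, token_count) parsed from a corpus file name."""
    index_match = _INDEX_RE.search(name)
    tokens_match = _TOKENS_RE.search(name)
    if index_match is None or tokens_match is None:
        raise ValueError(f"unexpected file name format: {name!r}")
    return int(index_match.group(1)), int(tokens_match.group(1))


# Three names copied from the listing above.
names = [
    "03-Torwali Literature (Funny Poetry by Saleem Janbaaz) - 4146 T.txt",
    "16-Torwali Sentences - 4579 T.txt",
    "21-Torwali Poems - 8712 T.txt",
]
parsed = [parse_file_name(n) for n in names]
total_tokens = sum(count for _, count in parsed)
```

Counts parsed this way reflect the names only; they may differ from what a particular tokenizer produces on the file contents.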

Cleaning (Clean Layer)

  • UTF-8 encoding applied.

  • Unicode normalization for script consistency.

  • White-space and punctuation cleanup.

  • Removal of stray symbols and formatting artifacts.
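The cleaning steps listed above could be approximated as follows. This is a sketch under stated assumptions, not the pipeline actually used for the corpus: the normalization form (NFC), the set of characters treated as stray symbols, and the function name clean_text are all choices made for illustration.

```python
import re
import unicodedata


def clean_text(text):
    """Apply a simple clean layer to one document.

    Assumptions: NFC normalization; "stray symbols" means control
    characters and U+FFFD. Format characters such as the zero-width
    non-joiner are kept, since they can be meaningful in Arabic-script
    text.
    """
    # 1. Unicode normalization for script consistency (NFC assumed).
    text = unicodedata.normalize("NFC", text)

    # 2. Drop control characters (except newline and tab) and the
    #    replacement character left behind by bad decoding.
    text = "".join(
        ch for ch in text
        if ch in "\n\t"
        or (unicodedata.category(ch) != "Cc" and ch != "\ufffd")
    )

    # 3. Collapse runs of spaces, tabs, and no-break spaces, then trim
    #    each line.
    lines = [
        re.sub(r"[ \t\u00a0]+", " ", line).strip()
        for line in text.split("\n")
    ]

    # 4. Drop empty lines left over from formatting artifacts.
    return "\n".join(line for line in lines if line)
```

Punctuation cleanup is deliberately left out here, because the rules applied to Torwali punctuation are not described in the source.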

Sample Text

  • ڈاکٹر علامہ محمد اقبال بیِشم صدی سی پیانیل شاعر، لیِکھک، وکیل، سیاست دان، مسلمان صوفی آں تحریک پاکستان سی خاص گیِر خلَگا می شامل آشُو۔ علامہ اقبال اُردو او فارسی جیِب می شاعری کؤدُود۔ علامہ اقبال سی مشہوری سی اصل وجہ تیِسی شاعری أشی۔

  • تیسما علاوہ تمام پاکستان می پد کے پدے خلگ ئے فاطمہ جناح سی غائبانہ زیِناز گُوزاد آں تیِسی آرواحا قرآن شریف سی ختم ہُم کےکیدے بگوشیی۔

  • اقبال خان مھی مأشو آشُو۔ عُمُو می مھأما گھن آشُو خو اے غورا دوست ہُم آشُو تے وجہ دے آ تنُو حی سی بأت تیسیت کؤبھؤدُود۔ اقبال خان سی شاعری بُوڑا چیر مأ تنُو آواز می بینی ئی۔ اقبال سی چھلے شاعر توروالی جیِب می چیر کم ہونیِن۔

  • تعلیم اوتربیت : مولانا روم بُنیادی سبق تنُو بوپ شمس العلماء ما بنُوشُو۔ تیلا پأش مولانا تیِسی بوپ تنُو خاص مُرید آں گھن عالم ولی اللّٰہ سیّد بُرہان الدین ترمذی سی حوالہ کی ۔

  • کھوؤ اچار شیِدل کھادو گھومار ڈھے می کیمی ہوئی ڇھیک نین