Khowar Literature Corpus by FLI
License: CC-BY-NC-4.0
Steward: Forum for Language Initiatives
Task: NLP
Release Date: 2/10/2026
Format: TXT
Size: 244.85 KB
Description
The Khowar Literature Corpus by FLI is a curated multi-genre textual dataset consisting of 12 UTF-8 encoded text files with a total of 108K tokens. It includes literary works, poetry, folklore narratives, magazine editions, translated books, articles, reports, and official documents, and supports corpus linguistics, low-resource NLP, digital humanities research, and the preservation of Khowar linguistic and cultural heritage.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
This dataset is provided strictly for research, academic, educational, and non-commercial purposes. Proper citation and acknowledgment of the original authors, translators, publishers, and dataset curators are required.
Forbidden Usage
- You agree not to attempt to identify any individuals referenced in the dataset.
- The dataset must not be used to infer personal, private, or sensitive information.
- It is forbidden to use this dataset to train chatbots or large language models intended for commercial deployment.
- Any use that misrepresents cultural, religious, or social contexts is prohibited.
Processes
Ethical Review
All data included in this corpus originates from published literary, educational, and publicly distributed sources. Permissions were obtained from original authors or translators where applicable. The dataset does not contain private or confidential information, and ethical standards regarding consent, attribution, and responsible reuse have been fully observed.
Intended Use
This dataset is intended for linguistic research, corpus analysis, low-resource NLP tasks, language modeling, educational use, and cultural and literary documentation of the Khowar language.
Metadata
Language Information
Khowar (کھووار), also known as Chitrali, is an Indo-Aryan language of the Dardic group spoken primarily in the Chitral region of Pakistan. It serves as the lingua franca of Chitral and is also spoken in Gilgit-Baltistan, Upper Swat, and by migrant communities in major urban centers such as Islamabad, Karachi, Lahore, and Peshawar. Khowar is also used as a second language by the Kalash people.
Domains of the Text
Literature (creative writing)
Poetry (aesthetic and cultural expression)
Folklore and oral tradition (textual form)
Everyday social themes
Cultural knowledge and heritage
Magazine corpus (two editions)
Translated informational and legal texts
Script Information
The corpus is written in the Perso-Arabic Khowar script and includes language-specific characters unique to Khowar orthography.
Khowar Alphabet
چ ج ث ٹ ت پ ب ا
ڈ د ځ څ ݮ ݯ خ ح
ش س ݱ ژ ڑ ز ر ذ
ف غ ع ظ ط ض ص ݰ
ہ و ن م ل گ ک ق
ے ی ء
Khowar-specific characters:
ݰ ݱ ځ څ ݮ ݯ
(ݰاپیک، ݱانگ، ځوخ، څیق، ݮینݮیر، ݯونگو)
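As a rough illustration of how these six letters can be located in running text, the sketch below counts their occurrences. The name `count_khowar_specific` is hypothetical and not part of the corpus; the character set is copied from the list above.

```python
from collections import Counter

# The six Khowar-specific letters, copied from the list above.
KHOWAR_SPECIFIC = frozenset("ݰݱځڅݮݯ")


def count_khowar_specific(text: str) -> Counter:
    """Count occurrences of each Khowar-specific letter in `text`."""
    return Counter(ch for ch in text if ch in KHOWAR_SPECIFIC)


if __name__ == "__main__":
    # Example words from the list above.
    sample = "ݰاپیک، ݱانگ، ځوخ، څیق، ݮینݮیر، ݯونگو"
    for ch, n in sorted(count_khowar_specific(sample).items()):
        # Print the code point next to each letter for easy lookup.
        print(f"U+{ord(ch):04X} {ch}: {n}")
```

A count like this can serve as a quick check that a file has kept the Khowar-specific letters rather than having them replaced by look-alike Urdu or Arabic characters.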
Dataset Structure and Processing
Total files: 12 text files
Total size: 108,537 tokens
Encoding: UTF-8
Each file represents a distinct genre or domain
File names match their internal content
Cleaned version included after Unicode normalization
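A minimal sketch of reading the files and counting tokens follows. It assumes the corpus has been unpacked into a local directory of `.txt` files (the directory path and the function name `count_tokens` are hypothetical). It also assumes whitespace tokenization. The card does not state how tokens were counted, so the results may differ from the reported figures.

```python
from pathlib import Path


def count_tokens(corpus_dir: str) -> dict[str, int]:
    """Return whitespace-token counts for every .txt file in `corpus_dir`."""
    counts = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        # The card states that all files are UTF-8 encoded.
        text = path.read_text(encoding="utf-8")
        # Assumption: a token is any run of non-whitespace characters.
        counts[path.name] = len(text.split())
    return counts
```

Calling `count_tokens("khowar_corpus")` on a hypothetical local copy would return one entry per file, which can then be compared against the file-level figures.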
File-Level Metadata
Khowar Language Introduction – 1,519 tokens
Khowar Nama Magazine (11th Edition, 2018) – 14,905 tokens
Khowar Nama Magazine (12th Edition, 2019) – 16,500 tokens
Nivishia Introduction – 236 tokens
Qaso Kirdar – 166 tokens
Khowar Masaan Naam – 53 tokens
Robinson Crusoe (Translated) – 9,791 tokens
First Bike Ride (Article) – 492 tokens
Afsans (Book) – 26,556 tokens
Uray (Book) – 32,258 tokens
Fourth MTB MLE Conference Report (Translated) – 2,009 tokens
Universal Declaration of Human Rights (Translated) – 4,052 tokens
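The per-file figures above can be checked against the stated corpus total with a short calculation. The dictionary below copies the titles and token counts from the list; the variable names are illustrative only.

```python
# Token counts per file, copied from the file-level metadata above.
FILE_TOKENS = {
    "Khowar Language Introduction": 1_519,
    "Khowar Nama Magazine (11th Edition, 2018)": 14_905,
    "Khowar Nama Magazine (12th Edition, 2019)": 16_500,
    "Nivishia Introduction": 236,
    "Qaso Kirdar": 166,
    "Khowar Masaan Naam": 53,
    "Robinson Crusoe (Translated)": 9_791,
    "First Bike Ride (Article)": 492,
    "Afsans (Book)": 26_556,
    "Uray (Book)": 32_258,
    "Fourth MTB MLE Conference Report (Translated)": 2_009,
    "Universal Declaration of Human Rights (Translated)": 4_052,
}

total = sum(FILE_TOKENS.values())
print(len(FILE_TOKENS), total)  # 12 108537

# Share of the corpus contributed by each file, largest first.
for name, n in sorted(FILE_TOKENS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{n / total:6.1%}  {name}")
```

The sum matches the stated total of 108,537 tokens across 12 files.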
Cleaning and Normalization
UTF-8 encoding
Unicode normalization
Whitespace and punctuation cleanup
Removal of stray symbols and markup
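The steps above could be sketched roughly as follows. This is not the curators' actual pipeline: the normalization form (NFC), the set of characters treated as stray symbols, and the function name `clean_text` are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumption: zero-width space, directional marks, and the BOM are "stray symbols".
# The zero-width non-joiner (U+200C) is left alone because Perso-Arabic text uses it.
_STRAY = re.compile("[\u200b\u200e\u200f\ufeff]")
# Assumption: "markup" means HTML/XML-style tags.
_TAGS = re.compile(r"<[^>]+>")
# Runs of spaces, tabs, and no-break spaces (line breaks are kept).
_SPACES = re.compile("[ \t\u00a0]+")
# A space directly before common Perso-Arabic punctuation marks.
_SPACE_BEFORE_PUNCT = re.compile(" +([،۔؛؟])")


def clean_text(raw: str) -> str:
    """Apply Unicode normalization and light whitespace/punctuation cleanup."""
    text = unicodedata.normalize("NFC", raw)  # assumed form; the card does not say
    text = _TAGS.sub(" ", text)               # drop markup tags
    text = _STRAY.sub("", text)               # drop stray invisible characters
    text = _SPACES.sub(" ", text)             # collapse horizontal whitespace
    text = _SPACE_BEFORE_PUNCT.sub(r"\1", text)
    # Trim each line and discard lines that end up empty.
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Output written with `encoding="utf-8"` would then cover the first item in the list. A different normalization form, such as NFKC, would change compatibility characters as well, so the choice should be confirmed against the released files before reuse.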
Sample Text
رسالہ کھوارنامہ تان وختو نویوکووا مقبول شونین باشییاک اشٹوک رو
بابافتاح الدینو نامہ" نمبر"شائع کویان تودی ہتوباراکیاغ نیویشے ریکودونی ماضیہ بغاتام۔
افریقوتین بوغاوا اوا جہازو اوچے سمندری زندگیو بارا بو اشناریان سار خبر ہوتام۔
روبن سن کروسو ای جہازران موش اوشوئے۔
ای خطرناک طوفانار اچی ہو جہاز چھیوران وا ہورو سف ملگیری بریکو ہیس غیژی اوسنئیے ای ساحلہ تاروران۔
زباناں دے حوالے نال گندھارا ہندکو بورڈ پاکستان دا بیانیہ
تمام انسانان آزاد وا حقو اوچے عزتو لحاظا برابر پیدا بیتی اسونی،
وا ہیتانتین عقل اوچے ضمیر دیونو بیتی شیر ہیغین دیتی
ہیت تان موژی ایوال ایوالیو سوم ݰئیلی سلوک (بھائی چارگی) قائم کوریلیک۔