Khowar Literature Corpus by FLI

License:

CC-BY-NC-4.0

Steward:

Forum for Language Initiatives

Task: NLP

Release Date: 2/10/2026

Format: TXT

Size: 244.85 KB


Description

The Khowar Literature Corpus by FLI is a curated multi-genre textual dataset consisting of 12 UTF-8 encoded text files with a total of 108K tokens. It includes literary works, poetry, folklore narratives, magazine editions, translated books, articles, reports, and official documents, and supports corpus linguistics, low-resource NLP, digital humanities research, and the preservation of Khowar linguistic and cultural heritage.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is provided strictly for research, academic, educational, and non-commercial purposes. Proper citation and acknowledgment of the original authors, translators, publishers, and dataset curators are required.

Forbidden Usage

  • You agree not to attempt to identify any individuals referenced in the dataset.

  • The dataset must not be used to infer personal, private, or sensitive information.

  • It is forbidden to use this dataset to train chatbots or large language models intended for commercial deployment.

  • Any use that misrepresents cultural, religious, or social contexts is prohibited.

Processes

Ethical Review

All data included in this corpus originates from published literary, educational, and publicly distributed sources. Permissions were obtained from original authors or translators where applicable. The dataset does not contain private or confidential information, and ethical standards regarding consent, attribution, and responsible reuse have been fully observed.

Intended Use

This dataset is intended for linguistic research, corpus analysis, low-resource NLP tasks, language modeling, educational use, and cultural and literary documentation of the Khowar language.

Metadata

Language Information

Khowar (کھووار), also known as Chitrali, is an Indo-Aryan language of the Dardic group spoken primarily in the Chitral region of Pakistan. It serves as the lingua franca of Chitral and is also spoken in Gilgit-Baltistan, Upper Swat, and by migrant communities in major urban centers such as Islamabad, Karachi, Lahore, and Peshawar. Khowar is also used as a second language by the Kalash people.

Domains of the Text

  • Literature (creative writing)

  • Poetry (aesthetic and cultural expression)

  • Folklore and oral tradition (textual form)

  • Everyday social themes

  • Cultural knowledge and heritage

  • Magazine corpus (two editions)

  • Translated informational and legal texts

Script Information

The corpus is written in the Perso-Arabic Khowar script and includes language-specific characters unique to Khowar orthography.

Khowar Alphabet

چ ج ث ٹ ت پ ب ا
ڈ د ځ څ ݮ ݯ خ ح
ش س ݱ ژ ڑ ز ر ذ
ف غ ع ظ ط ض ص ݰ
ہ و ن م ل گ ک ق
ے ی ء

Khowar-specific characters:
ݰ ݱ ځ څ ݮ ݯ
(ݰاپیک، ݱانگ، ځوخ، څیق، ݮینݮیر، ݯونگو)
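As a rough illustration of how these letters might be located in corpus text, the sketch below counts occurrences of the six Khowar-specific characters in a string. The letters are copied from this card; the function name `count_khowar_specific` is hypothetical and is not part of the dataset.

```python
from collections import Counter

# The six letters this card lists as Khowar-specific, copied from the card.
KHOWAR_SPECIFIC = "ݰݱځڅݮݯ"


def count_khowar_specific(text: str) -> Counter:
    """Count occurrences of each Khowar-specific letter in `text`."""
    return Counter(ch for ch in text if ch in KHOWAR_SPECIFIC)


if __name__ == "__main__":
    # A line taken from the Sample Text section of this card.
    sample = "ہیت تان موژی ایوال ایوالیو سوم ݰئیلی سلوک"
    for ch, n in count_khowar_specific(sample).items():
        # Print the code point alongside the letter for easier inspection.
        print(f"U+{ord(ch):04X} {ch}: {n}")
```

A count like this could serve as a quick check that a file's Khowar-specific letters survived encoding conversion.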

Dataset Structure and Processing

  • Total files: 12 text files

  • Total size: 108,537 tokens

  • Encoding: UTF-8

  • Each file represents a distinct genre or domain

  • Each file name reflects the content of the file

  • A cleaned version, produced after Unicode normalization, is included
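The structure above (a folder of UTF-8 text files, one per genre) could be read along the following lines. This is a minimal sketch, not the curators' tooling: the directory path is a placeholder, the `.txt` extension is inferred from the TXT format listed on this card, and tokens are counted by whitespace splitting, which is an assumption because the card does not say how its token counts were produced.

```python
from pathlib import Path


def count_tokens(text: str) -> int:
    """Whitespace token count (an assumption; the card does not define 'token')."""
    return len(text.split())


def corpus_token_counts(corpus_dir: str) -> dict[str, int]:
    """Map each .txt file name in `corpus_dir` to its whitespace token count."""
    counts = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        # Decode explicitly as UTF-8, the encoding stated on this card.
        counts[path.name] = count_tokens(path.read_text(encoding="utf-8"))
    return counts
```

Calling `corpus_token_counts("path/to/corpus")` on a local copy would return per-file counts. These may differ from the figures on this card if a different tokenizer was used to produce them.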

File-Level Metadata

  1. Khowar Language Introduction – 1,519 tokens

  2. Khowar Nama Magazine (11th Edition, 2018) – 14,905 tokens

  3. Khowar Nama Magazine (12th Edition, 2019) – 16,500 tokens

  4. Nivishia Introduction – 236 tokens

  5. Qaso Kirdar – 166 tokens

  6. Khowar Masaan Naam – 53 tokens

  7. Robinson Crusoe (Translated) – 9,791 tokens

  8. First Bike Ride (Article) – 492 tokens

  9. Afsans (Book) – 26,556 tokens

  10. Uray (Book) – 32,258 tokens

  11. Fourth MTB MLE Conference Report (Translated) – 2,009 tokens

  12. Universal Declaration of Human Rights (Translated) – 4,052 tokens
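The per-file figures above can be checked against the stated corpus total with a few lines of arithmetic. The sketch below copies the counts from this list; the dictionary name `FILE_TOKENS` is only illustrative.

```python
# Token counts per file, copied from the File-Level Metadata list on this card.
FILE_TOKENS = {
    "Khowar Language Introduction": 1519,
    "Khowar Nama Magazine (11th Edition, 2018)": 14905,
    "Khowar Nama Magazine (12th Edition, 2019)": 16500,
    "Nivishia Introduction": 236,
    "Qaso Kirdar": 166,
    "Khowar Masaan Naam": 53,
    "Robinson Crusoe (Translated)": 9791,
    "First Bike Ride (Article)": 492,
    "Afsans (Book)": 26556,
    "Uray (Book)": 32258,
    "Fourth MTB MLE Conference Report (Translated)": 2009,
    "Universal Declaration of Human Rights (Translated)": 4052,
}

total = sum(FILE_TOKENS.values())
print(total)  # 108537, matching the total stated on this card

# Share of the corpus contributed by each file, largest first.
for name, n in sorted(FILE_TOKENS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{n / total:6.1%}  {name}")
```

The two books and the two magazine editions account for most of the tokens, which is worth keeping in mind when sampling across genres.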

Cleaning and Normalization

  • UTF-8 encoding

  • Unicode normalization

  • Whitespace and punctuation cleanup

  • Removal of stray symbols and markup
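The steps listed above could be approximated as follows. This is a hedged sketch rather than the curators' actual pipeline: the card does not name a normalization form (NFC is assumed here), and the rules for markup, control characters, and punctuation spacing are illustrative guesses. Format characters such as the zero-width non-joiner are deliberately left alone, since they can be meaningful in Perso-Arabic text.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")         # leftover HTML/XML-style markup
SPACE_RE = re.compile(r"[ \t\u00A0]+")  # runs of spaces, tabs, no-break spaces
# Space before Arabic full stop (U+06D4), comma (U+060C), question mark (U+061F).
SPACE_BEFORE_PUNCT_RE = re.compile(r" +([\u06D4\u060C\u061F])")


def clean_line(line: str) -> str:
    """Normalize one line: NFC (assumed), strip markup, tidy whitespace."""
    line = unicodedata.normalize("NFC", line)
    line = TAG_RE.sub(" ", line)
    # Replace control characters (Unicode category Cc) with a plain space.
    line = "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in line)
    line = SPACE_RE.sub(" ", line).strip()
    return SPACE_BEFORE_PUNCT_RE.sub(r"\1", line)


def clean_text(text: str) -> str:
    """Clean every line of `text` and drop lines that end up empty."""
    cleaned = (clean_line(line) for line in text.splitlines())
    return "\n".join(line for line in cleaned if line)
```

Comparing a raw file with its cleaned counterpart in the release would show which of these assumptions match the published cleaning.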

Sample Text

رسالہ کھوارنامہ تان وختو نویوکووا مقبول شونین باشییاک اشٹوک رو
بابافتاح الدینو نامہ" نمبر"شائع کویان تودی ہتوباراکیاغ نیویشے ریکودونی ماضیہ بغاتام۔
افریقوتین بوغاوا اوا جہازو اوچے سمندری زندگیو بارا بو اشناریان سار خبر ہوتام۔
روبن سن کروسو ای جہازران موش اوشوئے۔
ای خطرناک طوفانار اچی ہو جہاز چھیوران وا ہورو سف ملگیری بریکو ہیس غیژی اوسنئیے ای ساحلہ تاروران۔
زباناں دے حوالے نال گندھارا ہندکو بورڈ پاکستان دا بیانیہ
تمام انسانان آزاد وا حقو اوچے عزتو لحاظا برابر پیدا بیتی اسونی،
وا ہیتانتین عقل اوچے ضمیر دیونو بیتی شیر ہیغین دیتی
ہیت تان موژی ایوال ایوالیو سوم ݰئیلی سلوک (بھائی چارگی) قائم کوریلیک۔