Khowar Word List
License:
CC-BY-NC-4.0
Steward:
Forum for Language Initiatives
Task: NLP
Release Date: 2/10/2026
Format: TXT
Size: 64.22 KB
Description
The Khowar Word List and Alphabet Corpus is a curated lexical dataset of roughly 22K tokens (22,207 in total), organized into two UTF-8 encoded text files. It includes the full Khowar script (letters and language-specific characters) along with an extensive word list, and supports lexicography, morphological analysis, language documentation, and low-resource NLP tasks for Khowar.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
This dataset is provided for research, educational, and non-commercial purposes only. Users must provide proper attribution to the dataset creators and must not redistribute modified versions without acknowledgment.
Forbidden Usage
- Any attempt to associate words with identifiable individuals is prohibited.
- The dataset must not be used to infer personal, sensitive, or private information.
- Commercial exploitation without explicit permission is forbidden.
- Misrepresentation of linguistic or cultural content is not allowed.
Processes
Ethical Review
The dataset consists exclusively of lexical items and script data and does not contain personal, sensitive, or confidential information. It was curated from linguistic resources intended for public, educational, and research use. Ethical standards regarding cultural respect, transparency, and responsible reuse have been fully observed.
Intended Use
This dataset is intended for use in lexicographic research, dictionary development, morphological analysis, language learning tools, spell-checking systems, and low-resource NLP applications for the Khowar language.
Metadata
Language Information
Khowar (کھووار), also known as Chitrali, is an Indo-Aryan language of the Dardic group spoken primarily in the Chitral region of Pakistan. It serves as the lingua franca of Chitral and is also spoken in parts of Gilgit-Baltistan, Upper Swat, and among migrant communities in major Pakistani cities. Khowar is written in a Perso-Arabic–based script with several language-specific characters.
Domains of the Data
Lexicography
Vocabulary documentation
Alphabet and script reference
Morphological and orthographic analysis
Language learning resources
NLP preprocessing (tokenization, normalization)
Script Information
The dataset uses the Perso-Arabic Khowar script, including standard letters and Khowar-specific extended characters.
Khowar Alphabet
ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ی ے ء
Khowar-specific characters:
ݰ ݱ ځ څ ݮ ݯ
(ݰاپیک، ݱانگ، ځوخ، څیق، ݮینݮیر، ݯونگو)
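As a minimal sketch, the six Khowar-specific characters listed above can be used to flag words that rely on the extended script. The name `khowar_specific_chars` is hypothetical and is not part of the dataset; the characters are copied directly from the list above.

```python
# The six Khowar-specific characters, copied from the list above.
KHOWAR_SPECIFIC = set("ݰݱځڅݮݯ")


def khowar_specific_chars(word: str) -> list[str]:
    """Return the Khowar-specific characters found in `word`, in order of appearance."""
    return [ch for ch in word if ch in KHOWAR_SPECIFIC]


# Example words from the dataset description, plus one word from the sample text.
examples = ["ݰاپیک", "ݱانگ", "ځوخ", "څیق", "ݮینݮیر", "ݯونگو", "آئین"]
for word in examples:
    print(word, khowar_specific_chars(word))
```

A check like this may help when testing whether a font, keyboard layout, or tokenizer handles the extended characters, since they are less widely supported than the standard Perso-Arabic letters.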
Dataset Structure and Processing
Total files: 2
Total size: 22,207 tokens
Encoding: UTF-8
Each file represents a distinct lexical category
Cleaned and normalized for Unicode consistency
File-Level Metadata
Khowar Letters – 66 tokens – TXT
Khowar Words List – 22,141 tokens – TXT
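The two per-file counts add up to the stated total (66 + 22,141 = 22,207). The sketch below shows one way to load a file and compare its count against these figures. It assumes one token per line, as the sample text suggests, and the file names in the comments are hypothetical.

```python
from pathlib import Path


def load_tokens(path: str) -> list[str]:
    """Read a UTF-8 text file and return its non-empty lines as tokens."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Counts stated in the file-level metadata.
EXPECTED = {"letters": 66, "words": 22_141}
assert sum(EXPECTED.values()) == 22_207  # matches the stated total size

# Hypothetical file names; substitute the names used in your download.
# letters = load_tokens("khowar_letters.txt")
# words = load_tokens("khowar_words_list.txt")
# print(len(letters) == EXPECTED["letters"], len(words) == EXPECTED["words"])
```

If the loaded counts differ from the stated ones, the one-token-per-line assumption is the first thing to check.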
Cleaning and Normalization
UTF-8 encoding
Unicode normalization
Whitespace cleanup
Removal of stray symbols
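The steps listed above could be approximated as follows. This is a sketch, not the curators' actual pipeline: the source does not state which normalization form was used or which symbols count as stray, so NFC and "anything that is not a letter, combining mark, or whitespace" are assumptions. UTF-8 decoding is assumed to happen when the file is read.

```python
import unicodedata


def clean_token(raw: str) -> str:
    """Apply a sketch of the listed cleaning steps to a single token."""
    # Unicode normalization (NFC is an assumption; the form is not stated).
    text = unicodedata.normalize("NFC", raw)
    # Removal of stray symbols: keep letters (L*), combining marks (M*),
    # and whitespace. This definition of "stray" is an assumption.
    kept = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] in ("L", "M") or ch.isspace()
    )
    # Whitespace cleanup: collapse runs of whitespace and trim both ends.
    return " ".join(kept.split())


def clean_lines(lines):
    """Clean each line and drop the ones that end up empty."""
    cleaned = (clean_token(line) for line in lines)
    return [tok for tok in cleaned if tok]


print(clean_lines(["  سالو\t", "##", "آبادا*"]))
```

Note that this filter also drops format characters such as the zero-width non-joiner (U+200C); if the word list relies on it, that character would need to be allowed explicitly.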
Sample Text
سالو
آؤرے
آئندوتے
آئین
آئیندو
آئینو
آئیو
آئیی
آئییو
آئے
آبادا