Khowar Word List

License: CC-BY-NC-4.0

Steward: Forum for Language Initiatives

Task: NLP

Release Date: 2/10/2026

Format: TXT

Size: 64.22 KB


Description

The Khowar Word List and Alphabet Corpus is a curated lexical dataset of about 22K tokens (22,207 in total), organized into two UTF-8 encoded text files. It includes the full Khowar script (standard letters and language-specific characters) along with an extensive word list, and supports lexicography, morphological analysis, language documentation, and low-resource NLP tasks for Khowar.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is provided for research, educational, and non-commercial purposes only. Users must provide proper attribution to the dataset creators and must not redistribute modified versions without acknowledgment.

Forbidden Usage

  • Any attempt to associate words with identifiable individuals is prohibited.

  • The dataset must not be used to infer personal, sensitive, or private information.

  • Commercial exploitation without explicit permission is forbidden.

  • Misrepresentation of linguistic or cultural content is not allowed.

Processes

Ethical Review

The dataset consists exclusively of lexical items and script data and does not contain personal, sensitive, or confidential information. It was curated from linguistic resources intended for public, educational, and research use. Ethical standards regarding cultural respect, transparency, and responsible reuse have been fully observed.

Intended Use

This dataset is intended for use in lexicographic research, dictionary development, morphological analysis, language learning tools, spell-checking systems, and low-resource NLP applications for the Khowar language.
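As one illustration of the spell-checking use case, a word list can serve as a simple lookup lexicon. The sketch below is a minimal example, not a full spell-checker: the function name `find_unknown_words`, the punctuation set, and the whitespace-based tokenization are all assumptions made for illustration.

```python
# Punctuation stripped from token edges: Arabic comma, semicolon, question
# mark, and full stop, plus common ASCII marks. This set is an assumption.
PUNCT = "\u060C\u061B\u061F\u06D4.,!?:;\"'()"


def find_unknown_words(text, lexicon):
    """Return tokens of `text` that do not appear in `lexicon`.

    Tokens are split on whitespace and stripped of edge punctuation.
    """
    known = set(lexicon)
    unknown = []
    for raw in text.split():
        token = raw.strip(PUNCT)
        if token and token not in known:
            unknown.append(token)
    return unknown


if __name__ == "__main__":
    # Two entries copied from the sample text on this page.
    lexicon = ["آئین", "آئے"]
    print(find_unknown_words("آئین، xyz", lexicon))
```

A set lookup only flags exact mismatches; suggesting corrections would need an additional step such as edit-distance search, which is outside this sketch.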

Metadata

Language Information

Khowar (کھووار), also known as Chitrali, is an Indo-Aryan language of the Dardic group spoken primarily in the Chitral region of Pakistan. It serves as the lingua franca of Chitral and is also spoken in parts of Gilgit-Baltistan, Upper Swat, and among migrant communities in major Pakistani cities. Khowar is written in a Perso-Arabic–based script with several language-specific characters.

Domains of the Data

  • Lexicography

  • Vocabulary documentation

  • Alphabet and script reference

  • Morphological and orthographic analysis

  • Language learning resources

  • NLP preprocessing (tokenization, normalization)

Script Information

The dataset uses the Perso-Arabic Khowar script, including standard letters and Khowar-specific extended characters.

Khowar Alphabet

ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ی ے ء

Khowar-specific characters:
ݰ ݱ ځ څ ݮ ݯ
(ݰاپیک، ݱانگ، ځوخ، څیق، ݮینݮیر، ݯونگو)
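For tooling that needs to recognise these letters (for example, font coverage checks or a tokenizer alphabet), it can help to list their Unicode code points and names. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe_chars` is illustrative, and the characters are copied from the list above.

```python
import unicodedata

# Khowar-specific letters, copied from the list above (spaces are skipped).
KHOWAR_SPECIFIC = "ݰ ݱ ځ څ ݮ ݯ"


def describe_chars(chars):
    """Return (character, code point, Unicode name) for each non-space character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in chars
        if not ch.isspace()
    ]


if __name__ == "__main__":
    for ch, code_point, name in describe_chars(KHOWAR_SPECIFIC):
        print(code_point, name, ch)
```

The same helper can be run over the full alphabet line to check that every letter is a single code point.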

Dataset Structure and Processing

  • Total files: 2

  • Total size: 22,207 tokens

  • Encoding: UTF-8

  • Each file represents a distinct lexical category

  • Cleaned and normalized for Unicode consistency

File-Level Metadata

  1. Khowar Letters – 66 tokens – TXT

  2. Khowar Words List – 22,141 tokens – TXT
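The counts above can be checked after download with a short script. The sketch below assumes one token per line, as the sample text suggests, and uses hypothetical file names, since the actual file names are not given here.

```python
from pathlib import Path

# Hypothetical file names; substitute the names of the downloaded files.
LETTERS_FILE = Path("khowar_letters.txt")
WORDS_FILE = Path("khowar_words_list.txt")


def read_tokens(path):
    """Read a UTF-8 text file and return its non-empty lines as tokens.

    Assumes one token per line; adjust if the files are delimited differently.
    """
    # "utf-8-sig" decodes plain UTF-8 and also drops a leading BOM if present.
    text = Path(path).read_text(encoding="utf-8-sig")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    letters = read_tokens(LETTERS_FILE)
    words = read_tokens(WORDS_FILE)
    # This page lists 66 and 22,141 tokens, 22,207 in total.
    print(len(letters), len(words), len(letters) + len(words))
```

If the printed totals differ from those listed, the one-token-per-line assumption is the first thing to check.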

Cleaning and Normalization

  • UTF-8 encoding

  • Unicode normalization

  • Whitespace cleanup

  • Removal of stray symbols
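The exact pipeline used by the stewards is not documented here. The sketch below shows one common way such steps are implemented, under stated assumptions: NFC as the normalization form, and "stray symbols" interpreted narrowly as control characters and stray byte-order marks. The function name `clean_token` is illustrative.

```python
import unicodedata


def clean_token(token, form="NFC"):
    """Apply Unicode normalization, drop control characters, tidy whitespace."""
    token = unicodedata.normalize(form, token)
    kept = []
    for ch in token:
        if ch.isspace():
            kept.append(" ")  # unify tabs, newlines, no-break spaces, etc.
        elif unicodedata.category(ch) == "Cc" or ch == "\ufeff":
            continue  # drop control characters and stray BOMs
        else:
            kept.append(ch)  # ZWNJ (U+200C) is kept deliberately
    # Collapse runs of spaces and trim both ends.
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    # Alef followed by a combining maddah composes to a single letter under NFC.
    print(len(clean_token("\u0627\u0653")))
```

The zero-width non-joiner is left untouched because Perso-Arabic orthographies can use it meaningfully; a stricter definition of stray symbols would need to be checked against the data before it is applied.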

Sample Text

سالو
آؤرے
آئندوتے
آئین
آئیندو
آئینو
آئیو
آئیی
آئییو
آئے
آبادا