Kohistani Shina Word List

License:

CC-BY-NC-4.0

Steward:

Forum for Language Initiatives

Task: NLP

Release Date: 2/10/2026

Format: TXT

Size: 394.05 KB


Description

The Kohistani Shina Dictionary and Word List Corpus is a curated lexical dataset consisting of a single UTF-8 encoded text file of 154,769 tokens. It contains Kohistani Shina vocabulary entries with definitions, grammatical notes, and cross-linguistic references, and it primarily maps Kohistani Shina words to Urdu. Kohistani Shina is a Dardic Indo-Aryan language spoken in the Indus Kohistan region of Pakistan. The dataset supports lexicography, language documentation, dictionary building, morphological analysis, and low-resource NLP development.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

This dataset may be used for research, educational, and non-commercial purposes only. Proper attribution to the dataset creator and source is required. Redistribution of modified versions without acknowledgment is not permitted.

Forbidden Usage

  • Attempting to associate lexical data with identifiable individuals is prohibited.

  • The dataset must not be used to infer personal, private, or sensitive information.

  • Commercial or for-profit use without explicit permission is forbidden.

  • Misrepresentation or distortion of linguistic or cultural content is not allowed.

Processes

Ethical Review

The dataset was curated for academic and educational use with respect for linguistic and cultural integrity. Ethical standards of transparency, non-harm, and responsible data sharing were observed throughout.

Intended Use

This dataset is intended for lexicographic research, dictionary compilation, linguistic analysis, language learning resources, and NLP tasks such as tokenization, normalization, and morphological modeling for the Kohistani Shina language.

Metadata

Language Information

Kohistani Shina (ISO 639-3: plk) is a Dardic language within the Indo-Aryan family, spoken primarily in the Indus Kohistan region of Khyber Pakhtunkhwa, Pakistan, including Palas, Jalkot, and Seo. It differs from Gilgiti Shina in its phonology and is mutually intelligible with the Chilas variety. It is the first language of the Shin community and shows influence from Indus Kohistani, Urdu, and English.

Domains of the Data

  • Lexicography

  • Vocabulary and dictionary documentation

  • Alphabet and script reference

  • Morphological and orthographic analysis

  • Language learning resources

  • NLP preprocessing

  • Kohistani Shina–Urdu dictionary development

Script Information

The dataset is written primarily in an Arabic-based script, using standard Urdu/Persian letters along with Kohistani Shina–specific consonants. A dedicated Kohistani Shina qaida (primer) was developed in 2021 to standardize the orthography.

Kohistani Shina Alphabet

Standard Arabic / Persian / Urdu Base:
ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ء ی ے

Unique Kohistani Shina Consonants:

  • څ (Tsa) – Voiceless dental affricate

  • ڇ (Chha) – Voiceless retroflex affricate

  • ڑ (Retroflex Z) – Voiced retroflex fricative

  • ݜ (Retroflex S) – Voiceless retroflex fricative

  • ݨ (Retroflex N) – Retroflex nasal

  • ڦ (Unique Kohistani Ph)
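
As a quick check that a text actually uses the Kohistani Shina-specific letters listed above, the following minimal sketch counts their occurrences. The names `SHINA_SPECIFIC` and `count_specific_letters` are hypothetical, and ڑ is left out here because it also appears in the standard Urdu base set, so it cannot distinguish Shina text on its own.

```python
from collections import Counter
import unicodedata

# Letters copied from the list above (assumption: this set is enough
# to flag Kohistani Shina-specific orthography).
SHINA_SPECIFIC = "څڇݜݨڦ"

def count_specific_letters(text: str) -> Counter:
    """Count how often each Kohistani Shina-specific letter occurs in text."""
    return Counter(ch for ch in text if ch in SHINA_SPECIFIC)

# First line of the card's sample text.
sample = "اَبَات (ݜ۔ا۔مث۔ص) سُستی۔کاہلی۔کام چوری۔"

for ch, n in count_specific_letters(sample).items():
    # unicodedata.name() gives the official Unicode character name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {n}")
```

A count of zero over a large text may indicate that the script was transliterated or that the special letters were replaced during extraction.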

Dataset Structure and Processing

  • Total files: 1

  • Total size: 154,769 tokens

  • Encoding: UTF-8

  • File represents a standalone dictionary corpus

  • Cleaned and normalized for Unicode consistency

File-Level Metadata

  1. Kohistani Shina Dictionary (Short) – 154,769 tokens – TXT
    Author: Razwal Kohistani

Cleaning and Normalization

  • UTF-8 encoding

  • Unicode normalization

  • Whitespace and punctuation cleanup

  • Removal of stray symbols and markup
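
The cleaning steps above could look roughly like the sketch below. It is not the curators' actual pipeline: the card does not name a Unicode normalization form, so NFC is an assumption, and `clean_line` is a hypothetical helper. Zero-width non-joiners are deliberately kept, since they can be meaningful in Arabic-script text.

```python
import re
import unicodedata

def clean_line(line: str) -> str:
    """Apply NFC normalization and basic whitespace cleanup to one line."""
    # NFC is an assumption; the card only says "Unicode normalization".
    line = unicodedata.normalize("NFC", line)
    # Remove byte-order marks that sometimes survive file concatenation.
    line = line.replace("\ufeff", "")
    # Collapse runs of whitespace (tabs, non-breaking spaces, etc.) to one space.
    line = re.sub(r"\s+", " ", line)
    return line.strip()

print(clean_line("\ufeff  اَبَات \t گَر  "))
```

NFC puts combining marks such as Arabic vowel signs into canonical order, so visually identical headwords compare equal as strings.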

Sample Text

اَبَات (ݜ۔ا۔مث۔ص) سُستی۔کاہلی۔کام چوری۔
اَبَات گَر (ݜ۔ص) سُست۔ کاہل۔ بے ڈھنگ۔کام چور۔
اَبات گَروْ (ݜ۔مذ۔ص) دیکھیں اَبَات گَر۔
اَبَاتوْ (ݜ۔ا۔مذ۔ص) سُست۔ کاہل۔ بے ڈھنگ۔ بے ہنر۔{بروشسکی: اَبادو؛ کھوار: اَباتھَ}۔
اَبَاتیْ (ݜ۔ا۔مث۔ص) دیکھیں اباتوْ جس کی تانیث ہے۔
اُبَاش (ݜ۔ا۔مث۔و) اوجڑی۔ جانوروں کے معدے یا اوجڑی کا مواد۔ {بلوچی: واش؛ پشتو، کوہستانی: اُباش؛ توروالی: واش}۔
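
Judging only from the six sample lines above, each entry appears to follow the pattern: headword, grammatical tags in parentheses, Urdu glosses separated by the Urdu full stop, and an optional block of cross-language forms in braces. The sketch below parses that inferred pattern. The `Entry` class and `parse_entry` function are hypothetical, the tag abbreviations are kept as raw strings because the card does not define them, and entries elsewhere in the file may deviate from this layout.

```python
import re
from dataclasses import dataclass, field

FULL_STOP = "۔"   # Urdu full stop, separates tags and glosses in the sample
SEMICOLON = "؛"   # Arabic semicolon, separates cross-references in the sample

ENTRY_RE = re.compile(r"^(?P<head>[^(]+)\((?P<tags>[^)]*)\)(?P<body>.*)$")
REFS_RE = re.compile(r"\{([^}]*)\}")

@dataclass
class Entry:
    headword: str
    tags: list
    glosses: list
    cross_refs: dict = field(default_factory=dict)

def parse_entry(line: str):
    """Parse one dictionary line; return None if it lacks a '(tags)' group."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    body = m.group("body")
    refs = {}
    for block in REFS_RE.findall(body):
        for item in block.split(SEMICOLON):
            # Each item looks like "language(s): form"; the key is kept as written.
            langs, sep, form = item.partition(":")
            if sep:
                refs[langs.strip()] = form.strip()
    # Drop the brace blocks, then split what remains into glosses.
    body = REFS_RE.sub("", body)
    glosses = [g.strip() for g in body.split(FULL_STOP) if g.strip()]
    tags = [t.strip() for t in m.group("tags").split(FULL_STOP) if t.strip()]
    return Entry(m.group("head").strip(), tags, glosses, refs)

line = "اَبَاتوْ (ݜ۔ا۔مذ۔ص) سُست۔ کاہل۔ بے ڈھنگ۔ بے ہنر۔{بروشسکی: اَبادو؛ کھوار: اَباتھَ}۔"
entry = parse_entry(line)
if entry is not None:
    print(entry.headword, entry.tags, entry.glosses, entry.cross_refs)
```

Where a cross-reference names several languages at once, as in the last sample line, the whole language list becomes a single dictionary key. Splitting it further would require knowing the separator conventions used across the full file.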