Kohistani Shina Word List
License:
CC-BY-NC-4.0
Steward:
Forum for Language Initiatives
Task: NLP
Release Date: 2/10/2026
Format: TXT
Size: 394.05 KB
Description
The Kohistani Shina Dictionary and Word List Corpus is a curated lexical dataset consisting of a single UTF-8 encoded text file with 154K tokens. It contains Kohistani Shina vocabulary entries with definitions, grammatical notes, and cross-linguistic references, primarily mapping Kohistani Shina words to Urdu. Kohistani Shina is a Dardic Indo-Aryan language spoken in the Indus Kohistan region of Pakistan, and this dataset supports lexicography, language documentation, dictionary building, morphological analysis, and low-resource NLP development.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
This dataset may be used for research, educational, and non-commercial purposes only. Proper attribution to the dataset creator and source is required. Redistribution of modified versions without acknowledgment is not permitted.
Forbidden Usage
- Attempting to associate lexical data with identifiable individuals is prohibited.
- The dataset must not be used to infer personal, private, or sensitive information.
- Commercial or for-profit use without explicit permission is forbidden.
- Misrepresentation or distortion of linguistic or cultural content is not allowed.
Processes
Ethical Review
The dataset was curated for academic and educational use with respect for linguistic and cultural integrity. Ethical standards of transparency, non-harm, and responsible data sharing were observed throughout.
Intended Use
This dataset is intended for lexicographic research, dictionary compilation, linguistic analysis, language learning resources, and NLP tasks such as tokenization, normalization, and morphological modeling for the Kohistani Shina language.
Metadata
Language Information
Kohistani Shina (ISO 639-3: plk) is a Dardic language within the Indo-Aryan family, spoken primarily in the Indus Kohistan region of Khyber Pakhtunkhwa, Pakistan, including Palas, Jalkot, and Seo. It is distinct from Gilgiti Shina due to phonological differences and is mutually intelligible with the Chilas variety. The language serves as a first language for the Shin community and is influenced by Indus Kohistani, Urdu, and English.
Domains of the Data
Lexicography
Vocabulary and dictionary documentation
Alphabet and script reference
Morphological and orthographic analysis
Language learning resources
NLP preprocessing
Kohistani Shina–Urdu dictionary development
Script Information
The dataset is written primarily in an Arabic-based script, using standard Urdu/Persian letters along with Kohistani Shina–specific consonants. A dedicated Kohistani Shina qaida (primer) was developed in 2021 to standardize orthography.
Kohistani Shina Alphabets
Standard Arabic / Persian / Urdu Base:
ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ء ی ے
Unique Kohistani Shina Consonants:
څ (Tsa) – Voiceless dental affricate
ڇ (Chha) – Voiceless retroflex affricate
ڑ (Retroflex Z) – Voiced retroflex fricative
ݜ (Retroflex S) – Voiceless retroflex fricative
ݨ (Retroflex N) – Retroflex nasal
ڦ (Unique Kohistani Ph)
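For readers who want to check how these letters are encoded or how often they occur in the file, the following Python sketch may help. It is an illustration only: the names SHINA_LETTERS, describe, and count_letters are hypothetical, and ڑ is left out of the set because the card also lists it in the standard base alphabet.

```python
import unicodedata
from collections import Counter

# Letters the card lists as specific to Kohistani Shina.
# Assumption: ڑ is omitted because it also appears in the base alphabet.
SHINA_LETTERS = "څڇݜݨڦ"


def describe(chars):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "UNKNOWN")) for c in chars]


def count_letters(text, letters=SHINA_LETTERS):
    """Count how often each of the given letters occurs in text."""
    counts = Counter(ch for ch in text if ch in letters)
    return {ch: counts.get(ch, 0) for ch in letters}


if __name__ == "__main__":
    for code_point, name in describe(SHINA_LETTERS):
        print(code_point, name)
```

Printing the Unicode names is a quick way to confirm that a font or editor has not substituted a similar-looking character.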
Dataset Structure and Processing
Total files: 1
Total size: 154,769 tokens
Encoding: UTF-8
File represents a standalone dictionary corpus
Cleaned and normalized for Unicode consistency
File-Level Metadata
Kohistani Shina Dictionary (Short) – 154,769 tokens – TXT
Author: Razwal Kohistani
Cleaning and Normalization
UTF-8 encoding
Unicode normalization
Whitespace and punctuation cleanup
Removal of stray symbols and markup
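The card does not publish the cleaning script, so the following is only a minimal Python sketch of what steps like these commonly look like. The function name normalize_line, the choice of NFC as the normalization form, and the set of stripped characters are all assumptions, not the steward's actual pipeline.

```python
import re
import unicodedata

# Assumption: only the zero-width space and the byte-order mark are treated
# as stray symbols. The zero-width non-joiner (U+200C) is kept on purpose,
# because it can be meaningful in Arabic-script orthographies.
STRAY_CHARS = re.compile("[\u200b\ufeff]")
WHITESPACE_RUN = re.compile(r"\s+")


def normalize_line(line: str) -> str:
    """Normalize one line: NFC, drop stray characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", line)
    text = STRAY_CHARS.sub("", text)
    text = WHITESPACE_RUN.sub(" ", text)
    return text.strip()


if __name__ == "__main__":
    # Alef followed by a combining madda composes to a single code point.
    print(len(normalize_line("\u0627\u0653")))
```

NFC is chosen here because it keeps text compact and composes sequences such as alef plus combining madda; a project that needs decomposed diacritics for morphological work might prefer NFD instead.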
Sample Text
اَبَات (ݜ۔ا۔مث۔ص) سُستی۔کاہلی۔کام چوری۔
اَبَات گَر (ݜ۔ص) سُست۔ کاہل۔ بے ڈھنگ۔کام چور۔
اَبات گَروْ (ݜ۔مذ۔ص) دیکھیں اَبَات گَر۔
اَبَاتوْ (ݜ۔ا۔مذ۔ص) سُست۔ کاہل۔ بے ڈھنگ۔ بے ہنر۔{بروشسکی: اَبادو؛ کھوار: اَباتھَ}۔
اَبَاتیْ (ݜ۔ا۔مث۔ص) دیکھیں اباتوْ جس کی تانیث ہے۔
اُبَاش (ݜ۔ا۔مث۔و) اوجڑی۔ جانوروں کے معدے یا اوجڑی کا مواد۔ {بلوچی: واش؛ پشتو، کوہستانی: اُباش؛ توروالی: واش}۔
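The sample lines suggest a regular layout: a headword, a parenthesized group of abbreviations separated by the Arabic full stop (۔), Urdu glosses separated by the same mark, and an optional braced list of related forms in other languages separated by the Arabic semicolon (؛). The Python sketch below (3.9 or later) parses a line under that assumption. The layout is inferred from the six sample lines only, the names Entry and parse_entry are hypothetical, and the abbreviations are kept as opaque strings because the card does not define them.

```python
import re
from dataclasses import dataclass, field

FULL_STOP = "\u06d4"   # Arabic full stop, used between tags and glosses
SEMICOLON = "\u061b"   # Arabic semicolon, used between related forms

ENTRY_RE = re.compile(r"^(?P<head>[^(]+?)\s*\((?P<tags>[^)]*)\)\s*(?P<body>.*)$")
BRACED_RE = re.compile(r"\{([^}]*)\}")


@dataclass
class Entry:
    headword: str
    tags: list[str]
    glosses: list[str]
    related: dict[str, str] = field(default_factory=dict)


def parse_entry(line: str):
    """Parse one dictionary line; return None if it does not fit the layout."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    body = match.group("body")
    related = {}
    for block in BRACED_RE.findall(body):
        for pair in block.split(SEMICOLON):
            if ":" in pair:
                language, form = pair.split(":", 1)
                related[language.strip()] = form.strip()
    # Remove the braced blocks before splitting the glosses.
    body = BRACED_RE.sub("", body)
    glosses = [g.strip() for g in body.split(FULL_STOP) if g.strip()]
    tags = [t.strip() for t in match.group("tags").split(FULL_STOP) if t.strip()]
    return Entry(match.group("head").strip(), tags, glosses, related)


if __name__ == "__main__":
    sample = "اَبَاتوْ (ݜ۔ا۔مذ۔ص) سُست۔ کاہل۔ بے ڈھنگ۔ بے ہنر۔{بروشسکی: اَبادو؛ کھوار: اَباتھَ}۔"
    entry = parse_entry(sample)
    print(entry.headword, entry.tags, entry.glosses, entry.related)
```

Lines that do not match the pattern return None, so they can be collected and inspected by hand rather than silently mis-parsed. Cross-reference entries beginning with دیکھیں ("see") are returned as ordinary glosses in this sketch.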