Hussain Faizy Indus Kohistani Corpus
License: CC-BY-SA-4.0
Steward: Forum for Language Initiatives
Task: NLP
Release Date: 12/8/2025
Format: TXT
Size: 14.70 MB
Description
The Indus Kohistani corpus contains around 500k tokens of folktales, stories, poetry, biographies, and conversational texts, all transcribed with a consistent community orthography. Reviewed by native speakers, the corpus offers a representative snapshot of the language’s vocabulary and grammar for linguistic and computational research.
Specifics
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Considerations
Restrictions/Special Constraints
This corpus may not be used by any organization with an annual revenue exceeding 1 million USD.
Forbidden Usage
The corpus is strictly forbidden for use in creating synthetic text, generating hateful or harmful content, or developing tools that enable such outputs.
Processes
Intended Use
This corpus is intended to support linguistic research, documentation, and community-centered technology development for Indus Kohistani. It is designed to help create educational resources, improve language understanding, and advance inclusive AI tools that benefit the speaker community.
Metadata
Language
Indus Kohistani (mvy) is an Indo-Aryan language spoken in the upper Indus Valley of northern Pakistan, primarily in Kohistan district. It is used across several villages and valleys, showing noticeable variation in pronunciation and vocabulary between communities. The language is rich in oral traditions, with folktales, poetry, and storytelling serving as important cultural practices. Although widely spoken, it remains under-documented and has limited written materials, making it an important language for linguistic research and resource development.
Content of the Corpus
Folktales and traditional narratives
Oral histories and storytelling
Poetry and songs
Children's stories
Biographies and life narratives
Conversational dialogues
Descriptive and explanatory texts
Proverbs and short sayings
Religious literature
Processing
The processing will combine all plain text, PDF text, digits, and reference images into a clean, organized dataset. Text from PDFs will be extracted directly and standardized for Unicode, spacing, and orthography, while digits and symbols will be cleaned and formatted consistently. The images, which contain no text, will be stored as reference materials in a structured folder. The final output will include uniformly formatted UTF-8 text files and neatly organized reference images.
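The text standardization step described above could look roughly like the following Python sketch. It is an illustration only, not the steward's actual pipeline: the choice of NFC as the Unicode normalization form, the mapping of Arabic-script digits to ASCII, and the function name normalize_text are all assumptions that the description does not confirm.

```python
import re
import unicodedata

# Assumption: digits are unified to ASCII. The corpus may use a different target.
# Covers Arabic-Indic (U+0660-U+0669) and Extended Arabic-Indic (U+06F0-U+06F9).
DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}
DIGIT_MAP.update({0x06F0 + i: str(i) for i in range(10)})


def normalize_text(raw: str) -> str:
    """Hypothetical cleanup for text extracted from plain-text or PDF sources."""
    # Assumption: NFC is the target normalization form.
    text = unicodedata.normalize("NFC", raw)
    # Drop byte-order marks and turn non-breaking spaces into plain spaces.
    text = text.replace("\ufeff", "").replace("\u00a0", " ")
    # Format digits consistently.
    text = text.translate(DIGIT_MAP)
    # Collapse runs of spaces and tabs, and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(normalize_text("\ufeffمثال \u00a0 \u06f1\u0662\t متن  "))
```

Zero-width joiners and non-joiners are deliberately left untouched here, since they can be meaningful in Arabic-script orthographies.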
List of Alphabets
َ ُ ِ ّ ا ب پ ت ٹ ث چ ڇ څ ح خ د ڈ ذ ر ڑ ز ژ ڙ س ش ݜ ص ض ط ظ ع غ ف ق ک گ ل م ن ݨ و ہ ی
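As a sanity check after extraction, the character list above can be used to flag characters that fall outside the listed inventory. The Python sketch below is a hypothetical helper, not part of the corpus tooling; the names INVENTORY and unlisted_characters are illustrative, and punctuation, digits, and any letters or diacritics not in the list will all be reported.

```python
from collections import Counter

# Character inventory copied from the list above.
INVENTORY = "َ ُ ِ ّ ا ب پ ت ٹ ث چ ڇ څ ح خ د ڈ ذ ر ڑ ز ژ ڙ س ش ݜ ص ض ط ظ ع غ ف ق ک گ ل م ن ݨ و ہ ی"


def unlisted_characters(text: str, inventory: str = INVENTORY) -> Counter:
    """Count non-whitespace characters in text that are absent from inventory."""
    allowed = {ch for ch in inventory if not ch.isspace()}
    return Counter(
        ch for ch in text if not ch.isspace() and ch not in allowed
    )


if __name__ == "__main__":
    # Print each unlisted character with its code point and frequency.
    sample = "ا ب پ x"
    for ch, count in unlisted_characters(sample).most_common():
        print(f"U+{ord(ch):04X}", repr(ch), count)
```

Reporting code points alongside the characters makes it easier to tell visually similar Arabic-script letters apart.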
Sample Text
او ڙھا تیں بال لر نی تھی، نہ مہ زُنازُو تیں مُخالیۡفت کرم تُھو چے سِوَیں ژؤن٘دُناں مُختلف حالتیُوں مہ ادا کرَیں لاقَت لہ تَنہی نی ہُوئ تھی۔ دویُوں مِثال ݜے تھُو چے کماݜ لازمی کمہۡ (واجب) مُوڙ (چُن٘ڑ بول)، گُو (تھُل بول) یا
