Hussain Faizy Indus Kohistani Corpus

License:

CC-BY-SA-4.0

Steward:

Forum for Language Initiatives

Task: NLP

Release Date: 12/8/2025

Format: TXT

Size: 14.70 MB


Description

The Indus Kohistani corpus contains around 500k tokens of folktales, stories, poetry, biographies, and conversational texts, all transcribed with a consistent community orthography. Reviewed by native speakers, the corpus offers a representative snapshot of the language’s vocabulary and grammar for linguistic and computational research.

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Considerations

Restrictions/Special Constraints

This corpus may not be used by any organization with an annual revenue exceeding 1 million USD.

Forbidden Usage

The corpus must not be used to create synthetic text, to generate hateful or harmful content, or to develop tools that enable such outputs.

Processes

Intended Use

This corpus is intended to support linguistic research, documentation, and community-centered technology development for Indus Kohistani. It is designed to help create educational resources, improve language understanding, and advance inclusive AI tools that benefit the speaker community.

Metadata

Language

Indus Kohistani (mvy) is an Indo-Aryan language spoken in the upper Indus Valley of northern Pakistan, primarily in Kohistan district. It is used across several villages and valleys, showing noticeable variation in pronunciation and vocabulary between communities. The language is rich in oral traditions, with folktales, poetry, and storytelling serving as important cultural practices. Although widely spoken, it remains under-documented and has limited written materials, making it an important language for linguistic research and resource development.

Content of the Corpus

  • Folktales and traditional narratives

  • Oral histories and storytelling

  • Poetry and songs

  • Children's stories

  • Biographies and life narratives

  • Conversational dialogues

  • Descriptive and explanatory texts

  • Proverbs and short sayings

  • Religious literature

Processing

Processing will combine all plain text, PDF text, digits, and reference images into a clean, organized dataset. Text will be extracted directly from the PDFs and standardized for Unicode, spacing, and orthography, while digits and symbols will be cleaned and formatted consistently. The images, which contain no text, will be stored as reference materials in a structured folder. The final output will consist of uniformly formatted UTF-8 text files and neatly organized reference images.
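The Unicode and spacing step described above can be illustrated with a minimal Python sketch. It is not the stewards' actual pipeline: the function name `normalize_text`, the choice of NFC as the normalization form, and the whitespace rules are all assumptions made for illustration, and the orthography and digit handling are left out because the description does not specify them.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Normalize Unicode and whitespace in one extracted text (illustrative only)."""
    # Assumption: NFC; the corpus description does not name a normalization form.
    text = unicodedata.normalize("NFC", raw)
    # Treat non-breaking spaces like ordinary spaces.
    text = text.replace("\u00a0", " ")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line, then keep at most one blank line in a row.
    kept = []
    for line in (part.strip() for part in text.splitlines()):
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()
```

A cleaned string can then be written with `open(path, "w", encoding="utf-8")` to obtain the UTF-8 text files mentioned above.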

Alphabet

َ ُ ِ ّ ا ب پ ت ٹ ث چ ڇ څ ح خ د ڈ ذ ر ڑ ز ژ ڙ س ش ݜ ص ض ط ظ ع غ ف ق ک گ ل م ن ݨ و ہ ی
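One way to use this list is to flag characters in a text that fall outside it, for example when checking new transcriptions. The sketch below is an assumption-laden illustration: the names `LISTED` and `unlisted_chars` are hypothetical, and the list is treated as given even though it may not be exhaustive, so punctuation, digits, and any additional marks will be reported as unlisted.

```python
# Inventory copied from the list above; whether it is exhaustive is an assumption.
LISTED = set(
    "َ ُ ِ ّ ا ب پ ت ٹ ث چ ڇ څ ح خ د ڈ ذ ر ڑ ز ژ ڙ س ش ݜ ص ض ط ظ ع غ ف ق ک گ ل م ن ݨ و ہ ی".split()
)


def unlisted_chars(text: str) -> set:
    """Return the characters in `text` that are neither whitespace nor in LISTED."""
    return {ch for ch in text if not ch.isspace() and ch not in LISTED}
```

Running `unlisted_chars` over a file's contents gives a quick inventory of characters to review by hand; it does not decide whether those characters are errors.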

Sample Text

او ڙھا تیں بال لر نی تھی، نہ مہ زُنازُو تیں مُخالیۡفت کرم تُھو چے سِوَیں ژؤن٘دُناں مُختلف حالتیُوں مہ ادا کرَیں لاقَت لہ تَنہی نی ہُوئ تھی۔ دویُوں مِثال ݜے تھُو چے کماݜ لازمی کمہۡ (واجب) مُوڙ (چُن٘ڑ بول)، گُو (تھُل بول) یا