Elkhani Hazargi Literature Corpus
License:
CC-BY-NC-4.0
Steward:
Keblagh e Azergi
Task: NLP
Release Date: 3/5/2026
Format: TXT
Size: 2.46 MB
Description
The Hazargi Literature Corpus (Keblagh e Azergi) is a monolingual literary dataset for documenting Hazargi (Hazaragi), an eastern Persian (Dari) dialect spoken by Hazara communities in Afghanistan and the diaspora, and for supporting computational research on it. It contains 12 digitized works (prose, poetry, folklore, drama) converted from Word documents into UTF-8 normalized plain text, with original orthography and dialectal features preserved. Total size: 513,483 tokens (~0.5M).
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
- Attribution to corpus creators required.
- Redistribution must preserve original linguistic integrity.
- Commercial redistribution may require additional permission.
- Proper academic citation mandatory.
Forbidden Usage
The dataset must NOT be used for:
- Hate speech targeting Hazara communities
- Ethnic profiling or surveillance
- Political propaganda or disinformation
- Biometric or identity tracking systems
- Harmful manipulation technologies
Processes
Ethical Review
- Texts sourced from publicly available literary materials.
- No private, biometric, or sensitive personal data included.
- Reviewed to avoid confidential or unpublished material.
- Cultural sensitivity maintained during digitization and normalization.
Intended Use
This corpus is intended for:
- Academic linguistic research
- Dialectology studies
- Historical Persian analysis
- NLP model training
- Corpus linguistics
- Cultural preservation
- Educational purposes
Metadata
Language Information
Language
Hazargi (Dari Persian dialect group)
Language Family
Indo-European → Indo-Iranian → Iranian → Western Iranian → Persian (Dari variety)
Geographic Distribution
Primarily spoken in:
Afghanistan
Pakistan (Quetta)
Iran (Mashhad and other regions)
Uzbekistan
Tajikistan
Europe
Australia
The Americas
Iran had an estimated 399,000 Hazaragi-Dari speakers as of 2021.
Script Information
Hazargi Script (Modified Perso-Arabic)
آ ٬ ا ٬ ب ٬ پ ٬ ت ٬ ݖ ٬ ج ٬ چ ٬ خ ٬ د ٬ ۮ ٬ ر ٬ ز ٬ ژ٬ س ٬ ش ٬ غ٬ ٬ ف ٬ ق ٬ ک ٬ گ ٬ ل ٬ م ٬ ن ٬ و ٬ ۉ٬ ۆ٬ ی٬ ې ٬ ݷ ٬ ئ ٬ ۂ
Script direction: Right-to-left
Orthography: Perso-Arabic with Hazargi-specific letters
Domains of the Text
Literature (Creative writing)
Poetry (Aesthetic / cultural expression)
Folklore & Oral Tradition
Everyday Social Themes
Cultural Knowledge & Heritage
Dataset Structure & Processing
Dataset Structure
Total files: 12
Each file is named after the work it contains
Each file is treated as a separate genre/domain container
Original format: Microsoft Word Documents
Cleaned format: UTF-8 normalized plain text (.txt)
File-Level Metadata
01-قیسسای فۉلکولۉری آزرگی _Qissai folkloric-e-Azergi by Jibran - 50473.txt
02-قئسته_Qasta (Poetry) - 22196.txt
03-بئختی بیدار_Bakht-e- Bedar - folk - 74662.txt
04-جامی تیلا_Jam-e-tilla by Farid - 30416.txt
05-فلم نامه ناهید_Film Nama by Latif. H - 6538.txt
06-خاشۂ سۆدیگئر_Khasha sawdigar by Salim Taban - 50798.txt
07-کوی کی نفس میکشید_Koi-e- ki nafas miksha by M. Haidari-16491.txt
08-نئقلای چوقنی_Naqlai chuqnai (folk) - 80655.txt
09-وامی جئنگ_Wam-e Jang by Bashir Ghulami - 39859.txt
10-قئستای دیل مئندیغو_Qastai dil mandighu by M. Elkhani - 20966.txt
11-خوبی کو دۂ جولگه بیندئز_Khubi ku da julga bendaz by Ali Raza - 62572.txt
12-نئقشی جادویی_Naqshey Jaduyi by Aziz Azra - 57857.txt
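The number at the end of each file name appears to be that file's token count: the twelve values add up to 513,483, the total reported for the corpus. The sketch below reads those counts from the names, assuming every name ends in `<count>.txt` preceded by a hyphen. The helper name `token_count_from_name` is hypothetical and not part of the release.

```python
import re

# Assumed pattern: a hyphen, optional spaces, digits, then ".txt" at the end.
_COUNT_RE = re.compile(r"-\s*(\d+)\.txt$")


def token_count_from_name(file_name: str) -> int:
    """Return the trailing integer in a corpus file name (hypothetical helper)."""
    match = _COUNT_RE.search(file_name)
    if match is None:
        raise ValueError(f"no trailing count in file name: {file_name!r}")
    return int(match.group(1))


# File names copied from the File-Level Metadata list above.
FILE_NAMES = [
    "01-قیسسای فۉلکولۉری آزرگی _Qissai folkloric-e-Azergi by Jibran - 50473.txt",
    "02-قئسته_Qasta (Poetry) - 22196.txt",
    "03-بئختی بیدار_Bakht-e- Bedar - folk - 74662.txt",
    "04-جامی تیلا_Jam-e-tilla by Farid - 30416.txt",
    "05-فلم نامه ناهید_Film Nama by Latif. H - 6538.txt",
    "06-خاشۂ سۆدیگئر_Khasha sawdigar by Salim Taban - 50798.txt",
    "07-کوی کی نفس میکشید_Koi-e- ki nafas miksha by M. Haidari-16491.txt",
    "08-نئقلای چوقنی_Naqlai chuqnai (folk) - 80655.txt",
    "09-وامی جئنگ_Wam-e Jang by Bashir Ghulami - 39859.txt",
    "10-قئستای دیل مئندیغو_Qastai dil mandighu by M. Elkhani - 20966.txt",
    "11-خوبی کو دۂ جولگه بیندئز_Khubi ku da julga bendaz by Ali Raza - 62572.txt",
    "12-نئقشی جادویی_Naqshey Jaduyi by Aziz Azra - 57857.txt",
]

per_file = {name: token_count_from_name(name) for name in FILE_NAMES}
total = sum(per_file.values())
print(total)  # 513483
```

Comparing this sum with the published total is a quick way to confirm that all twelve files were downloaded under their original names.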
Cleaning (Clean Layer)
UTF-8 encoding
Unicode normalization
White-space normalization
Punctuation cleanup
Removal of stray symbols and markup
Preservation of original dialectal spellings
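The cleaning steps listed above can be approximated with the Python standard library. This is a minimal sketch, not the stewards' actual pipeline. The function name `clean_text` is hypothetical, and the choice of NFC is an assumption, since the normalization form is not stated. Punctuation cleanup and markup removal are left out because their rules are not documented.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Approximate the documented clean layer (hypothetical helper)."""
    # Unicode normalization. NFC is an assumption; it composes base letters
    # with combining marks and does not apply compatibility mappings.
    text = unicodedata.normalize("NFC", raw)

    # Drop a byte-order mark left over from the export.
    text = text.replace("\ufeff", "")

    # Unify line endings before working line by line.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # White-space normalization: collapse runs of spaces, tabs, and
    # no-break spaces. The zero-width non-joiner (U+200C) is not touched,
    # because it is meaningful in Perso-Arabic orthography.
    lines = [
        re.sub(r"[ \t\u00a0]+", " ", line).strip(" ")
        for line in text.split("\n")
    ]
    text = "\n".join(lines)

    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip("\n")
```

The function only normalizes encoding and spacing. It never rewrites letters, which matches the stated goal of preserving the original dialectal spellings.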
Additional Information
Resource Type: Monolingual Literary Corpus
Genre Coverage: Poetry, Folklore, Narrative, Drama
Dialect Coverage: Hazargi (Dari Persian variety)
Tokenization Unit: Word-level tokens
Total Files: 12
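The tokenizer behind the word-level counts is not described. The sketch below assumes tokens are whitespace-delimited and counts them per file, so its results may differ from the published figures if another tokenizer was used. The function names and the directory argument are hypothetical.

```python
from pathlib import Path


def count_tokens(text: str) -> int:
    """Count whitespace-delimited tokens (assumed tokenization)."""
    return len(text.split())


def corpus_token_counts(directory: str) -> dict[str, int]:
    """Map each .txt file name in `directory` to its token count."""
    counts = {}
    for path in sorted(Path(directory).glob("*.txt")):
        counts[path.name] = count_tokens(path.read_text(encoding="utf-8"))
    return counts
```

Calling `sum(corpus_token_counts("path/to/corpus").values())` on a local copy gives a total to compare with the reported 513,483. The path here is a placeholder.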
Sample Text
زئوار اۆقات شی تئلخ خاکی مالی رئسید دۂ پېشی امزو اوشیار ۔ اوشار تا سون زئوار تۉخ کئد فامید کی چیز تۉرۂ یۂ۔
دۂ آخیر خۉشبئختانۂ بݷنی ازیا اېد جئنگ نئشود و اردو تئرئف سولخ کئد۔
پیرمرد کنار دیوار ایستاده بود و سر تکان داد: «کوه همیشه میبیند، حتی وقتی ما نمیبینیم.» این جمله، با صداهای محیط ترکیب شد و حس حضور یک قاضی نامرئی را در ذهن مردم تداعی کرد.
چی خانۂ کی از قئسری شایی یئگۉ رئقئم کئمی نئدئشت۔ امو رئقئم دئب و دیستگا امو نوربئندی و آراییش۔ او قئسر رۂ پݷوئند غئریب خانۂ موگوفت بادشا کی دۂ اونجی پای خو اېشت بېخی اوش از سئر شی رافت۔ مېنی قئسر آر خېل او رئقئم مالوم موشود کی بوگی جای جئگې شایی بئشۂ۔
پیشی کی بېخی بئچے وئزیر پئسکی اماد گوفت دوختئر موگیۂ قئددی از مۂ قئد بادشا 4 رۉز مئندۂ کی پورۂ شونۂ دۂ اېنمزی چار رۉز بئلدې خئلاس کیدۉنی از مۂ ار کار کی میتنید کئنید اېنمی چار رۉر کی تېر شونۂ دیگۂ کار از کئردۉ تېر موشۂ.
Total Token Count
513,483 tokens (~0.5M)