Elkhani Hazargi Literature Corpus

License:

CC-BY-NC-4.0

Steward:

Keblagh e Azergi

Task: NLP

Release Date: 3/5/2026

Format: TXT

Size: 2.46 MB

Description

The Hazargi Literature Corpus (Keblagh e Azergi) is a monolingual literary dataset for documenting and supporting computational research on Hazargi (Hazaragi), an eastern Persian (Dari) dialect spoken by Hazara communities in Afghanistan and the diaspora. It contains 12 digitized works (prose, poetry, folklore, drama) converted from Word into UTF-8 normalized plain text while preserving original orthography and dialectal features. Total size: ~0.5M tokens (513,483).

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

- Attribution to corpus creators required.
- Redistribution must preserve original linguistic integrity.
- Commercial redistribution may require additional permission.
- Proper academic citation mandatory.

Forbidden Usage

The dataset must NOT be used for:
- Hate speech targeting Hazara communities
- Ethnic profiling or surveillance
- Political propaganda or disinformation
- Biometric or identity tracking systems
- Harmful manipulation technologies

Processes

Ethical Review

- Texts sourced from publicly available literary materials.
- No private, biometric, or sensitive personal data included.
- Reviewed to avoid confidential or unpublished material.
- Cultural sensitivity maintained during digitization and normalization.

Intended Use

This corpus is intended for:
- Academic linguistic research
- Dialectology studies
- Historical Persian analysis
- NLP model training
- Corpus linguistics
- Cultural preservation
- Educational purposes

Metadata

Language Information

Language

Hazargi (Dari Persian dialect group)

Language Family

Indo-European → Indo-Iranian → Iranian → Western Iranian → Persian (Dari variety)

Geographic Distribution

Primarily spoken in:

Afghanistan
Pakistan (Quetta)
Iran (Mashhad and other regions)
Uzbekistan
Tajikistan
Europe
Australia
The Americas

Iran has an estimated 399,000 Hazaragi-Dari speakers (2021 figure).

Script Information

Hazargi Script (Modified Perso-Arabic)

آ ٬ ا ٬ ب ٬ پ ٬ ت ٬ ݖ ٬ ج ٬ چ ٬ خ ٬ د ٬ ۮ ٬ ر ٬ ز ٬ ژ٬ س ٬ ش ٬ غ٬ ٬ ف ٬ ق ٬ ک ٬ گ ٬ ل ٬ م ٬ ن ٬ و ٬ ۉ٬ ۆ٬ ی٬ ې ٬ ݷ ٬ ئ ٬ ۂ

Script direction: Right-to-left
Orthography: Perso-Arabic with Hazargi-specific letters

Domains of the Text

Literature (Creative writing)
Poetry (Aesthetic / cultural expression)
Folklore & Oral Tradition
Everyday Social Themes
Cultural Knowledge & Heritage

Dataset Structure & Processing

Dataset Structure

Total files: 12
Each file name matches its content
Each file is treated as a separate genre/domain container
Original format: Microsoft Word Documents
Cleaned format: UTF-8 normalized plain text (.txt)

File-Level Metadata

01-قیسسای فۉلکولۉری آزرگی _Qissai folkloric-e-Azergi by Jibran - 50473.txt
02-قئسته_Qasta (Poetry) - 22196.txt
03-بئختی بیدار_Bakht-e- Bedar - folk - 74662.txt
04-جامی تیلا_Jam-e-tilla by Farid - 30416.txt
05-فلم نامه ناهید_Film Nama by Latif. H - 6538.txt
06-خاشۂ سۆدیگئر_Khasha sawdigar by Salim Taban - 50798.txt
07-کوی کی نفس میکشید_Koi-e- ki nafas miksha by M. Haidari-16491.txt
08-نئقلای چوقنی_Naqlai chuqnai (folk) - 80655.txt
09-وامی جئنگ_Wam-e Jang by Bashir Ghulami - 39859.txt
10-قئستای دیل مئندیغو_Qastai dil mandighu by M. Elkhani - 20966.txt
11-خوبی کو دۂ جولگه بیندئز_Khubi ku da julga bendaz by Ali Raza - 62572.txt
12-نئقشی جادویی_Naqshey Jaduyi by Aziz Azra - 57857.txt
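
Each file name in the listing above ends in an integer, and those integers sum to 513,483, the corpus total, which suggests they are per-file token counts. The sketch below parses the names under that reading. The regular expression and the helper name `parse_name` are assumptions made for illustration, not part of the corpus tooling.

```python
import re

# File names copied from the listing above.
FILE_NAMES = [
    "01-قیسسای فۉلکولۉری آزرگی _Qissai folkloric-e-Azergi by Jibran - 50473.txt",
    "02-قئسته_Qasta (Poetry) - 22196.txt",
    "03-بئختی بیدار_Bakht-e- Bedar - folk - 74662.txt",
    "04-جامی تیلا_Jam-e-tilla by Farid - 30416.txt",
    "05-فلم نامه ناهید_Film Nama by Latif. H - 6538.txt",
    "06-خاشۂ سۆدیگئر_Khasha sawdigar by Salim Taban - 50798.txt",
    "07-کوی کی نفس میکشید_Koi-e- ki nafas miksha by M. Haidari-16491.txt",
    "08-نئقلای چوقنی_Naqlai chuqnai (folk) - 80655.txt",
    "09-وامی جئنگ_Wam-e Jang by Bashir Ghulami - 39859.txt",
    "10-قئستای دیل مئندیغو_Qastai dil mandighu by M. Elkhani - 20966.txt",
    "11-خوبی کو دۂ جولگه بیندئز_Khubi ku da julga bendaz by Ali Raza - 62572.txt",
    "12-نئقشی جادویی_Naqshey Jaduyi by Aziz Azra - 57857.txt",
]

# Assumed layout: two-digit index, free-form title, then "-<integer>.txt".
NAME_PATTERN = re.compile(
    r"^(?P<index>\d{2})-(?P<title>.+?)\s*-\s*(?P<count>\d+)\.txt$"
)


def parse_name(name):
    """Split a file name into (index, title, trailing integer)."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name layout: {name!r}")
    return int(match.group("index")), match.group("title"), int(match.group("count"))


parsed = [parse_name(name) for name in FILE_NAMES]
total = sum(count for _, _, count in parsed)
print(len(parsed), total)  # 12 513483
```

Comparing `total` against the published figure is a cheap integrity check after downloading or renaming files.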

Cleaning (Clean Layer)

UTF-8 encoding
Unicode normalization
White-space normalization
Punctuation cleanup
Removal of stray symbols and markup
Preservation of original dialectal spellings
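
The cleaning steps listed above could be approximated as follows. This is a minimal sketch, not the pipeline actually used for the corpus: the card does not name a normalization form, so NFC is an assumption (chosen because it leaves letters intact, unlike the compatibility forms), and the function name `clean_text` and the choice to keep the zero-width non-joiner are likewise illustrative. Punctuation cleanup is omitted because the card does not describe its rules.

```python
import re
import unicodedata

# Characters kept even though they are control/format characters:
# newline, tab, and the zero-width non-joiner used in Persian orthography.
KEEP = {"\n", "\t", "\u200c"}


def clean_text(raw: str) -> str:
    # Assumption: NFC; the card only says "Unicode normalization".
    text = unicodedata.normalize("NFC", raw)
    # Drop stray control (Cc) and format (Cf) characters such as
    # carriage returns, byte-order marks, and bidi marks.
    text = "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Collapse runs of spaces, tabs, and no-break spaces into one space.
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    # Trim each line, then collapse three or more newlines into two.
    text = "\n".join(line.strip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

No spelling is rewritten at any step, in keeping with the requirement to preserve original dialectal spellings.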

Additional Information

Resource Type: Monolingual Literary Corpus
Genre Coverage: Poetry, Folklore, Narrative, Drama
Dialect Coverage: Hazargi (Dari Persian variety)
Tokenization Unit: Word-level tokens
Total Files: 12

Sample Text

زئوار اۆقات شی تئلخ خاکی مالی رئسید دۂ پېشی امزو اوشیار ۔ اوشار تا سون زئوار تۉخ کئد فامید کی چیز تۉرۂ یۂ۔
دۂ آخیر خۉشبئختانۂ بݷنی ازیا اېد جئنگ نئشود و اردو تئرئف سولخ کئد۔
پیرمرد کنار دیوار ایستاده بود و سر تکان داد: «کوه همیشه می‌بیند، حتی وقتی ما نمی‌بینیم.» این جمله، با صداهای محیط ترکیب شد و حس حضور یک قاضی نامرئی را در ذهن مردم تداعی کرد.
چی خانۂ کی از قئسری شایی یئگۉ رئقئم کئمی نئدئشت۔ امو رئقئم دئب و دیستگا امو نوربئندی و آراییش۔ او قئسر رۂ پݷوئند غئریب خانۂ موگوفت بادشا کی دۂ اونجی پای خو اېشت بېخی اوش از سئر شی رافت۔ مېنی قئسر آر خېل او رئقئم مالوم موشود کی بوگی جای جئگې شایی بئشۂ۔
پیشی کی بېخی بئچے وئزیر پئسکی اماد گوفت دوختئر موگیۂ قئددی از مۂ قئد بادشا 4 رۉز مئندۂ کی پورۂ شونۂ دۂ اېنمزی چار رۉز بئلدې خئلاس کیدۉنی از مۂ ار کار کی میتنید کئنید اېنمی چار رۉر کی تېر شونۂ دیگۂ کار از کئردۉ تېر موشۂ.

Total Token Count

~0.5M tokens (513,483 tokens)
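
To recount word-level tokens on a local copy, something like the following could be used. It is a sketch under stated assumptions: the card does not say which tokenizer produced the published figure, so splitting on whitespace is a stand-in that may not reproduce 513,483 exactly, and `count_tokens`, `corpus_token_counts`, and the `corpus_dir` argument are hypothetical names.

```python
from pathlib import Path


def count_tokens(text: str) -> int:
    # Assumption: a token is any maximal run of non-whitespace characters.
    return len(text.split())


def corpus_token_counts(corpus_dir: str) -> dict[str, int]:
    """Map each .txt file name in corpus_dir to its whitespace token count."""
    counts = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        counts[path.name] = count_tokens(path.read_text(encoding="utf-8"))
    return counts
```

Summing the values of the returned dictionary gives a corpus-level count to compare with the figure above.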