Elkhani Hazargi Literature Corpus

License:

CC-BY-NC-4.0

Steward:

Keblagh e Azergi

Task: NLP

Release Date: 3/5/2026

Format: TXT

Size: 2.46 MB


Description

The Hazargi Literature Corpus (Keblagh e Azergi) is a monolingual literary dataset intended to document Hazargi (Hazaragi) and to support computational research on it. Hazargi is an eastern Persian (Dari) dialect spoken by Hazara communities in Afghanistan and the diaspora. The corpus contains 12 digitized works (prose, poetry, folklore, drama) converted from Microsoft Word documents into UTF-8 normalized plain text, with original orthography and dialectal features preserved. Total size: 513,483 tokens (~0.5M).

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

  • Attribution to corpus creators required.

  • Redistribution must preserve original linguistic integrity.

  • Commercial redistribution may require additional permission.

  • Proper academic citation mandatory.

Forbidden Usage

The dataset must NOT be used for:

  • Hate speech targeting Hazara communities

  • Ethnic profiling or surveillance

  • Political propaganda or disinformation

  • Biometric or identity tracking systems

  • Harmful manipulation technologies

Processes

Ethical Review

  • Texts sourced from publicly available literary materials.

  • No private, biometric, or sensitive personal data included.

  • Reviewed to avoid confidential or unpublished material.

  • Cultural sensitivity maintained during digitization and normalization.

Intended Use

This corpus is intended for:

  • Academic linguistic research

  • Dialectology studies

  • Historical Persian analysis

  • NLP model training

  • Corpus linguistics

  • Cultural preservation

  • Educational purposes

Metadata

Language Information

Language

Hazargi (Dari Persian dialect group)

Language Family

Indo-European → Indo-Iranian → Iranian → Western Iranian → Persian (Dari variety)

Geographic Distribution

Primarily spoken in:

  • Afghanistan

  • Pakistan (Quetta)

  • Iran (Mashhad and other regions)

  • Uzbekistan

  • Tajikistan

  • Europe

  • Australia

  • The Americas

An estimated 399,000 Hazaragi-Dari speakers live in Iran (2021 figure).

Script Information

Hazargi Script (Modified Perso-Arabic)

آ ٬ ا ٬ ب ٬ پ ٬ ت ٬ ݖ ٬ ج ٬ چ ٬ خ ٬ د ٬ ۮ ٬ ر ٬ ز ٬ ژ٬ س ٬ ش ٬ غ٬ ٬ ف ٬ ق ٬ ک ٬ گ ٬ ل ٬ م ٬ ن ٬ و ٬ ۉ٬ ۆ٬ ی٬ ې ٬ ݷ ٬ ئ ٬ ۂ

  • Script direction: Right-to-left

  • Orthography: Perso-Arabic with Hazargi-specific letters
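
A rough way to see which letters in a text fall outside the alphabet listed above is a set difference over Unicode letters. The following is a minimal sketch, not part of the corpus tooling: the `HAZARGI_ALPHABET` constant is copied from the list above with the separators removed, the function name is hypothetical, and the use of NFC normalization is an assumption, since this card does not name a normalization form.

```python
import unicodedata

# Letters copied from the alphabet listed in this card (separators removed).
# NFC normalization is an assumption; the card does not name a form.
HAZARGI_ALPHABET = set(
    unicodedata.normalize(
        "NFC",
        "آ ا ب پ ت ݖ ج چ خ د ۮ ر ز ژ س ش غ ف ق ک گ ل م ن و ۉ ۆ ی ې ݷ ئ ۂ",
    ).split()
)


def letters_outside_alphabet(text: str, alphabet: set = HAZARGI_ALPHABET) -> set:
    """Return the letters in `text` that are not in `alphabet`."""
    normalized = unicodedata.normalize("NFC", text)
    return {
        ch
        for ch in normalized
        # General categories starting with "L" are letters; digits,
        # punctuation, spaces, and combining marks are skipped.
        if unicodedata.category(ch).startswith("L") and ch not in alphabet
    }
```

Letters reported by such a check are not necessarily errors: the listed alphabet may be incomplete, and literary texts can contain loanword spellings that use other Perso-Arabic letters.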

Domains of the Text

  • Literature (Creative writing)

  • Poetry (Aesthetic / cultural expression)

  • Folklore & Oral Tradition

  • Everyday Social Themes

  • Cultural Knowledge & Heritage

Dataset Structure & Processing

Dataset Structure

  • Total files: 12

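  • Each file name describes its content: a sequence number, the Hazargi title, a romanized title, the author or genre where given, and a trailing number. The twelve trailing numbers sum to 513,483, the corpus token total, so they appear to be per-file token counts.

  • Each file is treated as a separate genre/domain container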

  • Original format: Microsoft Word Documents

  • Cleaned format: UTF-8 normalized plain text (.txt)
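
Because the release is a flat set of UTF-8 text files, loading it needs only the standard library. This is a minimal sketch that assumes the twelve files have been downloaded into one local directory; the function name and the directory passed to it are hypothetical.

```python
from pathlib import Path


def load_corpus(root: Path) -> dict:
    """Map each .txt file name under `root` to its decoded text."""
    texts = {}
    # sorted() keeps the numeric file-name prefixes (01-, 02-, ...) in order.
    for path in sorted(root.glob("*.txt")):
        texts[path.name] = path.read_text(encoding="utf-8")
    return texts
```

A call such as `load_corpus(Path("hazargi_corpus"))` would return a dictionary keyed by file name, which keeps each work separate in line with the one-file-per-container layout.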

File-Level Metadata

  • 01-قیسسای فۉلکولۉری آزرگی _Qissai folkloric-e-Azergi by Jibran - 50473.txt

  • 02-قئسته_Qasta (Poetry) - 22196.txt

  • 03-بئختی بیدار_Bakht-e- Bedar - folk - 74662.txt

  • 04-جامی تیلا_Jam-e-tilla by Farid - 30416.txt

  • 05-فلم نامه ناهید_Film Nama by Latif. H - 6538.txt

  • 06-خاشۂ سۆدیگئر_Khasha sawdigar by Salim Taban - 50798.txt

  • 07-کوی کی نفس میکشید_Koi-e- ki nafas miksha by M. Haidari-16491.txt

  • 08-نئقلای چوقنی_Naqlai chuqnai (folk) - 80655.txt

  • 09-وامی جئنگ_Wam-e Jang by Bashir Ghulami - 39859.txt

  • 10-قئستای دیل مئندیغو_Qastai dil mandighu by M. Elkhani - 20966.txt

  • 11-خوبی کو دۂ جولگه بیندئز_Khubi ku da julga bendaz by Ali Raza - 62572.txt

  • 12-نئقشی جادویی_Naqshey Jaduyi by Aziz Azra - 57857.txt
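
The file names above end in a number, and those numbers can be pulled out programmatically. The sketch below is illustrative only: the regular expression and function name are assumptions about the naming pattern, and reading the trailing number as a token count is an inference from the fact that the twelve values add up to the published corpus total.

```python
import re

# Leading two-digit index, then anything, then the last "-" followed by
# optional spaces and the trailing number before ".txt".
NAME_PATTERN = re.compile(r"^(\d{2})-.*?-\s*(\d+)\.txt$")


def parse_file_name(name: str) -> tuple:
    """Return (index, trailing_number) parsed from a corpus file name."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return int(match.group(1)), int(match.group(2))


# Trailing numbers copied from the twelve file names listed above.
TRAILING_NUMBERS = [50473, 22196, 74662, 30416, 6538, 50798,
                    16491, 80655, 39859, 20966, 62572, 57857]
print(sum(TRAILING_NUMBERS))  # 513483
```

The pattern tolerates both the ` - 50473.txt` and the `-16491.txt` spacing variants that occur in the list.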

Cleaning (Clean Layer)

  • UTF-8 encoding

  • Unicode normalization

  • White-space normalization

  • Punctuation cleanup

  • Removal of stray symbols and markup

  • Preservation of original dialectal spellings
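
The Unicode and white-space steps listed above could look roughly like the following. This is a sketch under stated assumptions rather than the pipeline actually used: the card does not say which normalization form was applied (NFC is assumed here), and the punctuation and stray-symbol rules are not specified, so they are left out.

```python
import re
import unicodedata


def clean_text(raw: str, form: str = "NFC") -> str:
    """Apply Unicode and white-space normalization to one document."""
    # Assumed form: NFC composes sequences such as alef + madda into a
    # single code point without changing how the text reads.
    text = unicodedata.normalize(form, raw)
    # Unify line endings before working line by line.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces, tabs, and no-break spaces within each line.
    # The zero-width non-joiner (U+200C) is deliberately left untouched
    # because it is part of Perso-Arabic spelling.
    lines = [re.sub(r"[ \t\u00a0]+", " ", line).strip()
             for line in text.split("\n")]
    # Reduce three or more consecutive newlines to a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
```

Nothing in this sketch rewrites letters, which is consistent with the stated goal of preserving original dialectal spellings.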

Additional Information

  • Resource Type: Monolingual Literary Corpus

  • Genre Coverage: Poetry, Folklore, Narrative, Drama

  • Dialect Coverage: Hazargi (Dari Persian variety)

  • Tokenization Unit: Word-level tokens

  • Total Files: 12
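
Word-level tokens can be approximated by splitting on white space. The sketch below assumes that definition, which this card does not confirm, so its counts may not reproduce the published figure exactly; punctuation attached to a word stays part of that token. Both function names are hypothetical.

```python
def word_tokens(text: str) -> list:
    """Split text into white-space-delimited word tokens."""
    # str.split() with no arguments splits on any run of Unicode white
    # space; the zero-width non-joiner is not white space, so forms
    # joined with it remain a single token.
    return text.split()


def count_tokens(texts: dict) -> dict:
    """Return a token count per file name."""
    return {name: len(word_tokens(text)) for name, text in texts.items()}
```

Summing the values returned by `count_tokens` over all twelve files gives a figure that can be compared against the corpus total as a sanity check.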

Sample Text

  • زئوار اۆقات شی تئلخ خاکی مالی رئسید دۂ پېشی امزو اوشیار ۔ اوشار تا سون زئوار تۉخ کئد فامید کی چیز تۉرۂ یۂ۔

  • دۂ آخیر خۉشبئختانۂ بݷنی ازیا اېد جئنگ نئشود و اردو تئرئف سولخ کئد۔

  • پیرمرد کنار دیوار ایستاده بود و سر تکان داد: «کوه همیشه می‌بیند، حتی وقتی ما نمی‌بینیم.» این جمله، با صداهای محیط ترکیب شد و حس حضور یک قاضی نامرئی را در ذهن مردم تداعی کرد.

  • چی خانۂ کی از قئسری شایی یئگۉ رئقئم کئمی نئدئشت۔ امو رئقئم دئب و دیستگا امو نوربئندی و آراییش۔ او قئسر رۂ پݷوئند غئریب خانۂ موگوفت بادشا کی دۂ اونجی پای خو اېشت بېخی اوش از سئر شی رافت۔ مېنی قئسر آر خېل او رئقئم مالوم موشود کی بوگی جای جئگې شایی بئشۂ۔

  • پیشی کی بېخی بئچے وئزیر پئسکی اماد گوفت دوختئر موگیۂ قئددی از مۂ قئد بادشا 4 رۉز مئندۂ کی پورۂ شونۂ دۂ اېنمزی چار رۉز بئلدې خئلاس کیدۉنی از مۂ ار کار کی میتنید کئنید اېنمی چار رۉر کی تېر شونۂ دیگۂ کار از کئردۉ تېر موشۂ.

Total Token Count

513,483 tokens (~0.5M)