Dari Literature Corpus by Anjuman e Adabi Nayestan

License:

CC-BY-NC-4.0

Steward:

Aim Foundation

Task: NLP

Release Date: 3/5/2026

Format: TXT

Size: 12.67 MB


Description

The Dari Literature Corpus (Anjuman e Adabi Nayestan) is a curated collection of written Dari (Afghan Persian) literary texts totaling about 1 million tokens. It includes prose, poetry, folklore-inspired narratives, and other culturally significant writings from both contemporary and classical traditions. The texts were collected as Microsoft Word documents and converted into UTF-8 encoded, Unicode-normalized plain text for computational and linguistic research, including corpus linguistics, digital humanities, and NLP.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is provided for research, educational, and non-commercial purposes only. Proper acknowledgment of the Anjuman e Adabi Nayestan is required in any publication or derivative work.

Forbidden Usage

  • The dataset must not be used for generating harmful, hateful, or discriminatory content.

  • It must not be used for surveillance, profiling, or targeting individuals or communities.

  • Users must not misrepresent authorship or cultural context of the texts.

  • Redistribution for commercial gain without permission is prohibited.

Processes

Ethical Review

The corpus consists of literary and cultural texts intended for public readership. No personal, private, or sensitive data was intentionally collected. The dataset was prepared with care to preserve linguistic authenticity while ensuring suitability for academic and research use. Users are expected to follow ethical research standards and respect the cultural and literary integrity of the materials.

Intended Use

This dataset is intended for:

  • Linguistic and corpus-based research

  • Literary and cultural studies

  • Natural Language Processing (NLP)

  • Language modeling and machine translation research

  • Digital preservation of Dari literary heritage

  • Academic and educational purposes

Metadata

Language

Dari (دری), also known as Afghan Persian or Eastern Persian, is a Southwestern Iranian language and one of the two official languages of Afghanistan. It serves as a national lingua franca and shares high mutual intelligibility with Iranian Persian and Tajiki Persian.

Writing Script:
Perso-Arabic script used in Afghan Persian literary and academic writing.

Source / Publisher

Anjuman e Adabi Nayestan

Data Format

  • Originally: Microsoft Word documents (.doc/.docx)

  • Cleaned Version: Plain text (.txt)

  • UTF-8 encoded

  • Unicode normalized

Domains of the Text

  • Literature (creative writing)

  • Poetry

  • Folklore and narrative texts

  • Cultural and social themes

  • Classical and modern literary works

Corpus Overview

This dataset contains approximately 1 million tokens of written Dari literary text compiled into 25 individual files. Each file represents a standalone literary work or thematic collection.

Dataset Structure

  • Total files: 25

  • Each file is treated as a separate genre or literary container.

  • Original format: Microsoft Word documents

  • Cleaned format: UTF-8 plain text files
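
The structure above (a flat set of UTF-8 plain text files) can be read with a few lines of Python. This is a minimal sketch, not part of the dataset's tooling: the function names `load_corpus` and `whitespace_token_count` are hypothetical, the directory path is whatever location the files were downloaded to, and whitespace splitting is only a rough proxy for whatever tokenizer produced the card's token figure.

```python
from pathlib import Path


def load_corpus(root):
    """Return {filename: text} for every .txt file directly under root."""
    texts = {}
    for path in sorted(Path(root).glob("*.txt")):
        # The card states that the cleaned files are UTF-8 plain text.
        texts[path.name] = path.read_text(encoding="utf-8")
    return texts


def whitespace_token_count(text):
    """Count whitespace-separated tokens (an assumption, not the card's tokenizer)."""
    return len(text.split())
```

Summing `whitespace_token_count` over the values of `load_corpus(...)` gives a quick sanity check against the stated corpus size, though the result may differ from the card's figure depending on how tokens were originally counted.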

Dari Script

ا آ ب پ ت ث ج چ ح خ د ذ ر ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه ی

File-level Metadata

01-آرایشها و پیرایشهای ادبی سخن | Arayishha wa Perayishhai Adabi Sukhan by Hanif Bakhtyari - 39639.txt
02-بی تو یک لحظه زندگی | Be Tu Yak Lahza Zindagi by Hanif Bakhtyari - 22923.txt
03-پس از شانزده ماه آوارگی | Pas az Shanzdah Mah Awaragi by Hanif Bakhtyari - 25650.txt
04-ترنُم غزل | Tarnum-e-Ghazal by Hanif Bakhtyari - 31311.txt
05-چکید ه ی خیال | Chakida-ye Khiyal by Hanif Bakhtyari - 14218.txt
06-حدیث آرزو | Hadis-e-Arzoo by Hanif Bakhtyari - 16611.txt
07-حکایتهای عبرتی | Hikayat-haye Ibrati by Hanif Bakhtyari - 121013.txt
08-درکویرهملاله می روید | Dar Kawir Ham Lala Merooyad by Hanif Bakhtyari - 35661.txt
09-دستور کلاسیک زبان دری | Dastoor-e-Klasik Zaban-e-Dari by Hanif Bakhtyari - 21620.txt
10-رازها و روزها | Razha wa Rozha by Hanif Bakhtyari - 20556.txt
11-رنج | Ranj by Richard Calm - 211967.txt
12-رنگ زندگي يا زنگ زندگي | Rang-e Zendagi ya Zang-e Zendagi by Hanif Bakhtyari - 18375.txt
13-زنبق كوهي | Zanbaq-e Kohi by Hanif Bakhtyari - 11117.txt
14-ساحل لا حاصل | Sahil-e La Hasil by Farzad - 90026.txt
15-سایه در جنگل | Saya dar Jangal by Hanif Bakhtyari - 17063.txt
16-غروب انصاف | Ghuroob-e Ensaf by Hanif Bakhtyari - 17348.txt
17-فرياد نا شنيده | Faryad-e Na Shenida by Hanif Bakhtyari - 16273.txt
18-قراضه دو | Qaraza-ye Do by Hanif Bakhtyari - 22694.txt
19-قراضه یک | Qaraza-ye Yak by Hanif Bakhtyari - 24438.txt
20-كرانه سبز | Karana-ye Sabz by Hanif Bakhtyari - 13649.txt
21-لحظه ها میمیرند مکس | Lahza-ha Mimirand Max by Hanif Bakhtyari - 44337.txt
22-لحظه ها میمیرند | Lahza-ha Mimirand by Hanif Bakhtyari - 23654.txt
23-مکان در لا مکان | Makan Dar La Makan by Nawid Fidayee - 99722.txt
24-وفریاد سهم ماست | Wa Faryad Sahm-e Maast by Hanif Bakhtyari - 16262.txt
25-یک سبد غزل | Yak Sabad Ghazal by Hanif Bakhtyari - 30310.txt
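
The file names listed above follow a regular pattern: a two-digit index, the Dari title, a Latin transliteration, the author, and a trailing number. The sketch below splits a name into those parts. The function name `parse_filename` and the field names are hypothetical. The card does not say what the trailing number means; the 25 listed values sum to 1,006,437, which is consistent with the stated size of about 1 million tokens, so they are plausibly per-file token counts, but that reading is an assumption.

```python
import re

# Pattern inferred from the listing: "NN-<title> | <transliteration> by <author> - <number>.txt"
FILENAME_RE = re.compile(
    r"^(?P<index>\d{2})-(?P<title>.+?) \| (?P<translit>.+) by (?P<author>.+) - (?P<number>\d+)\.txt$"
)


def parse_filename(name):
    """Split a corpus file name into its parts, or return None if it does not match."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return {
        "index": int(m["index"]),
        "title": m["title"].strip(),
        "transliteration": m["translit"].strip(),
        "author": m["author"].strip(),
        # Assumed (not stated in the card) to be a per-file token count.
        "trailing_number": int(m["number"]),
    }


info = parse_filename("11-رنج | Ranj by Richard Calm - 211967.txt")
print(info["transliteration"], info["author"], info["trailing_number"])
```

Names that deviate from the pattern return `None` rather than raising, so a caller can decide how to handle them.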

Cleaning and Processing

  • UTF-8 encoding

  • Unicode normalization

  • Whitespace and punctuation cleanup

  • Removal of stray symbols and formatting artifacts

  • No semantic alteration of text
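
For readers who want to apply comparable cleanup to their own Dari text, the steps above might look roughly like the following. This is a sketch under stated assumptions, not the pipeline used to build the corpus: the card does not name the normalization form (NFC is assumed here), and the function name `clean_text` is hypothetical. The optional `unify_letters` step, which maps Arabic yeh and kaf to their Persian counterparts, is an addition that is off by default; some titles in the file listing appear to mix the two forms, so downstream users may find it useful.

```python
import re
import unicodedata

# Optional mapping of Arabic code points to the forms usual in Persian/Dari text.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def clean_text(text, unify_letters=False):
    """Normalize Unicode and tidy whitespace without changing the wording."""
    # Assumption: NFC; the card only says "Unicode normalized".
    text = unicodedata.normalize("NFC", text)
    if unify_letters:
        text = text.translate(ARABIC_TO_PERSIAN)
    # Collapse runs of spaces, tabs, and no-break spaces; line breaks are kept.
    text = re.sub(r"[ \t\u00A0]+", " ", text)
    # Trim each line. str.strip() leaves the zero-width non-joiner (U+200C) intact.
    return "\n".join(line.strip() for line in text.split("\n"))
```

Line breaks are preserved deliberately, since they carry verse structure in poetry.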

Sample Text

محمد حنيف بختياري فرزند مرحوم حاج كلبي حسن وكيل بعد از ظهر یکی از روزهای گرم سپتامبر بود. تالی، جنیفر و ژولی در خانه خیابان سان ست کورت دور میز آشپزخانه نشسته بودند. امشب هوای وصلت تو در سرم زده تیری ز عشق تُُست كه اندر پرم زده مده آزار كه از پرده بیرون افتد كا ر غصه كمتر ده كه از شكوه مرا آيد عار زندگي را چون چراغي یافتم مست ياش مستي ایاغي یافتم این معما حل نشد آسان مرا