Dari Literature Corpus by Anjuman e Adabi Nayestan
License:
CC-BY-NC-4.0
Steward:
Aim Foundation
Task: NLP
Release Date: 3/5/2026
Format: TXT
Size: 12.67 MB
Description
The Dari Literature Corpus (Anjuman e Adabi Nayestan) is a curated collection of written Dari (Afghan Persian) literary texts totaling about 1 million tokens. It includes prose, poetry, folklore-inspired narratives, and other culturally significant writings from both contemporary and classical traditions. The texts were collected as Microsoft Word documents and converted into UTF-8 encoded, Unicode-normalized plain text for computational and linguistic research, including corpus linguistics, digital humanities, and NLP.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
This dataset is provided for research, educational, and non-commercial purposes only. Proper acknowledgment of the Anjuman e Adabi Nayestan is required in any publication or derivative work.
Forbidden Usage
- The dataset must not be used for generating harmful, hateful, or discriminatory content.
- It must not be used for surveillance, profiling, or targeting individuals or communities.
- Users must not misrepresent authorship or cultural context of the texts.
- Redistribution for commercial gain without permission is prohibited.
Processes
Ethical Review
The corpus consists of literary and cultural texts intended for public readership. No personal, private, or sensitive data was intentionally collected. The dataset was prepared with care to preserve linguistic authenticity while ensuring suitability for academic and research use. Users are expected to follow ethical research standards and respect the cultural and literary integrity of the materials.
Intended Use
This dataset is intended for:
- Linguistic and corpus-based research
- Literary and cultural studies
- Natural Language Processing (NLP)
- Language modeling and machine translation research
- Digital preservation of Dari literary heritage
- Academic and educational purposes
Metadata
Language
Dari (دری), also known as Afghan Persian or Eastern Persian, is a Southwestern Iranian language and one of the two official languages of Afghanistan. It serves as a national lingua franca and is highly mutually intelligible with Iranian Persian and Tajiki Persian.
Writing Script:
Perso-Arabic script used in Afghan Persian literary and academic writing.
Source / Publisher
Anjuman e Adabi Nayestan
Data Format
Originally: Microsoft Word documents (.doc/.docx)
Cleaned Version: Plain text (.txt)
UTF-8 encoded
Unicode normalized
Domains of the Text
Literature (creative writing)
Poetry
Folklore and narrative texts
Cultural and social themes
Classical and modern literary works
Corpus Overview
This dataset contains approximately 1 million tokens of written Dari literary text compiled into 25 individual files. Each file represents a standalone literary work or thematic collection.
Dataset Structure
Total files: 25
Each file is treated as a separate genre or literary container.
Original format: Microsoft Word documents
Cleaned format: UTF-8 plain text files
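Given the structure above (25 UTF-8 plain text files), loading the corpus can be sketched as follows. This is a minimal sketch, not part of the dataset release: the function names and the `corpus_dir` argument are hypothetical, and the whitespace-based count is only a rough proxy, since the card does not state how tokens were counted.

```python
from pathlib import Path


def load_corpus(corpus_dir):
    """Return {file name: text} for every .txt file in corpus_dir.

    corpus_dir is a hypothetical local folder holding the corpus files.
    """
    texts = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        # The card states that the cleaned files are UTF-8 encoded.
        texts[path.name] = path.read_text(encoding="utf-8")
    return texts


def whitespace_token_counts(texts):
    """Count whitespace-separated tokens per file (a rough proxy only)."""
    return {name: len(text.split()) for name, text in texts.items()}
```

Calling `whitespace_token_counts(load_corpus("path/to/corpus"))` returns one count per file; the totals may differ from the figures on this card if a different tokenization was used.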
Dari Script
ا آ ب پ ت ث ج چ ح خ د ذ ر ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه ی
File-level Metadata
01-آرایشها و پیرایشهای ادبی سخن | Arayishha wa Perayishhai Adabi Sukhan by Hanif Bakhtyari - 39639.txt
02-بی تو یک لحظه زندگی | Be Tu Yak Lahza Zindagi by Hanif Bakhtyari - 22923.txt
03-پس از شانزده ماه آوارگی | Pas az Shanzdah Mah Awaragi by Hanif Bakhtyari - 25650.txt
04-ترنُم غزل | Tarnum-e-Ghazal by Hanif Bakhtyari - 31311.txt
05-چکید ه ی خیال | Chakida-ye Khiyal by Hanif Bakhtyari - 14218.txt
06-حدیث آرزو | Hadis-e-Arzoo by Hanif Bakhtyari - 16611.txt
07-حکایتهای عبرتی | Hikayat-haye Ibrati by Hanif Bakhtyari - 121013.txt
08-درکویرهملاله می روید | Dar Kawir Ham Lala Merooyad by Hanif Bakhtyari - 35661.txt
09-دستور کلاسیک زبان دری | Dastoor-e-Klasik Zaban-e-Dari by Hanif Bakhtyari - 21620.txt
10-رازها و روزها | Razha wa Rozha by Hanif Bakhtyari - 20556.txt
11-رنج | Ranj by Richard Calm - 211967.txt
12-رنگ زندگي يا زنگ زندگي | Rang-e Zendagi ya Zang-e Zendagi by Hanif Bakhtyari - 18375.txt
13-زنبق كوهي | Zanbaq-e Kohi by Hanif Bakhtyari - 11117.txt
14-ساحل لا حاصل | Sahil-e La Hasil by Farzad - 90026.txt
15-سایه در جنگل | Saya dar Jangal by Hanif Bakhtyari - 17063.txt
16-غروب انصاف | Ghuroob-e Ensaf by Hanif Bakhtyari - 17348.txt
17-فرياد نا شنيده | Faryad-e Na Shenida by Hanif Bakhtyari - 16273.txt
18-قراضه دو | Qaraza-ye Do by Hanif Bakhtyari - 22694.txt
19-قراضه یک | Qaraza-ye Yak by Hanif Bakhtyari - 24438.txt
20-كرانه سبز | Karana-ye Sabz by Hanif Bakhtyari - 13649.txt
21-لحظه ها میمیرند مکس | Lahza-ha Mimirand Max by Hanif Bakhtyari - 44337.txt
22-لحظه ها میمیرند | Lahza-ha Mimirand by Hanif Bakhtyari - 23654.txt
23-مکان در لا مکان | Makan Dar La Makan by Nawid Fidayee - 99722.txt
24-وفریاد سهم ماست | Wa Faryad Sahm-e Maast by Hanif Bakhtyari - 16262.txt
25-یک سبد غزل | Yak Sabad Ghazal by Hanif Bakhtyari - 30310.txt
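The file names above follow a visible pattern: a two-digit index, the Dari title, a Latin transliteration, the author, and a trailing number. A minimal sketch for splitting a name into those fields is shown below. The `FileEntry` type and `parse_file_name` function are hypothetical, and the field name `count` is an assumption: the card does not say what the trailing number measures.

```python
import re
from typing import NamedTuple, Optional


class FileEntry(NamedTuple):
    index: int
    title: str            # Dari title as written in the file name
    transliteration: str  # Latin-script rendering of the title
    author: str
    count: int            # trailing number; its meaning is not documented


# Pattern: "NN-<title> | <transliteration> by <author> - <number>.txt"
FILE_NAME_RE = re.compile(
    r"^(?P<index>\d+)-(?P<title>.+?)\s*\|\s*"
    r"(?P<translit>.+)\s+by\s+(?P<author>.+?)\s+-\s+(?P<count>\d+)\.txt$"
)


def parse_file_name(name: str) -> Optional[FileEntry]:
    """Parse one corpus file name; return None if it does not match."""
    match = FILE_NAME_RE.match(name)
    if match is None:
        return None
    return FileEntry(
        index=int(match.group("index")),
        title=match.group("title"),
        transliteration=match.group("translit"),
        author=match.group("author"),
        count=int(match.group("count")),
    )
```

For example, `parse_file_name("11-رنج | Ranj by Richard Calm - 211967.txt")` yields index 11, author `Richard Calm`, and the number 211967.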
Cleaning and Processing
UTF-8 encoding
Unicode normalization
Whitespace and punctuation cleanup
Removal of stray symbols and formatting artifacts
No semantic alteration of text
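The steps listed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the pipeline actually used for this corpus: the card does not name the Unicode normalization form (NFC is assumed here), and the specific whitespace and control-character rules are guesses at what "cleanup" might involve. The function name `clean_text` is hypothetical.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize and tidy one document without changing its wording."""
    # Assumption: NFC. The card says "Unicode normalized" without naming a form.
    text = unicodedata.normalize("NFC", raw)
    # Drop a byte order mark left over from document conversion.
    text = text.replace("\ufeff", "")
    # Use "\n" for all line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Remove control characters (category Cc) except newline and tab.
    # Format characters such as the zero-width non-joiner (U+200C) are
    # category Cf and are kept, because they are meaningful in Persian script.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Treat non-breaking spaces as ordinary spaces, then collapse runs.
    text = text.replace("\u00a0", " ")
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line and limit consecutive blank lines to one.
    text = "\n".join(line.strip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The sketch deliberately leaves letter variants alone: the sample text appears to contain both Arabic and Persian forms of yeh and kaf, and unifying them would be an additional decision that the card does not document.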
Sample Text
محمد حنيف بختياري فرزند مرحوم حاج كلبي حسن وكيل بعد از ظهر یکی از روزهای گرم سپتامبر بود. تالی، جنیفر و ژولی در خانه خیابان سان ست کورت دور میز آشپزخانه نشسته بودند. امشب هوای وصلت تو در سرم زده تیری ز عشق تُُست كه اندر پرم زده مده آزار كه از پرده بیرون افتد كا ر غصه كمتر ده كه از شكوه مرا آيد عار زندگي را چون چراغي یافتم مست ياش مستي ایاغي یافتم این معما حل نشد آسان مرا