Aim Foundation Dari Literature Corpus
License: CC-BY-NC-4.0
Steward: Aim Foundation
Task: NLP
Release Date: 12/3/2025
Format: TXT
Size: 1.74 MB
Description
This corpus is a collection of more than seven hundred thousand tokens of Dari-language text. It contains literary works, both fiction and non-fiction, including poems, stories, novels, and various articles. The data is shared with the approval of the authors. The corpus aims to support linguistic research, language technology development, and cultural preservation.
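The token count stated above can be roughly checked with a few lines of Python. This is a minimal sketch under two assumptions that the card does not confirm: that the corpus is available locally as one or more UTF-8 .txt files, and that whitespace splitting is an acceptable approximation of how tokens were counted. The function names are illustrative, not part of the dataset.

```python
from pathlib import Path


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens in a string."""
    return len(text.split())


def count_corpus_tokens(corpus_dir: str) -> int:
    """Sum whitespace token counts over every .txt file under corpus_dir.

    Assumes the files are UTF-8 encoded; the dataset card states only
    that the format is TXT.
    """
    total = 0
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        total += count_tokens(path.read_text(encoding="utf-8"))
    return total
```

Calling count_corpus_tokens on the directory that holds the downloaded files returns a single integer. A different tokenizer, for example one that splits on the zero-width non-joiner or on punctuation, may give a noticeably different figure.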
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
The data may not be used by any organization with annual revenue of more than one million US dollars.
Forbidden Usage
Generating, promoting, or distributing hate speech, misinformation, or culturally offensive content. Commercial or for-profit projects without explicit permission from the dataset creator. Any use that misrepresents or distorts Dari literature or the works contained in the corpus.
Processes
Ethical Review
The dataset was curated from publicly available or author-shared Dari literary sources and underwent an ethical self-review by the Aim Foundation. There is no sensitive or copyrighted material in this corpus. The collection aligns with CC-BY-NC-4.0 and with principles of cultural respect, transparency, and non-commercial research use.
Intended Use
This dataset is intended for research, education, and non-commercial use in NLP, computational linguistics, and digital humanities. It may be used for training, fine-tuning, or evaluating models for Dari language processing, and for linguistic and literary analysis that supports cultural preservation.
Metadata
Language
Dari is a variety of the Persian language spoken in Afghanistan, where it is one of two official languages alongside Pashto. It is also known as Afghan Persian or Eastern Persian and is closely related to other Persian dialects, particularly Tajik Persian. Dari is written in a modified Arabic script, functions as a lingua franca in Afghanistan, and is also spoken by communities in neighboring countries like Iran and Pakistan.
Content of the Corpus
The corpus contains the following books by multiple Dari-language authors:
Chehl Dukhtaran
Dukhyar E Wazir
Dilbar E Wahshi
Dukhtar E Wazir
Ghazal E Ghazal
Ishq Wa Intihar
Murwarid Dar Murdab
Padar E Poldar
Raqs Dar Masjid
Tikkahai Az Yak Gull E Mujasam
Adam Ba Mazluniyat Sazawar
Afghanistan Dar Ter Ras
Az Yad Beburdand
Bunyad E Amoozish E Inqilabi
Chashm Dar Chashme Tarikh
Insan Wa Haiwan
Ishtibahat E Amanullah Khan
Maa Wa Insan Dosti
Qudrat E Hal
Ranjhai In Rozhai Zaranj
Roomaan
Roz E Farhang E Hazara
Turkiya Dar Guzare Digital
Yadgar E Az Nasl Kushi Hazara
Zaban Ba Masabahe Ebraz
List of letters
ا آ ب پ ت ث ج چ ح خ د ذ ر ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه ی
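A letter inventory like the one above can be used to spot characters in the text files that may need normalization before training. The sketch below is illustrative only. The normalization table (dropping the tatweel and mapping Arabic yeh and kaf to their Persian forms) is a common preprocessing assumption, not something the dataset documentation prescribes, and the names DARI_LETTERS, normalize, and unexpected_characters are hypothetical.

```python
# The 32 standard Persian letters plus alef with madda, written out by hand.
DARI_LETTERS = set("اآبپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی")

# Assumed normalization table; not part of the dataset documentation.
NORMALIZE = str.maketrans({
    "\u0640": None,      # tatweel (kashida): typographic stretching, dropped
    "\u064A": "\u06CC",  # Arabic yeh -> Farsi yeh
    "\u0643": "\u06A9",  # Arabic kaf -> keheh
})


def normalize(text: str) -> str:
    """Apply the assumed character-level normalization."""
    return text.translate(NORMALIZE)


def unexpected_characters(text: str) -> set[str]:
    """Return alphabetic characters, after normalization, not in DARI_LETTERS."""
    return {
        ch for ch in normalize(text)
        if ch.isalpha() and ch not in DARI_LETTERS
    }
```

Running unexpected_characters over a file reports every alphabetic character outside the list, including Latin letters and hamza-carrying forms such as ئ. Diacritics, digits, punctuation, and the zero-width non-joiner are ignored because str.isalpha returns False for them. Whether to normalize at all is a design choice, since some analyses may want the text exactly as published.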
Sample Text
فلسفهٔ مارکسيستی و مختصات آن فلسـفه و از آن جمله فلسـفهٔ مارکسيسـتی شـکل خاصی از شـعور )و يا آگاهی( اجتماعی اسـت که عامترين قانونمندیهای جهان هسـتی و معرفت انسـانی و رابطهٔ بين هسـتی و تفکر را بررسی میکند. فلسفه عامترين روابط و مناسـبات اشـياء و پديدهها را که در همهٔ انواع و اقسام عرصههای واقعيت بروز میکند مورد بررسی قرار میدهد. فلسفه نسبت به علوم ديگر در حکم اسلوب و متدولوژی عام آنهاست. موضوع فلسـفه در سـير تکامل تاريخ عوض شده اسـت. نخست فلسفه بـه مثابـه علم علـوم و جامع کل معارف بشـری بود. سـپس بـه تدريج علوم طبيعـی ماننـد فيزيـک و شـيمی و طبيعيات و غيـره از آن تفکيک شـد. آنگاه علـوم اجتماعی نيـز هر يک به مثابه علم مسـتقل و جداگانهای از آن انفکاک يافتند. ولی بر خلاف دعوی پوزيتويسـتها که میگويند ديگر برای فلسـفه جائی نمانده و اين رشـته از معرفت ديگر توخالی اسـت و يا آنکه فوقش بايد بـه بحـث های منطقی ـ زبانی بپردازد، فلسـفه چنانکه گفتيم به عنوان مدخل اسـلوبی بر علوم )اعم از علوم طبيعی و اجتماعی( جائی بسـيار مهم و بالا و ضرور دارد. البته آن فلسفهای که از علوم برخيزد و به علوم مدد رساند و آن
