Aim Foundation Dari Literature Corpus

License:

CC-BY-NC-4.0

Steward:

Aim Foundation

Task: NLP

Release Date: 12/3/2025

Format: TXT

Size: 1.74 MB


Description

This corpus contains more than seven hundred thousand tokens of Dari-language text. It comprises works of literature, including poems, stories, and novels, both fiction and non-fiction, as well as various articles. The data is shared with the approval of the authors. The corpus aims to support linguistic research, language technology development, and cultural preservation.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

The data may not be used by any organization with annual revenue of more than one million USD.

Forbidden Usage

Generating, promoting, or distributing hate speech, misinformation, or culturally offensive content. Commercial or for-profit projects without explicit permission from the dataset creator. Any use that misrepresents or distorts Dari literature or the works contained within.

Processes

Ethical Review

The dataset was curated from publicly available or author-shared Dari literary sources under ethical self-review by Aim Foundation. There is no sensitive or copyrighted material in this corpus. The collection aligns with CC-BY-NC-4.0 and principles of cultural respect, transparency, and non-commercial research use.

Intended Use

This dataset is intended for research, education, and non-commercial use in NLP, computational linguistics, and digital humanities. It may be used for training, fine-tuning, or evaluating models for Dari language processing, and for linguistic and literary analysis supporting cultural preservation.

Metadata

Language

Dari is a variety of the Persian language spoken in Afghanistan, where it is one of two official languages alongside Pashto. It is also known as Afghan Persian or Eastern Persian and is closely related to other varieties of Persian, particularly Tajik Persian. Dari is written in a modified Arabic script, functions as a lingua franca in Afghanistan, and is also spoken by communities in neighboring countries such as Iran and Pakistan.

Content of the Corpus

The corpus contains the following books by multiple Dari-language authors:

  1. Chehl Dukhtaran

  2. Dukhyar E Wazir

  3. Dilbar E Wahshi

  4. Dukhtar E Wazir

  5. Ghazal E Ghazal

  6. Ishq Wa Intihar

  7. Murwarid Dar Murdab

  8. Padar E Poldar

  9. Raqs Dar Masjid

  10. Tikkahai Az Yak Gull E Mujasam

  11. Qudrat E Hal

  12. Adam Ba Mazluniyat Sazawar

  13. Afghanistan Dar Ter Ras

  14. Az Yad Beburdand

  15. Bunyad E Amoozish E Inqilabi

  16. Chashm Dar Chashme Tarikh

  17. Insan Wa Haiwan

  18. Ishtibahat E Amanullah Khan

  19. Maa Wa Insan Dosti

  20. Qudrat E Hal

  21. Ranjhai In Rozhai Zaranj

  22. Roomaan

  23. Roz E Farhang E Hazara

  24. Turkiya Dar Guzare Digital

  25. Yadgar E Az Nasl Kushi Hazara

  26. Zaban Ba Masabahe Ebraz

List of letters

ا آ ب پ ت ث ج چ ح خ د ذ ر ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه ی
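The letter list can double as a quick sanity check when loading the corpus. The following is a minimal Python sketch, not part of the dataset: the function name `letters_outside_alphabet` is hypothetical, and the set mirrors the letter list in this section with ل (lam) included, since that letter occurs in the sample text. It counts letter characters that fall outside the set, which may help spot Arabic-script variants or stray Latin text.

```python
import unicodedata
from collections import Counter

# Assumption: the Dari letter inventory from this card, with lam (ل) included.
ALPHABET = set("اآبپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی")


def letters_outside_alphabet(text: str) -> Counter:
    """Count letter characters in `text` that are not in ALPHABET.

    Digits, whitespace, punctuation, and combining marks are ignored,
    because only Unicode categories starting with "L" are considered.
    """
    return Counter(
        ch
        for ch in text
        if unicodedata.category(ch).startswith("L") and ch not in ALPHABET
    )


# Hypothetical usage on a short string rather than the real corpus file:
# report = letters_outside_alphabet(corpus_text)
# for ch, count in report.most_common(10):
#     print(f"U+{ord(ch):04X}", unicodedata.name(ch, "?"), count)
```

A non-empty result does not necessarily indicate an error; it only flags characters worth inspecting before tokenization.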

Sample Text

فلسفهٔ مارکسيستی و مختصات آن فلسـفه و از آن جمله فلسـفهٔ مارکسيسـتی شـکل خاصی از شـعور )و يا آگاهی( اجتماعی اسـت که عامترين قانونمندیهای جهان هسـتی و معرفت انسـانی و رابطهٔ بين هسـتی و تفکر را بررسی میکند. فلسفه عامترين روابط و مناسـبات اشـياء و پديدهها را که در همهٔ انواع و اقسام عرصههای واقعيت بروز میکند مورد بررسی قرار میدهد. فلسفه نسبت به علوم ديگر در حکم اسلوب و متدولوژی عام آنهاست. موضوع فلسـفه در سـير تکامل تاريخ عوض شده اسـت. نخست فلسفه بـه مثابـه علم علـوم و جامع کل معارف بشـری بود. سـپس بـه تدريج علوم طبيعـی ماننـد فيزيـک و شـيمی و طبيعيات و غيـره از آن تفکيک شـد. آنگاه علـوم اجتماعی نيـز هر يک به مثابه علم مسـتقل و جداگانهای از آن انفکاک يافتند. ولی بر خلاف دعوی پوزيتويسـتها که میگويند ديگر برای فلسـفه جائی نمانده و اين رشـته از معرفت ديگر توخالی اسـت و يا آنکه فوقش بايد بـه بحـث های منطقی ـ زبانی بپردازد، فلسـفه چنانکه گفتيم به عنوان مدخل اسـلوبی بر علوم )اعم از علوم طبيعی و اجتماعی( جائی بسـيار مهم و بالا و ضرور دارد. البته آن فلسفهای که از علوم برخيزد و به علوم مدد رساند و آن
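The sample above appears to contain tatweel (kashida, U+0640) elongation marks and the Arabic letter ي (U+064A) where Dari text more commonly uses ی (U+06CC). Users may want to normalize such characters before tokenization. The following Python sketch shows one possible approach; the function name `normalize_dari` and the choice of mappings are assumptions, not a procedure documented by the dataset steward.

```python
import re

# Assumed mappings: delete tatweel and fold two Arabic code points
# into the forms commonly used in Persian/Dari text.
CHAR_MAP = str.maketrans({
    "\u0640": None,      # ARABIC TATWEEL (kashida): removed
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize_dari(text: str) -> str:
    """Apply the character mappings, then collapse runs of whitespace."""
    text = text.translate(CHAR_MAP)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical usage:
# cleaned = normalize_dari(raw_line)
```

Note that this sketch also removes a tatweel used as a standalone dash between words, so a different rule may be preferable where that punctuation matters.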