Anjuman-e-Katib Farsi/Persian Literature Corpus
License:
CC-BY-NC-4.0
Steward:
Anjuman-e-Katib
Task: NLP
Release Date: 12/3/2025
Format: TXT
Size: 2.82 MB
Description
This corpus is a collection of more than one million tokens of Farsi/Persian text. It contains works of literature, both fiction and non-fiction, including novels, poems, and more. The data is shared with the approval of the authors and aims to support linguistic research, language technology development, and the preservation of the language.
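The token figure above can be checked roughly against a local copy of the TXT file. The sketch below is illustrative only: the card does not say how tokens were counted, so whitespace splitting is an assumption, as are the UTF-8 encoding and the hypothetical file name in the usage note.

```python
def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens (an assumed tokenization)."""
    return len(text.split())


def count_tokens_in_file(path: str, encoding: str = "utf-8") -> int:
    """Sum whitespace-separated tokens over a text file, line by line.

    The UTF-8 default is an assumption; the card only states the format as TXT.
    """
    total = 0
    with open(path, encoding=encoding) as handle:
        for line in handle:
            total += count_tokens(line)
    return total
```

For example, `count_tokens_in_file("corpus.txt")` would return the total for a file saved under that hypothetical name. A different tokenizer, such as one that splits off punctuation, will give a different count.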
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
The data cannot be used by any organization with annual revenue of more than one million USD.
Forbidden Usage
Generating, promoting, or distributing hate speech, misinformation, or culturally offensive content. Commercial or for-profit projects without explicit permission from the dataset creator. Any use that misrepresents or distorts Persian literature or the works contained within.
Processes
Ethical Review
The dataset was curated from publicly available or author-shared Persian literary sources under ethical self-review by Anjuman-e-Katib. There is no sensitive or copyrighted material in this corpus. The collection aligns with CC-BY-NC-4.0 and with principles of cultural respect, transparency, and non-commercial research use.
Intended Use
This dataset is intended for research, education, and non-commercial use in NLP, computational linguistics, and digital humanities. It may be used for training, fine-tuning, or evaluating models for Persian language processing, and for linguistic and literary analysis supporting cultural preservation.
Metadata
Language
Farsi, also known as Persian, is an Indo-European language spoken primarily in Iran, Afghanistan, Tajikistan, and Pakistan. It is the official language of Iran and one of the world's major languages. The language is written in two scripts: the Arabic script, and the Cyrillic script, which is used in Tajikistan, where it was adopted in 1939 from the Russian script.
Contents of the Corpus
The corpus contains the following books:
17 Qanoon-e-Muafaqiyat
Ishq Aab-e-Hayat Ast
Qullaha wa Darraha
Ilm Cheest, Falsafa Cheest
Sarzamin-e-Jamila
Afsanahai Karizma
Arabai Khudayan
Gurghai Dundar
Insan Dar Justujoi Maana
Masnawi-e-Ma‘nawi
Nahjul Balagha
Razhai Rawan Shinasi Tareekh
Taali
Alphabet
ا , ب , پ , ت , ث , ج , چ , ح , خ , د , ذ , ر , ز , ژ , س , ش , ص , ض , ط , ظ , ع , غ , ف , ق , ک , گ , ل , م , ن , و , ه , ی
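Persian text often mixes in the Arabic code points for kaf (U+0643) and yeh (U+064A), and the sample text in this card appears to do so too. The sketch below, which is not part of the corpus tooling, maps those two code points to the Persian letters listed above and reports any remaining letters outside that list. The function names and the choice to normalize only these two characters are assumptions made for illustration.

```python
import unicodedata
from collections import Counter

# The 32 letters listed in this card.
PERSIAN_ALPHABET = set("ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی")

# Arabic code points mapped to their Persian counterparts (assumed policy).
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize_persian(text: str) -> str:
    """Replace Arabic yeh and kaf with the Persian forms."""
    return text.translate(ARABIC_TO_PERSIAN)


def letters_outside_alphabet(text: str) -> Counter:
    """Count letter characters that are not in the listed alphabet."""
    return Counter(
        ch
        for ch in text
        if unicodedata.category(ch).startswith("L") and ch not in PERSIAN_ALPHABET
    )
```

Running `letters_outside_alphabet(normalize_persian(text))` on a passage gives a quick view of which characters still fall outside the list. Letters such as آ (alef with madda) are not in the list above, so they will be reported; whether to treat them as valid is left to the user.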
Sample Text
پرويز، جوان بيست و دو سالهی با جثهی نحيف و لاغر، با موهای ژوليده و ريش نا تراشيده كه زيبايی صورتش را كم میكرد و تا حدی بد مینماياند- بر روی سنگ بزرگی پا گذاشت كه در كنار رود تن از آب بيرون كرده بود و نيمهای بيشترش را بالاتر از سطح رودخانه به باد و باران سپرده بود. او ندانسته و ناخواسته برجای پای تيمورشاه- كه ششصد سال پيش با لشكرش به چغچران رسيده بود و بر سر همين سنگ ايستاده بود و از آنجا به دژ تسخير ناپذيرش نگاه كرده بود- ايستاد؛ و مانند تيمورشاه- كه دستشرا سايهبان چشمها كرده بود- به چغچران و حومهاش نگاه كرده بود- دستش را سايهبان چشمها كرد و از فراز سنگ بهسوی كوههای سر بهفلك كشيدهای نگاه كرد، كه گردا گرد چغچران را مانند زنجيری بههم پيوسته، احاطه كرده بود و هيچگاهی نخميده بود و با تن استوار سالها را پشت سر گذاشته بود و بدون هراس از باد، باران، برف، توفان و لشكر كشیها آمد و رفت حيات را نظاره كرده بود.
