Anjuman-e-Katib Farsi/Persian Literature Corpus

License:

CC-BY-NC-4.0

Steward:

Anjuman-e-Katib

Task: NLP

Release Date: 12/3/2025

Format: TXT

Size: 2.82 MB

Description

This corpus is a collection of more than one million tokens of Farsi/Persian text. It contains works of literature, including novels, fiction and non-fiction works, poems, and more. The data is shared with the approval of the authors and aims to support linguistic research, language technology development, and preservation of the language.
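As a rough way to check the token count on a local copy of the TXT files, the following is a minimal sketch. It assumes whitespace tokenization, which may differ from the scheme used to arrive at the figure above; the function names are illustrative and not part of the dataset.

```python
from typing import Iterable


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens in a string.

    str.split() with no arguments splits on any run of Unicode
    whitespace. The zero-width non-joiner (U+200C), common inside
    Persian compound words, is not whitespace, so such words are
    counted as a single token.
    """
    return len(text.split())


def corpus_token_count(texts: Iterable[str]) -> int:
    """Sum token counts over several documents (e.g. one string per book)."""
    return sum(count_tokens(t) for t in texts)
```

Each book would be read with `open(path, encoding="utf-8").read()` and passed in as one string; UTF-8 is assumed here because the card does not state the encoding.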

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

The data may not be used by any organization with annual revenue of more than one million USD.

Forbidden Usage

Generating, promoting, or distributing hate speech, misinformation, or culturally offensive content. Commercial or for-profit projects without explicit permission from the dataset creator. Any use that misrepresents or distorts Persian literature or the works contained within.

Processes

Ethical Review

The dataset was curated from publicly available or author-shared Persian literary sources under ethical self-review by Anjuman-e-Katib. There is no sensitive or copyrighted material in this corpus. The collection aligns with CC-BY-NC-4.0 and principles of cultural respect, transparency, and non-commercial research use.

Intended Use

This dataset is intended for research, education, and non-commercial use in NLP, computational linguistics, and digital humanities. It may be used for training, fine-tuning, or evaluating models for Persian language processing, and for linguistic and literary analysis supporting cultural preservation.

Metadata

Language

Farsi, also known as Persian, is an Indo-European language spoken primarily in Iran, Afghanistan, Tajikistan, and Pakistan. It is the official language of Iran and one of the world's major languages. The language is written in two scripts: the Arabic script, and the Cyrillic script, which is used in Tajikistan and was adopted in 1939 from the Russian script.

Contents of the Corpus

The corpus contains the following books:

17 Qanoon-e-Muafaqiyat
Ishq Aab-e-Hayat Ast
Qullaha wa Darraha
Ilm Cheest, Falsafa Cheest
Sarzamin-e-Jamila
Afsanahai Karizma
Arabai Khudayan
Gurghai Dundar
Insan Dar Justujoi Maana
Masnawi-e-Ma‘nawi
Nahjul Balagha
Razhai Rawan Shinasi Tareekh
Taali

List of Letters

ا , ب , پ , ت , ث , ج , چ , ح , خ , د , ذ , ر , ز , ژ , س , ش , ص , ض , ط , ظ , ع , غ , ف , ق , ک , گ , ل , م , ن , و , ه , ی
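Persian text is often encoded with the Arabic code points for kaf (U+0643) and yeh (U+064A) instead of the Persian ones (U+06A9, U+06CC) that appear in the list above. The following is a minimal sketch of how a user might normalize those two characters and then check which letters in a text fall outside the 32-letter inventory. The function names and the choice of normalization are assumptions for illustration, not a documented preprocessing step of this corpus.

```python
# The 32 letters listed above, written as escapes so the exact
# code points are unambiguous.
PERSIAN_LETTERS = set(
    "\u0627\u0628\u067E\u062A\u062B\u062C\u0686\u062D"  # alef .. heh-jimi
    "\u062E\u062F\u0630\u0631\u0632\u0698\u0633\u0634"  # kheh .. shin
    "\u0635\u0636\u0637\u0638\u0639\u063A\u0641\u0642"  # sad .. qaf
    "\u06A9\u06AF\u0644\u0645\u0646\u0648\u0647\u06CC"  # kaf .. yeh
)

# Assumed mapping: Arabic yeh and kaf to their Persian counterparts.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian kaf
})


def normalize(text: str) -> str:
    """Replace Arabic kaf/yeh with the Persian code points."""
    return text.translate(ARABIC_TO_PERSIAN)


def letters_outside_inventory(text: str) -> set:
    """Return alphabetic characters not present in the letter list.

    Digits, punctuation, whitespace, and the zero-width non-joiner
    are ignored because str.isalpha() is False for them.
    """
    return {ch for ch in text if ch.isalpha() and ch not in PERSIAN_LETTERS}
```

Running `letters_outside_inventory(normalize(text))` on a file gives a quick view of any remaining characters, such as Latin letters or other Arabic-script variants, that a downstream tokenizer may need to handle.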

Sample Text

پرويز، جوان بيست و دو سالهی با جثهی نحيف و لاغر، با موهای ژوليده و ريش نا تراشيده كه زيبايی صورتش را كم میكرد و تا حدی بد مینماياند- بر روی سنگ بزرگی پا گذاشت كه در كنار رود تن از آب بيرون كرده بود و نيمهای بيشترش را بالاتر از سطح رودخانه به باد و باران سپرده بود. او ندانسته و ناخواسته برجای پای تيمورشاه- كه ششصد سال پيش با لشكرش به چغچران رسيده بود و بر سر همين سنگ ايستاده بود و از آنجا به دژ تسخير ناپذيرش نگاه كرده بود- ايستاد؛ و مانند تيمورشاه- كه دستشرا سايهبان چشمها كرده بود- به چغچران و حومهاش نگاه كرده بود- دستش را سايهبان چشمها كرد و از فراز سنگ بهسوی كوههای سر بهفلك كشيدهای نگاه كرد، كه گردا گرد چغچران را مانند زنجيری بههم پيوسته، احاطه كرده بود و هيچگاهی نخميده بود و با تن استوار سالها را پشت سر گذاشته بود و بدون هراس از باد، باران، برف، توفان و لشكر كشیها آمد و رفت حيات را نظاره كرده بود.