Persian Literature Corpus by Najwai Sukhan
License:
CC-BY-NC-4.0
Steward:
Anjuman e Katib
Task: NLP
Release Date: 3/23/2026
Format: TXT
Size: 38.62 MB
Description
The Persian Literature Corpus by Najwai Sukhan is a curated collection of Persian (Farsi) literary and educational texts created for research, computational use, and cultural preservation. It contains about 1.26 million tokens across 20 complete works spanning classical literature, poetry, modern prose, educational writing, philosophy, translations, and culturally rooted creative texts. Originally compiled in Microsoft Word format, the corpus was cleaned, normalized, and converted into UTF-8 plain text while preserving original orthography and style. Each file represents a complete work, making the dataset useful for both individual text analysis and broader corpus-level study. The corpus supports corpus linguistics, literary studies, digital humanities, NLP, and Persian language preservation.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
Users must properly attribute Najwai Sukhan, may not redistribute or alter the textual content without clear citation, and must comply with all applicable copyright regulations.
Forbidden Usage
- Hate speech generation
- Political propaganda or manipulation
- Cultural misrepresentation
- Disinformation campaigns
- Unauthorized commercial redistribution
Processes
Ethical Review
The corpus has been reviewed to ensure:
- Preservation of linguistic and cultural authenticity
- Respectful representation of literary works
- Integrity of original orthography
- Responsible research usage
Users are encouraged to conduct independent ethical review in accordance with institutional research standards.
Intended Use
- Academic linguistic research
- Persian literary studies
- NLP and language modeling
- Tokenization and morphological analysis
- Cultural and historical research
Metadata
Corpus Overview
Total Tokens: ~1.26M (1,260,397 tokens)
Total Files: 20
Writing Direction: Right-to-left
Encoding Standard: UTF-8
Language Family: Indo-European → Indo-Iranian → Iranian → Western Iranian
Domains of the Text
Literature (Creative writing)
Poetry (Aesthetic / cultural expression)
Folklore & Oral Tradition (Textual form)
Everyday Social Themes (as reflected in texts)
Cultural Knowledge & Heritage
Dataset Structure
The dataset contains 20 files.
Each file name matches the content inside (e.g., Story.txt, poem.txt, book.txt, translations.txt).
Treat each file as a separate genre/domain container.
Suitable for corpus linguistics, NLP pipelines, and literary analysis.
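Since each file is a complete work in UTF-8 plain text, the corpus can be read into memory one file at a time. The following is a minimal sketch, not an official loader: the function name load_corpus and the folder layout (all .txt files in a single directory) are assumptions made for illustration.

```python
from pathlib import Path


def load_corpus(root):
    """Read every .txt file directly under `root` as UTF-8.

    Returns a dict mapping file name -> full text of that work.
    `root` is a hypothetical local folder holding the 20 corpus files.
    """
    texts = {}
    # Sorting keeps the works in their numbered order (01-..., 02-..., ...).
    for path in sorted(Path(root).glob("*.txt")):
        texts[path.name] = path.read_text(encoding="utf-8")
    return texts
```

Keying the result by file name keeps each work separate, which matches the recommendation to treat every file as its own container.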
Data Format
Original: 20 files in Microsoft Word Document format (.doc/.docx)
Cleaned: Same 20 files after normalization (UTF-8 plain text)
Unicode normalization applied
White-space and punctuation cleanup
Removed stray symbols and markup
Cleaning (Clean Layer)
UTF-8 encoding standardization
Unicode normalization
White-space normalization
Punctuation cleanup
Removal of stray symbols and formatting artifacts
No alteration of original textual content
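The cleaning steps listed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the pipeline actually used for this corpus: the card does not say which Unicode normalization form was applied, so NFC is assumed here, and the function name clean_text is hypothetical. The sketch removes only control characters, so the zero-width non-joiner (U+200C) used in Persian orthography is left in place.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Illustrative cleanup: normalize Unicode and tidy whitespace."""
    # Assumption: NFC. It composes characters without rewriting
    # compatibility forms, so the original orthography is kept.
    text = unicodedata.normalize("NFC", raw)
    # Drop control characters (category Cc) except newline and tab.
    # ZWNJ (U+200C) is category Cf and therefore survives.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces, tabs, and no-break spaces into one space.
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    # Trim leading and trailing whitespace on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Reduce three or more consecutive newlines to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

NFKC is deliberately avoided in this sketch because it would rewrite Arabic presentation-form characters, which conflicts with the stated goal of leaving the textual content unaltered.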
File-Level Metadata
01-اثر مرکب_Asar-e-Murakkab by Daren Hardy - 59492.txt
02-ارینب دختر زیبای عرب_Arinab, Dukhtar-e-Zebai Arab by Abd-ur-Rahim - 6325.txt
03-ازدواج موفق_Izdiwaj-e-Muafaq by Fatima Shoaibi - 24029.txt
04-اسرار ذهن ثروتمند_Israr-e-Zehn-e-Siratmannd by Theharokar - 49847.txt
05-تلک خرس_Tilk-e-Khers by Muhammad Yosuf & Mark Edkin - 101956.txt
06-چهل گل پرپر_Chehl Gul-e-Parpar by Yonus Ibrahimi - 105538.txt
07-خود آگاهی_Khud Agahi by Osho - 4803.txt
08-خودشناسـي برائی خود خود سازی_Khudshinasi Barai Khudsazi by Muhammad Taqi Misbah Yazdi - 29892.txt
09-خود شناسی-خدا شناسی_Khudshinasi Khuda Shinasi - 10494.txt
10-روشﻫﺎي ﻧﻮﯾﻦ ﺗﺪرﯾﺲ_Rawishhai Nawin-e-Tadris by Zahra Muqarib - 7834.txt
11-عادتهای اتمی_Adathai Atomi by James Kler - 61535.txt
12-مادر_Madar by Kareem Natozi - 10848.txt
13-ﻣﻌﺟزه ﺷﮑرﮔزاری_Moujiza-e-Shukrguzari by Randa Brain - 41532.txt
14-ھدف_Hadaf by Brain Tresy - 8284.txt
15-وجودِ خُدا_Wajood-e-Khuda - 22696.txt
16-عِشق در راہ_Ishq Dar Rah by Kazeem Rashidi - 103500.txt
17-ملکہ_Malaka by Albert Green - 107479.txt
18-اساس آموزش انقلاب_Asas-e-Amozish Inqlab by Ahsan Tabree - 51639.txt
19-توانِ حال_Tawan-e-Hal by Echward Dolle - 70135.txt
20-کُلیات مولانا رومی_Kulliyat by Molana Rumi - 382539.txt
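The file names above appear to follow one pattern: a two-digit index, the Persian title, an underscore, a Latin transliteration, an optional "by" author, and a trailing number. The twenty trailing numbers sum to 1,260,397, the total token count reported in the Corpus Overview, so they are most likely per-file token counts. The sketch below parses a name under that assumption; the regular expression and the function name parse_filename are illustrative, not part of the dataset.

```python
import re

# Assumed pattern: "NN-<Persian title>_<Latin title>[ by <author>] - <count>.txt"
FILENAME_RE = re.compile(
    r"^(?P<index>\d{2})-(?P<title_fa>[^_]+)_(?P<title_latin>.+?)"
    r"(?: by (?P<author>.+?))? - (?P<tokens>\d+)\.txt$"
)


def parse_filename(name):
    """Split a corpus file name into its parts, or return None if it
    does not follow the assumed pattern."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    info = match.groupdict()
    info["index"] = int(info["index"])
    # Interpreting the trailing number as a token count is an inference
    # from the totals, not something the card states outright.
    info["tokens"] = int(info["tokens"])
    return info


info = parse_filename("07-خود آگاهی_Khud Agahi by Osho - 4803.txt")
print(info["index"], info["title_latin"], info["author"], info["tokens"])
```

For files without a named author, such as files 09 and 15, the author field comes back as None.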
Sample Text
مىشود و انسان حقيقت خويش را بى پرده مشاهده مىكند، در اين جا منظور
آﻣﻮزش ﯾﮑﯽ از ﻣﺴﺎﺋﻞ ﺑﺴﯿﺎر ﻣﻬﻢ ﻧﻈﺎم ﻫـﺎي ﺗﻌﻠـﯿﻢ و ﺗﺮﺑﯿـﺖ اﺳـﺖ. ﻣﻨﻈـﻮر از آﻣـﻮزش، ﻓﺮآﯾﻨـﺪ دو ﺳـﻮﯾﻪ ﯾﺎددﻫﯽ ـ ﯾﺎدﮔﯿﺮي اﻃﻼﻋﺎت، ﻣﻬﺎرت ﻫﺎ و ﻧﮕﺮﺷﻬﺎي ﻣﺜﺒﺖ درﺑﺎره ﻣﻮﺿﻮﻋﯽ اﺳـﺖ ﮐـﻪ ﻣﺘﻨﺎﺳـﺐ ﺑـﺎ ﮔـﺮوه ﺳـﻨﯽ ﺧﺎص و در ﺷﺮاﯾﻂ زﻣﺎﻧﯽ ﻣﻌﯿﻦ ﺑﻪ اﺟﺮا در آﻣﺪه اﺳﺖ.
اگر خدای متعال موی پلک و بدن و ابرو را مثل موی ریش، سر وغیره اعضای معین همیشـه رشد میداد، آیا در کوتاه و بلند کردن و مود کشـیدن آنهـا به چه تکلیف روبرو میبودیم؟
ﻣﯽ ﺣﺮف ﻣﺎن ﺑﺎﻃﻨﯽ ﻣﯿﻞ و دروﻧﯽ ﻧﮕﺮش ﺑﺮاﺳﺎس ﺑﭽﮕﯽ زﻣﺎن در ﮐﻪ اﺳﺖ اﯾﻦ ﺻﺪا از ﻣﻨﻈﻮرم .داﺷﺘﯿﻢ را ﺧﻮد ﺻﺪای اﺑﺘﺪا در ﻣﺎ ﻫﻤﻪ
تمام این افراد، تیمها و شرکتهایی که در مورد آنها صحبت کردیم، با موقعیتهای متفاوتی روبرو فرزندان کوهسار چون پلنگانآزاد از صخرهای به صخرهای صعود میکنند و چون عقابان بلندپرواز چشمانداز بیکران صحرایی را که دِینو میگوید درهصوف در انتهای آن