Persian Literature Corpus by Najwai Sukhan

License:

CC-BY-NC-4.0

Steward:

Anjuman e Katib

Task: NLP

Release Date: 3/23/2026

Format: TXT

Size: 38.62 MB


Description

The Persian Literature Corpus by Najwai Sukhan is a curated collection of Persian (Farsi) literary and educational texts created for research, computational use, and cultural preservation. It contains about 1.26 million tokens across 20 complete works spanning classical literature, poetry, modern prose, educational writing, philosophy, translations, and culturally rooted creative texts. Originally compiled in Microsoft Word format, the corpus was cleaned, normalized, and converted into UTF-8 plain text while preserving original orthography and style. Each file represents a complete work, making the dataset useful for both individual text analysis and broader corpus-level study. The corpus supports corpus linguistics, literary studies, digital humanities, NLP, and Persian language preservation.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

Users must properly attribute Najwai Sukhan, may not redistribute or alter the textual content without clear citation, and must comply with all applicable copyright regulations.

Forbidden Usage

  • Hate speech generation

  • Political propaganda or manipulation

  • Cultural misrepresentation

  • Disinformation campaigns

  • Unauthorized commercial redistribution

Processes

Ethical Review

The corpus has been reviewed to ensure:

  • Preservation of linguistic and cultural authenticity

  • Respectful representation of literary works

  • Integrity of original orthography

  • Responsible research usage

Users are encouraged to conduct independent ethical review in accordance with institutional research standards.

Intended Use

  • Academic linguistic research

  • Persian literary studies

  • NLP and language modeling

  • Tokenization and morphological analysis

  • Cultural and historical research

Metadata

Corpus Overview

  • Total Tokens: ~1.26M (1,260,397 tokens)

  • Total Files: 20

  • Writing Direction: Right-to-left

  • Encoding Standard: UTF-8

  • Language Family: Indo-European → Indo-Iranian → Iranian → Western Iranian

Domains of the Text

  • Literature (Creative writing)

  • Poetry (Aesthetic / cultural expression)

  • Folklore & Oral Tradition (Textual form)

  • Everyday Social Themes (as reflected in texts)

  • Cultural Knowledge & Heritage

Dataset Structure

  • The dataset contains 20 files.

  • Each file name matches the content inside (e.g., Story.txt, poem.txt, book.txt, translations.txt).

  • Treat each file as a separate genre/domain container.

  • Suitable for corpus linguistics, NLP pipelines, and literary analysis.
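
A minimal sketch of how the files might be read for corpus-level work, assuming they sit together in one local directory. The directory argument, the function names `load_corpus` and `whitespace_token_count`, and the whitespace tokenizer are illustrative assumptions; the tokenizer used to produce the published token counts is not documented here, so counts from this sketch may differ.

```python
from pathlib import Path


def load_corpus(corpus_dir):
    """Return a dict mapping each .txt file name to its full text."""
    texts = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        # The corpus is described as UTF-8 plain text, one complete work per file.
        texts[path.name] = path.read_text(encoding="utf-8")
    return texts


def whitespace_token_count(text):
    """Count whitespace-separated tokens (an assumed, simplistic tokenizer)."""
    return len(text.split())
```

Keeping one dictionary entry per file preserves the "one complete work per file" structure, so per-work and whole-corpus statistics can both be computed from the same object.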

Data Format

  • Original: 20 files in Microsoft Word Document format (.doc/.docx)

  • Cleaned: Same 20 files after normalization (UTF-8 plain text)

  • Unicode normalization applied

  • White-space and punctuation cleanup

  • Removed stray symbols and markup

Cleaning (Clean Layer)

  • UTF-8 encoding standardization

  • Unicode normalization

  • White-space normalization

  • Punctuation cleanup

  • Removal of stray symbols and formatting artifacts

  • No alteration of original textual content
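
The steps above could be approximated as follows. This is a sketch under stated assumptions, not the pipeline actually used for the corpus: the normalization form (NFC), the set of characters removed, and the function name `clean_text` are all guesses. NFC is chosen because the sample text still contains Arabic presentation-form characters, which suggests compatibility forms were not folded. Punctuation cleanup is not covered, since its rules are not described.

```python
import re
import unicodedata


def clean_text(raw):
    """Apply an assumed version of the cleaning steps to one document."""
    # NFC composes characters without folding compatibility forms,
    # so the original spelling variants are left as they are.
    text = unicodedata.normalize("NFC", raw)
    # Remove a stray byte-order mark and unify line endings.
    text = text.replace("\ufeff", "")
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control characters (category Cc) except newline and tab.
    # Format characters such as the zero-width non-joiner (U+200C) are
    # kept, because Persian orthography relies on them.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces, tabs and no-break spaces into one space.
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    # Trim spaces around line breaks and allow at most one blank line.
    text = re.sub(r" *\n *", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Only whitespace, control characters and the byte-order mark are touched, in keeping with the stated goal of leaving the textual content unaltered.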

File-Level Metadata

  • 01-اثر مرکب_Asar-e-Murakkab by Daren Hardy - 59492.txt

  • 02-ارینب دختر زیبای عرب_Arinab, Dukhtar-e-Zebai Arab by Abd-ur-Rahim - 6325.txt

  • 03-ازدواج موفق_Izdiwaj-e-Muafaq by Fatima Shoaibi - 24029.txt

  • 04-اسرار ذهن ثروتمند_Israr-e-Zehn-e-Siratmannd by Theharokar - 49847.txt

  • 05-تلک خرس_Tilk-e-Khers by Muhammad Yosuf & Mark Edkin - 101956.txt

  • 06-چهل گل پرپر_Chehl Gul-e-Parpar by Yonus Ibrahimi - 105538.txt

  • 07-خود آگاهی_Khud Agahi by Osho - 4803.txt

  • 08-خودشناسـي برائی خود خود سازی_Khudshinasi Barai Khudsazi by Muhammad Taqi Misbah Yazdi - 29892.txt

  • 09-خود شناسی-خدا شناسی_Khudshinasi Khuda Shinasi - 10494.txt

  • 10-روشﻫﺎي ﻧﻮﯾﻦ ﺗﺪرﯾﺲ_Rawishhai Nawin-e-Tadris by Zahra Muqarib - 7834.txt

  • 11-عادتهای اتمی_Adathai Atomi by James Kler - 61535.txt

  • 12-مادر_Madar by Kareem Natozi - 10848.txt

  • 13-ﻣﻌﺟزه ﺷﮑرﮔزاری_Moujiza-e-Shukrguzari by Randa Brain - 41532.txt

  • 14-ھدف_Hadaf by Brain Tresy - 8284.txt

  • 15-وجودِ خُدا_Wajood-e-Khuda - 22696.txt

  • 16-عِشق در راہ_Ishq Dar Rah by Kazeem Rashidi - 103500.txt

  • 17-ملکہ_Malaka by Albert Green - 107479.txt

  • 18-اساس آموزش انقلاب_Asas-e-Amozish Inqlab by Ahsan Tabree - 51639.txt

  • 19-توانِ حال_Tawan-e-Hal by Echward Dolle - 70135.txt

  • 20-کُلیات مولانا رومی_Kulliyat by Molana Rumi - 382539.txt
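
The file names above appear to follow one pattern: a two-digit index, the Persian title, an underscore, a romanized title, an optional "by" author, and a trailing number. That trailing number looks like a per-file token count, since the 20 values listed sum to 1,260,397, the total stated in the Corpus Overview. The sketch below parses a name under that assumed pattern; the regular expression, the field names, and the function `parse_file_name` are illustrative, not part of the dataset.

```python
import re

# Assumed pattern: "NN-<title>_<romanized title>[ by <author>] - <count>.txt"
FILE_NAME = re.compile(
    r"^(?P<index>[0-9]{2})-(?P<title>[^_]+)_(?P<rest>.+) - (?P<tokens>[0-9]+)\.txt$"
)


def parse_file_name(name):
    """Split a corpus file name into its assumed metadata fields."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    # The author is optional; two of the listed files have none.
    latin_title, sep, author = match.group("rest").partition(" by ")
    return {
        "index": int(match.group("index")),
        "title": match.group("title"),
        "latin_title": latin_title,
        "author": author if sep else None,
        "tokens": int(match.group("tokens")),
    }


# Trailing numbers copied from the listing above, in file order.
TOKEN_COUNTS = [
    59492, 6325, 24029, 49847, 101956, 105538, 4803, 29892, 10494, 7834,
    61535, 10848, 41532, 8284, 22696, 103500, 107479, 51639, 70135, 382539,
]

print(sum(TOKEN_COUNTS))  # 1260397, matching the stated corpus total
print(parse_file_name("07-خود آگاهی_Khud Agahi by Osho - 4803.txt")["author"])  # Osho
```

Parsing the count from the name gives a quick consistency check after download: a file whose own token count is far from the number in its name may have been truncated or re-encoded.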

Sample Text

  • مىشود و انسان حقيقت خويش را بى پرده مشاهده مىكند، در اين جا منظور

  • آﻣﻮزش ﯾﮑﯽ از ﻣﺴﺎﺋﻞ ﺑﺴﯿﺎر ﻣﻬﻢ ﻧﻈﺎم ﻫـﺎي ﺗﻌﻠـﯿﻢ و ﺗﺮﺑﯿـﺖ اﺳـﺖ. ﻣﻨﻈـﻮر از آﻣـﻮزش، ﻓﺮآﯾﻨـﺪ دو ﺳـﻮﯾﻪ ﯾﺎددﻫﯽ ـ ﯾﺎدﮔﯿﺮي اﻃﻼﻋﺎت، ﻣﻬﺎرت ﻫﺎ و ﻧﮕﺮﺷﻬﺎي ﻣﺜﺒﺖ درﺑﺎره ﻣﻮﺿﻮﻋﯽ اﺳـﺖ ﮐـﻪ ﻣﺘﻨﺎﺳـﺐ ﺑـﺎ ﮔـﺮوه ﺳـﻨﯽ ﺧﺎص و در ﺷﺮاﯾﻂ زﻣﺎﻧﯽ ﻣﻌﯿﻦ ﺑﻪ اﺟﺮا در آﻣﺪه اﺳﺖ.

  • اگر خدای متعال موی پلک و بدن و ابرو را مثل موی ریش، سر وغیره اعضای معین همیشـه رشد میداد، آیا در کوتاه و بلند کردن و مود کشـیدن آنهـا به چه تکلیف روبرو میبودیم؟

  • ﻣﯽ ﺣﺮف ﻣﺎن ﺑﺎﻃﻨﯽ ﻣﯿﻞ و دروﻧﯽ ﻧﮕﺮش ﺑﺮاﺳﺎس ﺑﭽﮕﯽ زﻣﺎن در ﮐﻪ اﺳﺖ اﯾﻦ ﺻﺪا از ﻣﻨﻈﻮرم .داﺷﺘﯿﻢ را ﺧﻮد ﺻﺪای اﺑﺘﺪا در ﻣﺎ ﻫﻤﻪ

  • تمام این افراد، تیمها و شرکتهایی که در مورد آنها صحبت کردیم، با موقعیتهای متفاوتی روبرو فرزندان کوهسار چون پلنگانآزاد از صخرهای به صخرهای صعود میکنند و چون عقابان بلندپرواز چشمانداز بیکران صحرایی را که دِینو میگوید درهصوف در انتهای آن