Datasets
MEDIAMEN
Mediamen Punjabi Literature Corpus
This corpus is a collection of one million tokens of Western Punjabi text. The data was produced by the Mediamen publishing agency over the last ten...
Task: NLP
Format: TXT
License: CC-BY-NC-4.0
Size: 1.82 MB
Created: 11/9/2025
Locale: pnb
Chishti Sons
Chishti Sons Punjabi Literature Corpus
This corpus is a collection of more than one million tokens of Western Punjabi text. The data was produced by the Chishti Sons publishing agency. The ...
Task: NLP
Format: TXT
License: CC-BY-NC-4.0
Size: 1.65 MB
Created: 11/17/2025
Locale: pnb
Baloch Publishers Multan
Baloch Publishers Saraiki Literature Corpus
This corpus is a collection of one million tokens of Saraiki text. The data was produced by Baloch Publishers over the last ten years. The corpus ...
Task: NLP
Format: TXT
License: CC-BY-NC-4.0
Size: 2.04 MB
Created: 11/17/2025
Locale: skr
Weekly Kaleem Magazine Multan
Kaleem Magazine Urdu Corpus
This corpus is a collection of around 1.4 million tokens of Urdu text. The data was extracted from the archives of a famous Urdu magazine, "Kaleem", publis...
Task: NLP
Format: TXT
License: CC-BY-NC-4.0
Size: 2.74 MB
Created: 11/17/2025
Locale: urd
Keblagh e Azergi
Keblagh-e-Azergi Hazargi Literature Corpus
This corpus is a collection of more than one hundred thousand tokens of Hazargi text. The corpus contains works of literature, poems, folk and short stori...
Task: NLP
Format: TXT
License: CC-BY-NC-4.0
Size: 193.28 KB
Created: 12/3/2025
Locale: haz
Kaleem Art Press
Kaleem Art Press Urdu Literature Corpus
This corpus is a collection of 1.44 million tokens of Urdu text. The data was produced by Kaleem Art Press over the last fifteen years. The corp...
Task: OTH
Format: TXT
License: CC-BY-NC-4.0
Size: 2.85 MB
Created: 12/3/2025
Locale: ur
Rana Printers Multan
Rana Printers Urdu Literature Corpus
This corpus comprises 1.68 million tokens of high-quality Urdu text collected over the past decade through Rana Printers. It includes a diverse range of lite...
Task: OTH
Format: TXT
License: CC-BY-NC-4.0
Size: 3.00 MB
Created: 12/3/2025
Locale: ur
Kaleem Art Press
Kaleem Art Press Saraiki Literature Corpus
This corpus contains approximately one million tokens of Saraiki text curated over the past ten years by Kaleem Art Press. It features a wide range of litera...
Task: OTH
Format: TXT
License: CC-BY-NC-4.0
Size: 1.84 MB
Created: 12/3/2025
Locale: skr
Anjuman e Katib
Anjuman-e-Katib Farsi/Persian Literature Corpus
This corpus is a collection of more than one million tokens of Farsi/Persian text. The corpus contains works of literature including novels, fictional and...
Task: NLP
Format: TXT
License: CC-BY-NC-4.0
Size: 2.82 MB
Created: 12/3/2025
Locale: fas
Aim Foundation
Aim Foundation Dari Literature Corpus
This corpus is a collection of more than seven hundred thousand tokens of Dari text. The corpus contains works of literature including poems, stories, nov...
Task: NLP
Format: TXT
License: CC-BY-NC-4.0
Size: 1.74 MB
Created: 12/3/2025
Locale: prs
