Datasets

MEDIAMEN

Mediamen Punjabi Literature Corpus

This corpus is a collection of one million tokens of Western Punjabi. The data was produced by the Mediamen publishing agency over the last ten...

Task: NLP

Format: TXT

License: CC-BY-NC-4.0

Size: 1.82 MB

Created: 11/9/2025

Locale: pnb

Chishti Sons

Chishti Sons Punjabi Literature Corpus

This corpus is a collection of more than one million tokens of Western Punjabi. The data was produced by the Chishti Sons publishing agency. The ...

Task: NLP

Format: TXT

License: CC-BY-NC-4.0

Size: 1.65 MB

Created: 11/17/2025

Locale: pnb

Baloch Publishers Multan

Baloch Publishers Saraiki Literature Corpus

This corpus is a collection of one million tokens of Saraiki. The data was produced by Baloch Publishers over the last ten years. The corpus ...

Task: NLP

Format: TXT

License: CC-BY-NC-4.0

Size: 2.04 MB

Created: 11/17/2025

Locale: skr

Weekly Kaleem Magazine Multan

Kaleem Magazine Urdu Corpus

This corpus is a collection of around 1.4 million tokens of Urdu. The data was extracted from the archives of the well-known Urdu magazine "Kaleem" publis...

Task: NLP

Format: TXT

License: CC-BY-NC-4.0

Size: 2.74 MB

Created: 11/17/2025

Locale: urd

Keblagh e Azergi

Keblagh-e-Azergi Hazargi literature corpus

This corpus is a collection of more than one hundred thousand tokens of the Hazargi language. The corpus contains works of literature, poems, folk and short stori...

Task: NLP

Format: TXT

License: CC-BY-NC-4.0

Size: 193.28 KB

Created: 12/3/2025

Locale: haz

Kaleem Art Press

Kaleem Art Press Urdu Literature Corpus

This corpus is a collection of 1.44 million tokens of Urdu. The data was produced by Kaleem Art Press over the last fifteen years. The corp...

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 2.85 MB

Created: 12/3/2025

Locale: ur

Rana Printers Multan

Rana Printers Urdu Literature Corpus

This corpus comprises 1.68 million tokens of high-quality Urdu text collected over the past decade through Rana Printers. It includes a diverse range of lite...

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 3.00 MB

Created: 12/3/2025

Locale: ur

Kaleem Art Press

Kaleem Art Press Saraiki Literature Corpus

This corpus contains approximately one million tokens of Saraiki text curated over the past ten years by Kaleem Art Press. It features a wide range of litera...

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 1.84 MB

Created: 12/3/2025

Locale: skr

Anjuman e Katib

Anjuman-e-Katib Farsi/Persian Literature Corpus

This corpus is a collection of more than one million tokens of Farsi/Persian. The corpus contains works of literature, including novels, fictional and...

Task: NLP

Format: TXT

License: CC-BY-NC-4.0

Size: 2.82 MB

Created: 12/3/2025

Locale: fas

Aim Foundation

Aim Foundation Dari Literature Corpus

This corpus is a collection of more than seven hundred thousand tokens of Dari. The corpus contains works of literature, including poems, stories, nov...

Task: NLP

Format: TXT

License: CC-BY-NC-4.0

Size: 1.74 MB

Created: 12/3/2025

Locale: prs