Jazab Sindhi Newspaper Corpus
License:
CC-BY-NC-SA-4.0
Steward:
Jazab Publishers
Task: NLP
Release Date: 11/9/2025
Format: TXT
Size: 2.33 MB
Description
The corpus contains 1.07 million tokens from Jazab, a Sindhi newspaper, covering issues published from 2023 to 2025. The text consists of the complete newspaper content, including headlines, editorials, finance news, and advertisements. The newspaper is published daily in Karachi, Pakistan.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
The data cannot be used by any organization with annual revenue of over 1 million USD.
Forbidden Usage
The text cannot be used to create disinformation, synthetic text corpora or hateful content.
Processes
Intended Use
The data is intended for training LLMs and creating NLP resources for Sindhi.
Metadata
Language
Sindhi (سِنڌِي, Sindhī, [sɪndʱi]) is an Indo-Aryan language spoken by the Sindhi people in the province of Sindh, Pakistan. It is the official language of the province and the mother tongue of over 34 million people in Pakistan and 1.7 million people in India.
Script
The following is the list of Sindhi letters:
ا ب ٻ ڀ ت ٿ ٽ ٺ ث پ ج ڄ جھ ڃ چ ڇ ح خ د ڌ ڏ ڊ ڍ ذ ر ڙ ز س ش ص ض ط ظ ع غ ف ڦ ق ڪ ک گ ڳ گھ ڱ ل م ن ڻ و ھ ء ي
Processing
The text contains the whole newspaper and may need some processing before it is used for training, including removing alphanumeric and non-Sindhi characters and parsing the text into sentences.
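The cleaning steps mentioned above could be sketched roughly as follows. This is a minimal illustration and not the steward's own pipeline. The function names (`split_sentences`, `clean_sentence`, `preprocess`) are hypothetical, and the sketch assumes that "Sindhi characters" can be approximated by the Unicode Arabic block (U+0600 to U+06FF) and that sentences end at a full stop, question mark, or exclamation mark.

```python
import re

# Assumption: the Unicode Arabic block (U+0600-U+06FF) approximates the Sindhi
# script. It also contains Arabic punctuation and Arabic-Indic digits, so a
# stricter filter may be needed depending on the use case.
NON_SINDHI = re.compile(r"[^\u0600-\u06FF\s]")

# Assumption: sentences end at ".", "!", "?", the Arabic full stop (U+06D4),
# or the Arabic question mark (U+061F). Splitting on "." also breaks decimal
# numbers, which is acceptable here because ASCII digits are removed anyway.
SENTENCE_END = re.compile(r"[.!?\u06D4\u061F]+")


def split_sentences(text):
    """Split raw text into non-empty sentence candidates."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


def clean_sentence(sentence):
    """Replace characters outside the Arabic block with spaces, then collapse whitespace."""
    cleaned = NON_SINDHI.sub(" ", sentence)
    return " ".join(cleaned.split())


def preprocess(text):
    """Return a list of cleaned sentences, dropping any that end up empty."""
    cleaned = (clean_sentence(s) for s in split_sentences(text))
    return [s for s in cleaned if s]
```

Sentences are split before cleaning because the full stop itself falls outside the Arabic block and would otherwise be removed first. The sample text also shows sentences that run together without a space after the full stop, which splitting on the punctuation mark alone handles.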
Sample
قانون لاڳو ڪندڙ ادارن دهشتگردن ۽ انهن جي سهولتڪارن خلاف 59 هزار 775 مختلف ڪامياب انٽيليجنس بيسڊ آپريشن ڪامياب ڪارروائين دوران 925 دهشتگردن کي جهنم ڀيڙو ڪيو ويو، جڏهن ته ڪيترن ئي دهشتگردن کي گرفتار پڻ ڪيو ويو.دهشتگردي جي ناسور جي خاتمي لاءِ پاڪ فوج، انٽيليجنس، پوليس ۽ ٻين قانون لاڳو ڪندڙ ادارن طرفان روزانو 169 کان وڌيڪ آپريشن ڪيا پيا وڃن.هلندڙ سال آپريشنز دوران 73 انتهائي گهربل دهشتگرد ماريا ويا، جن ۾ فدا الرحمان عرف لال،زوب ڊويزن
