Tamir Sindhi News Corpus
License: CC-BY-NC-SA-4.0
Steward: Tamir News Agency
Task: NLP
Release Date: 11/9/2025
Format: TXT
Size: 2.56 MB
Description
The corpus contains 1.1 million tokens from the Tamir Sindhi Newspaper, published from 2022 to 2025. The text consists of the complete newspaper content, including headlines, editorials, finance news and advertisements. The newspaper is published daily in Karachi, Pakistan.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
The data cannot be used by any organization with an annual revenue of over 1 million USD.
Forbidden Usage
The text cannot be used to create disinformation, synthetic text corpora or hateful content.
Processes
Intended Use
The intended use of the data is to train LLMs and create NLP resources for Sindhi.
Metadata
Language
Sindhi (سِنڌِي, Sindhī, [sɪndʱi]) is an Indo-Aryan language spoken by the Sindhi people in the province of Sindh, Pakistan. It is the official language of the province and the mother tongue of over 34 million people in Pakistan and 1.7 million people in India.
Script
The following is the list of Sindhi letters:
ا ب ٻ ڀ ت ٿ ٽ ٺ ث پ ج ڄ جھ ڃ چ ڇ ح خ د ڌ ڏ ڊ ڍ ذ ر ڙ ز س ش ص ض ط ظ ع غ ف ڦ ق ڪ ک گ ڳ گھ ڱ ل م ن ڻ و ھ ء ي
Processing
The text contains the whole newspaper and may need some processing before it can be used for training purposes, including removal of alphanumeric and non-Sindhi characters and sentence segmentation.
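The cleanup described above could be sketched roughly as follows. This is a minimal illustration, not the steward's own pipeline. The function names `clean_text` and `split_sentences` are hypothetical. The code assumes that Sindhi text falls within the Unicode Arabic block (U+0600 to U+06FF), that digits of any kind should be dropped, and that sentences end with a full stop, exclamation mark, Arabic question mark or Arabic full stop. The letter list alone is not used as the filter because running text also contains characters outside it, such as ۾ and ۽.

```python
import re
from typing import List

# Assumption: everything outside the Arabic Unicode block, whitespace,
# and the two ASCII sentence marks "." and "!" is treated as non-Sindhi.
_NON_SINDHI = re.compile(r"[^\u0600-\u06FF\s.!]")
# Assumption: Arabic-Indic and Extended Arabic-Indic digits are also removed.
_ARABIC_DIGITS = re.compile(r"[\u0660-\u0669\u06F0-\u06F9]")
_WHITESPACE = re.compile(r"\s+")
# Assumption: sentence boundaries are ".", "!", "؟" (U+061F) or "۔" (U+06D4).
_SENTENCE_END = re.compile(r"[.!\u061F\u06D4]+")


def clean_text(text: str) -> str:
    """Replace non-Sindhi characters and digits with spaces, then collapse whitespace."""
    text = _NON_SINDHI.sub(" ", text)
    text = _ARABIC_DIGITS.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Split cleaned text on the assumed sentence-ending marks."""
    parts = _SENTENCE_END.split(text)
    return [part.strip() for part in parts if part.strip()]
```

A typical use would be to read the TXT file as UTF-8, pass its contents through `clean_text`, and then call `split_sentences` on the result. Because newspaper text mixes headlines, bylines and advertisements, the sentence boundaries produced this way are approximate and may need manual review.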
Sample
سنڌ ۾ ترقي جو سفر جاري آهي ۽ جاري رهندو هن چيو ته پ پ چيئرمين بلاول ڀٽو زرداري ۽ صدر آصف علي زرداري واضح ڪيو آهي ٺٽو / ملير (نمائيندن وٽان) سنڌ جي وڏي وزير مراد علي شاهه چيو آهي ته ڪالاباغ ڊيم اسلام آباد (مانيٽرنگ ڊيسڪ) وزيراعظم محمد شهباز شريف ملڪ جي پائيدار ترقي رکندڙ هزار ٽيوب ويلز جو منصوبو شروع ڪيو ويو، پنجاب ۽ سنڌ ۾ به ان حوالي سان ڪم جاري آهي
