NAWA-E-WATAN Balochi Newspaper Corpus
License:
CC-BY-NC-4.0
Steward:
Balochistan Educational and Cultural Organization
Task: NLP
Release Date: 2/10/2026
Format: TXT
Size: 1.43 MB
Description
The NAWA-E-WATAN Balochi Newspaper Corpus is a collection of contemporary Balochi journalistic text comprising approximately 1.02 million tokens. The corpus reflects modern written Balochi as used in daily news reporting, political coverage, social issues, editorials, and public discourse. The dataset represents General / Western Balochi (Rakhshani), the variety most commonly employed in print media and newspapers. It provides material for linguistic research, natural language processing (NLP), corpus linguistics, media studies, and the documentation of modern Balochi language usage.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
This dataset is intended solely for research, educational, and non-commercial purposes. Users must ensure appropriate citation and acknowledgment of the source when using this dataset in academic or research outputs.
Forbidden Usage
Any attempt to identify individuals mentioned in the dataset, infer personal/political/sensitive information, or use the data for surveillance, profiling, or other harmful activities is strictly forbidden.
Processes
Ethical Review
The data is being shared with the permission of all the relevant parties involved in data creation.
Intended Use
This dataset is intended for linguistic research, corpus-based studies, natural language processing, language modeling, information retrieval, media analysis, and the documentation of contemporary Balochi journalistic language.
Metadata
Language
Balochi (بلۏچی) is a Northwestern Iranian language spoken across Balochistan (Pakistan and Iran), parts of Afghanistan, and by diaspora communities worldwide. This corpus primarily represents General / Western Balochi (Rakhshani), widely used in journalism and public communication.
Source / Publisher
NAWA-E-WATAN (Balochi Newspaper)
Data Format
Plain text (.txt)
UTF-8 encoded
Unicode normalized
Domains of the Text
Journalism and news reporting
Politics and governance
Social and public affairs
Editorial and opinion writing
Cultural and regional reporting
Public information dissemination
Dataset Structure
Total files: 1
The corpus is stored as a single consolidated text file.
Treated as a single genre container (newspaper journalism).
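Because the corpus ships as one UTF-8 text file, a first sanity check can be done with the Python standard library alone. The sketch below is illustrative only: the file name `nawa_e_watan.txt` is hypothetical, whitespace splitting is an assumed tokenization (the card does not say how its token count was computed), and NFC is an assumed normalization form (the card says only "Unicode normalized").

```python
import unicodedata
from pathlib import Path


def load_corpus(path):
    """Read the single corpus file as UTF-8 text."""
    return Path(path).read_text(encoding="utf-8")


def count_tokens(text):
    """Count whitespace-delimited tokens (an assumed tokenization)."""
    return len(text.split())


def is_normalized(text, form="NFC"):
    """Check whether the text is already in the given Unicode
    normalization form (NFC is an assumption, not stated by the card)."""
    return unicodedata.is_normalized(form, text)


if __name__ == "__main__":
    corpus_path = Path("nawa_e_watan.txt")  # hypothetical file name
    if corpus_path.exists():
        corpus = load_corpus(corpus_path)
        print("tokens:", count_tokens(corpus))
        print("NFC-normalized:", is_normalized(corpus))
```

A whitespace count may differ from the figure quoted in the description if the stewards used a different tokenizer, so treat any mismatch as a prompt to check the method rather than as an error in the data.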
Text Cleaning and Processing
UTF-8 encoding
Unicode normalization
Whitespace and punctuation cleanup
Removal of stray symbols and markup
Original wording preserved; changes limited to the cleanup steps listed above
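The stewards' actual cleaning pipeline is not documented here. For users who want to apply comparable cleanup to additional Balochi text, the steps listed above can be approximated as in the sketch below. Everything in it is an assumption: the function name `clean_text` is hypothetical, NFC is an assumed normalization form, and "markup" is assumed to mean HTML-style tags.

```python
import re
import unicodedata


def clean_text(raw, form="NFC"):
    """Approximate the listed cleanup steps on a raw string.

    This is a sketch under stated assumptions, not the stewards' pipeline.
    """
    # Unicode normalization (form is an assumption)
    text = unicodedata.normalize(form, raw)
    # Replace HTML-style tags with a space (assumed meaning of "markup")
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of spaces, tabs, and no-break spaces
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    # Collapse blank lines and trim spaces around line breaks
    text = re.sub(r" *\n+ *", "\n", text)
    return text.strip()
```

The function leaves the words themselves untouched and only normalizes code points, strips tags, and tidies whitespace, which matches the card's statement that the original content is preserved.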
Sample Text
صدر آصف علی زرداری گلگت ءُ بلتستان ءِ آجوئی ءِ روچ ءِ مراگش ءَ بہر زورگ ءِ واستہ گلگت ءَ سر بوتگ کشمیر ءِ کار ءُ گلگت بلتستان ءِ وفاقی وزیر سیفرون ءِ انجینئر امیر مقام ءَ آجوئی ءِ روچ ءِ موہ ءَ گلگت بلتستان ءِ بہادریں اُلس ءَ را دل ءِ جہلانکی ءَ چہ مبارک بات دیگ ءِ وھد ءَ گوشت کہ وفاقی حکومت گلتیل بلتستان ءِ دیمروئی ءُ جوڑیشت ءَ را اولی ترجی ءَ دنت۔ سیاحت ءِ لحاظ ءَ دنیا ءِ تہا وتی جاہ ءَ زرتگ۔ شمبے ءِ روچ ءَ گلگت بلتستان ءِ آجوئی ءِ روچ ءِ موہ ءَ وتی ھاسیں کلوہ ءِ تہ ءَ وفاقی وزیر ءَ گوشت کہ گلگت بلتستان ءِ الس دو رند ءَ وتی وطن ءِ آجوئی ءِ روچ ءَ برجم دار اَنت یک رندے 14اگست ءُ دومی یکم نومبر ءَ اے روچ گلتستان ءِ راجدپتر ءِ تہ ءَ یک ھاسیں سنگ میل ءِ رنگ ءَ زانگ بیت۔ وفاقی وزیر ءُ پاکستان مسلم لیگ خیبر پختونخوا ءِ صدر انجینئر امیر مقام ءَ مسلم لیگ ن سندھ ءِ سینئر نائب صدر انجینئر امیر مقام ءَ اسلام آباد ءَ گوں گند ءُ نند کتگ۔ پاکستان ءُ ایران ءِ درئی کارانی وزیرانی گند ءُ نند، دو نیمگی سیادی ءَ را مھکم کنگ ءِ سر ءَ زور پر داتگ نیشنل ہائی وے اتھارٹی ءِ نیمگ ءَ 67 جاری ءُ 4 نوکیں منصوبگ آنی واستہ 226 ارب 981.65 ملین کلدار ایر کنگ بوتگ