NAWA-E-WATAN Balochi Newspaper Corpus

License:

CC-BY-NC-4.0

Steward:

Balochistan Educational and Cultural Organization

Task: NLP

Release Date: 2/10/2026

Format: TXT

Size: 1.43 MB


Description

The NAWA-E-WATAN Balochi Newspaper Corpus is a large-scale collection of contemporary Balochi journalistic text comprising approximately 1.02 million tokens. The corpus reflects modern written Balochi as used in daily news reporting, political coverage, social issues, editorials, and public discourse. It represents General / Western Balochi (Rakhshani), the variety most commonly used in print media and newspapers. It provides material for linguistic research, natural language processing (NLP), corpus linguistics, media studies, and the documentation of modern Balochi usage.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is intended solely for research, educational, and non-commercial purposes. Users must ensure appropriate citation and acknowledgment of the source when using this dataset in academic or research outputs.

Forbidden Usage

Any attempt to identify individuals mentioned in the dataset, infer personal/political/sensitive information, or use the data for surveillance, profiling, or other harmful activities is strictly forbidden.

Processes

Ethical Review

The data is being shared with the permission of all the relevant parties involved in data creation.

Intended Use

This dataset is intended for linguistic research, corpus-based studies, natural language processing, language modeling, information retrieval, media analysis, and the documentation of contemporary Balochi journalistic language.

Metadata

Language

Balochi (بلۏچی) is a Northwestern Iranian language spoken across Balochistan (Pakistan and Iran), parts of Afghanistan, and by diaspora communities worldwide. This corpus primarily represents General / Western Balochi (Rakhshani), widely used in journalism and public communication.

Source / Publisher

NAWA-E-WATAN (Balochi Newspaper)

Data Format

  • Plain text (.txt)

  • UTF-8 encoded

  • Unicode normalized
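
The format properties above can be checked after download. The following is a minimal sketch, not part of the dataset's tooling: the helper names and the file name in the usage note are hypothetical, the normalization form is assumed to be NFC (the card does not state which form was applied), and whitespace splitting is only a rough approximation of however the published token count was obtained.

```python
import unicodedata
from pathlib import Path


def read_corpus(path, encoding="utf-8"):
    """Read the whole corpus file; raises UnicodeDecodeError if it is not valid UTF-8."""
    return Path(path).read_text(encoding=encoding)


def describe_text(text, form="NFC"):
    """Return basic statistics for a text string.

    `form` is an assumption: the dataset card says "Unicode normalized"
    without naming the form. Requires Python 3.8+ for is_normalized().
    """
    tokens = text.split()  # whitespace tokenization, an approximation only
    return {
        "characters": len(text),
        "whitespace_tokens": len(tokens),
        "is_normalized": unicodedata.is_normalized(form, text),
    }
```

Usage, assuming the single text file has been saved locally as corpus.txt (a placeholder name): `describe_text(read_corpus("corpus.txt"))`. If `is_normalized` comes back False, trying `form="NFKC"` may show which form was actually used.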

Domains of the Text

  • Journalism and news reporting

  • Politics and governance

  • Social and public affairs

  • Editorial and opinion writing

  • Cultural and regional reporting

  • Public information dissemination

Dataset Structure

  • Total files: 1

  • The corpus is stored as a single consolidated text file.

  • Treated as a single genre container (newspaper journalism).

Text Cleaning and Processing

  • UTF-8 encoding

  • Unicode normalization

  • Whitespace and punctuation cleanup

  • Removal of stray symbols and markup

  • Original wording preserved; no changes made beyond the cleanup steps listed here
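
For readers who want to apply comparable cleanup to additional Balochi text, the steps listed above might look roughly like the sketch below. It is an illustration under stated assumptions, not the pipeline used to build this corpus: the function name, the choice of NFC, and the definition of "stray symbols" (control characters and the U+FFFD replacement character) are all assumptions. Punctuation cleanup is not covered because the card does not describe its rules.

```python
import re
import unicodedata

# Assumption: "markup" means HTML/XML-style tags.
TAG_RE = re.compile(r"<[^>]+>")
# Runs of spaces, tabs, and no-break spaces collapse to one space.
SPACE_RE = re.compile(r"[ \t\u00a0]+")


def clean_line(line, form="NFC"):
    """Apply normalization, markup removal, and whitespace cleanup to one line."""
    line = unicodedata.normalize(form, line)
    line = TAG_RE.sub(" ", line)
    # Drop control characters (category Cc) except tab. Format characters
    # (category Cf) such as the zero-width non-joiner are kept on purpose,
    # because they can be meaningful in Arabic-script text.
    line = "".join(
        ch for ch in line if unicodedata.category(ch) != "Cc" or ch == "\t"
    )
    line = line.replace("\ufffd", "")  # leftover decoding-error marker
    line = SPACE_RE.sub(" ", line)
    return line.strip()
```

The function leaves letters, diacritics, and digits untouched, in keeping with the card's statement that the original content is preserved.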

Sample Text

صدر آصف علی زرداری گلگت ءُ بلتستان ءِ آجوئی ءِ روچ ءِ مراگش ءَ بہر زورگ ءِ واستہ گلگت ءَ سر بوتگ کشمیر ءِ کار ءُ گلگت بلتستان ءِ وفاقی وزیر سیفرون ءِ انجینئر امیر مقام ءَ آجوئی ءِ روچ ءِ موہ ءَ گلگت بلتستان ءِ بہادریں اُلس ءَ را دل ءِ جہلانکی ءَ چہ مبارک بات دیگ ءِ وھد ءَ گوشت کہ وفاقی حکومت گلتیل بلتستان ءِ دیمروئی ءُ جوڑیشت ءَ را اولی ترجی ءَ دنت۔ سیاحت ءِ لحاظ ءَ دنیا ءِ تہا وتی جاہ ءَ زرتگ۔ شمبے ءِ روچ ءَ گلگت بلتستان ءِ آجوئی ءِ روچ ءِ موہ ءَ وتی ھاسیں کلوہ ءِ تہ ءَ وفاقی وزیر ءَ گوشت کہ گلگت بلتستان ءِ الس دو رند ءَ وتی وطن ءِ آجوئی ءِ روچ ءَ برجم دار اَنت یک رندے 14اگست ءُ دومی یکم نومبر ءَ اے روچ گلتستان ءِ راجدپتر ءِ تہ ءَ یک ھاسیں سنگ میل ءِ رنگ ءَ زانگ بیت۔ وفاقی وزیر ءُ پاکستان مسلم لیگ خیبر پختونخوا ءِ صدر انجینئر امیر مقام ءَ مسلم لیگ ن سندھ ءِ سینئر نائب صدر انجینئر امیر مقام ءَ اسلام آباد ءَ گوں گند ءُ نند کتگ۔ پاکستان ءُ ایران ءِ درئی کارانی وزیرانی گند ءُ نند، دو نیمگی سیادی ءَ را مھکم کنگ ءِ سر ءَ زور پر داتگ نیشنل ہائی وے اتھارٹی ءِ نیمگ ءَ 67 جاری ءُ 4 نوکیں منصوبگ آنی واستہ 226 ارب 981.65 ملین کلدار ایر کنگ بوتگ