Western Balochi Literature Corpus

License:

CC-BY-NC-4.0

Steward:

Balochistan Educational and Cultural Organization

Task: NLP

Release Date: 2/10/2026

Format: TXT

Size: 2.26 MB


Description

This dataset is a curated literary corpus of General/Western Balochi (Rakhshani) prepared by the Balochistan Educational and Cultural Organization (BECO), bringing together digitized UTF-8 texts across genres such as poetry, creative literature, folklore-based writing, research articles, academic theses, translations, and other written materials. It reflects authentic Balochi usage from traditional and modern sources and is intended to support language documentation, linguistic research, digital humanities, and NLP development for an under-resourced Iranian language.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

- Attribution to BECO and original authors is mandatory
- Commercial reuse without permission is not allowed
- Cultural and historical context must be respected

Forbidden Usage

- Hate speech or discriminatory content generation
- Political manipulation or propaganda
- Surveillance or profiling of communities
- Misrepresentation of authorship or cultural context

Processes

Ethical Review

The dataset is being shared with the permission of all relevant parties. For more information, please reach out to the point of contact.

Intended Use

- Linguistic and corpus-based research
- Literary and cultural studies
- NLP research for low-resource languages
- Language documentation and preservation
- Educational and academic use

Metadata

Language

Balochi (بلۏچی) is a Northwestern Iranian language spoken primarily across Balochistan (Pakistan and Iran), parts of Afghanistan, and in diaspora communities in the Gulf and beyond. This corpus mainly represents General / Western Balochi (Rakhshani), a widely used variety in literary and academic writing.

Writing Script:

ا ب پ ت ٹ ث ج چ ح خ د ذ ڑ ر ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ی ے ء
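As a rough illustration, the letter inventory above can be used to flag letters in a text that fall outside it. The Python sketch below is not part of the dataset; ALPHABET and off_inventory_letters are hypothetical names, and the inventory is copied from the list above. A letter being reported does not by itself mean it is an error, since the list may not cover every character used in the corpus.

```python
from collections import Counter

# Letter inventory copied from the "Writing Script" list in this card.
ALPHABET = set(
    "ا ب پ ت ٹ ث ج چ ح خ د ذ ڑ ر ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ی ے ء".split()
)


def off_inventory_letters(text: str) -> Counter:
    """Count alphabetic characters that are not in ALPHABET.

    Digits, punctuation, whitespace, and combining marks are ignored
    because str.isalpha() is False for them.
    """
    return Counter(ch for ch in text if ch.isalpha() and ch not in ALPHABET)
```

Calling off_inventory_letters on a file's text returns a Counter, so most_common() gives the most frequent unlisted letters first.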

Source / Publisher

Balochistan Educational and Cultural Organization (BECO)

Data Format

  • Plain text files (.txt)

  • UTF-8 encoded

  • Unicode normalized

  • Cleaned for whitespace and punctuation consistency
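A minimal Python sketch for checking that a file's contents match the format described above. The card does not state which Unicode normalization form was applied, so NFC is an assumption here, and check_text is a hypothetical helper rather than a tool shipped with the dataset.

```python
import unicodedata


def check_text(raw: bytes, form: str = "NFC") -> dict:
    """Decode bytes as UTF-8 and report basic format facts.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    The normalization form is an assumption; pass another form
    ("NFD", "NFKC", "NFKD") if the corpus turns out to use it.
    """
    text = raw.decode("utf-8")
    return {
        "chars": len(text),
        "normalized": unicodedata.is_normalized(form, text),
    }
```

For a file on disk, the bytes can be obtained with pathlib.Path(path).read_bytes() and passed to check_text.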

Domains of the Text

  • Literature (creative writing)

  • Poetry

  • Folklore and oral tradition (textual form)

  • Cultural and historical writing

  • Research articles and academic writing

  • Translations and scholarly texts

Dataset Structure

  • Total files: 4

  • Each file represents a distinct genre or functional category.

  • Files are treated as independent genre containers.
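Since each file is an independent genre container, one simple way to work with the corpus is to load every file into a mapping keyed by file name. The sketch below assumes the files sit together in one directory with a .txt extension; the file names are not given in this card, so load_genres and corpus_dir are hypothetical.

```python
from pathlib import Path


def load_genres(corpus_dir: str) -> dict:
    """Return {file stem: full text} for every .txt file in corpus_dir.

    Each file is read as UTF-8, matching the stated data format.
    """
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(corpus_dir).glob("*.txt"))
    }
```

Keeping the genres in separate entries preserves the genre boundary, so per-genre statistics or train/test splits can be computed without mixing categories.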

Text Cleaning and Processing

  • UTF-8 encoding

  • Unicode normalization

  • Removal of stray symbols and markup

  • Basic punctuation and whitespace cleanup

  • No semantic alteration of original content
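The steps above can be approximated with a short Python sketch. This is not BECO's actual pipeline: the NFC normalization form, the crude tag-stripping pattern, and the whitespace rules are all assumptions chosen for illustration, and clean_text is a hypothetical name.

```python
import re
import unicodedata


def clean_text(text: str, form: str = "NFC") -> str:
    """Apply normalization plus basic markup and whitespace cleanup."""
    # Unicode normalization (form is an assumption, not stated in the card).
    text = unicodedata.normalize(form, text)
    # Crude removal of HTML/XML-style tags; real markup may need a parser.
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of spaces, tabs, and no-break spaces into one space.
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    # Trim spaces around line breaks.
    text = re.sub(r" *\n *", "\n", text)
    # Allow at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The function only touches markup, whitespace, and code point composition, which is in keeping with the stated goal of not altering the meaning of the original content.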

Sample Text

بلوچی زبان ءُ لبزانک وہدے ماں وانگ ءُ نبشتہی رنگء َ دیم ءَ آیاں بیت داں اے کاروان ءِ تہ ءَ روکپتی بلوچستان ءِ بازیں مردم ءِ وت ءَ ہمے رُمب ءِ توک ءَ ھوار گیجیت،بلوچی لبزانک ءِ ھمے کاروان ءَ اشرف سربازی ھم ھوار بیت ءُ بلوچی زبان ءِ بازیں پہناتانی سرا کار کنت ءُ وت ءَ ماں لبزانک ءَ نمیراں کنت۔ چہ امرِتسر ءَ گِچینیں ریل بیگاہ ءِ دو اَدار ءَ درآتک چہ ھشت ادار ءَ پد مُگل پورہ ءَ سر بوت، راہ ءَ گپڑے مردم جنگ بوت، بازینے ٹپّی ءُ لھتیں شِنگ ءُ شانگ بوت۔ ’’بلوچی یک ایرانی زبانے، اے رودراتکی ایرانی شاخءَ گوں سیادی داریت ءُ ا ے اوستا ءَ چہ گیش گوں کہنین فارسی ءَ سیادی داریت‘‘۔ بلوچی زبان ءِ تامداریں گالوار رخشانی گالوارءِہند ءُ دمگ خاران ءَ چے سیادی داروکیں نوکتریں شاعرگوں وتی سادگیں ءُ تچکیں ھیالاں اُلسی شاعری ءِ یک حاصیں بہرے جہتءَ پہ وتءَ جوانیں جاہ ءِ جوڑ کنوک قیوم سادگ 8 مارچ 1978 ءَ ماں خاران ءَ ودی بوتگ۔بنداتی وانگ شہ خاران ءِ وانگجاہ ءَدہمی تبک 1994ء ءِ سالءَ دربرتگ، ءُ پدا شہ خاران ءِ کالجءَبی۔اے چکاسءَ سوبیں بوتگ ھستیں زمانگ ءَماں قیوم سادگ بلوچی زبان ءِ نامداریں شاعرے زانگ بیت۔