Western Balochi Literature Corpus
License:
CC-BY-NC-4.0
Steward:
Balochistan Educational and Cultural Organization
Task: NLP
Release Date: 2/10/2026
Format: TXT
Size: 2.26 MB
Description
This dataset is a curated literary corpus of General/Western Balochi (Rakhshani) prepared by the Balochistan Educational and Cultural Organization (BECO), bringing together digitized UTF-8 texts across genres such as poetry, creative literature, folklore-based writing, research articles, academic theses, translations, and other written materials. It reflects authentic Balochi usage from traditional and modern sources and is intended to support language documentation, linguistic research, digital humanities, and NLP development for an under-resourced Iranian language.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
- Attribution to BECO and original authors is mandatory
- Commercial reuse without permission is not allowed
- Cultural and historical context must be respected
Forbidden Usage
- Hate speech or discriminatory content generation
- Political manipulation or propaganda
- Surveillance or profiling of communities
- Misrepresentation of authorship or cultural context
Processes
Ethical Review
The dataset is being shared with the permission of all relevant parties. For more information, please reach out to the point of contact.
Intended Use
- Linguistic and corpus-based research
- Literary and cultural studies
- NLP research for low-resource languages
- Language documentation and preservation
- Educational and academic use
Metadata
Language
Balochi (بلۏچی) is a Northwestern Iranian language spoken primarily across Balochistan (Pakistan and Iran), parts of Afghanistan, and in diaspora communities in the Gulf and beyond. This corpus mainly represents General / Western Balochi (Rakhshani), a widely used variety in literary and academic writing.
Writing Script:
ا ب پ ت ٹ ث ج چ ح خ د ذ ڑ ر ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ی ے ء
Source / Publisher
Balochistan Educational and Cultural Organization (BECO)
Data Format
Plain text files (.txt)
UTF-8 encoded
Unicode normalized
Cleaned for whitespace and punctuation consistency
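The format properties listed above can be checked before any downstream processing. The following is a minimal sketch, not part of the release: the card does not say which Unicode normalization form was applied, so NFC is assumed here as a default, and `check_text` is a hypothetical helper name.

```python
import unicodedata


def check_text(raw: bytes, form: str = "NFC") -> dict:
    """Report whether raw bytes decode as UTF-8 and match a normalization form.

    The form defaults to NFC, which is an assumption; the dataset card
    only says the text is "Unicode normalized".
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so normalization cannot be assessed.
        return {"utf8": False, "normalized": False}
    return {"utf8": True, "normalized": unicodedata.is_normalized(form, text)}
```

To check a file, pass its bytes, for example `check_text(open(path, "rb").read())`. If the check fails under NFC, trying `form="NFKC"` may show which form the corpus actually uses.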
Domains of the Text
Literature (creative writing)
Poetry
Folklore and oral tradition (textual form)
Cultural and historical writing
Research articles and academic writing
Translations and scholarly texts
Dataset Structure
Total files: 4
Each file represents a distinct genre or functional category.
Files are treated as independent genre containers.
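Because each file is an independent genre container, a natural way to load the corpus is one entry per file. The sketch below assumes the four files sit in a single local directory and end in `.txt`; the card does not give file names, so `corpus_dir`, `load_genres`, and `summarize` are hypothetical names chosen for illustration.

```python
from pathlib import Path


def load_genres(corpus_dir: str) -> dict[str, str]:
    """Map each .txt file name (without extension) to its full text."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(corpus_dir).glob("*.txt"))
    }


def summarize(genres: dict[str, str]) -> dict[str, dict[str, int]]:
    """Count characters and whitespace-separated tokens per genre file."""
    return {
        name: {"chars": len(text), "tokens": len(text.split())}
        for name, text in genres.items()
    }
```

Whitespace splitting is only a rough token count for Balochi, since clitics such as ءَ and ءِ are often written as separate words. Keeping genres in separate entries makes it easy to hold out one genre for evaluation.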
Text Cleaning and Processing
UTF-8 encoding
Unicode normalization
Removal of stray symbols and markup
Basic punctuation and whitespace cleanup
No semantic alteration of original content
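The cleaning steps above could look roughly like the following sketch. It is an illustration, not BECO's actual pipeline, which the card does not document. The normalization form (NFC), the definition of "markup" as angle-bracket tags, and the function name `clean_text` are all assumptions. Punctuation cleanup is left out because its rules are not specified.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")         # leftover HTML/XML-style tags (assumed markup)
SPACE_RE = re.compile(r"[ \t\u00a0]+")  # runs of spaces, tabs, no-break spaces


def clean_text(text: str, form: str = "NFC") -> str:
    """Normalize, strip tags and control characters, and tidy whitespace."""
    text = unicodedata.normalize(form, text)
    text = TAG_RE.sub(" ", text)
    # Drop control characters (category Cc) except newline and tab.
    # Format characters (Cf) such as the zero-width non-joiner are kept,
    # since they can be meaningful in Arabic-script text.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse horizontal whitespace and trim each line; words are untouched.
    lines = [SPACE_RE.sub(" ", line).strip() for line in text.split("\n")]
    return "\n".join(lines)
```

The function changes only encoding form, markup, and spacing, which matches the stated goal of no semantic alteration of the original content.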
Sample Text
بلوچی زبان ءُ لبزانک وہدے ماں وانگ ءُ نبشتہی رنگء َ دیم ءَ آیاں بیت داں اے کاروان ءِ تہ ءَ روکپتی بلوچستان ءِ بازیں مردم ءِ وت ءَ ہمے رُمب ءِ توک ءَ ھوار گیجیت،بلوچی لبزانک ءِ ھمے کاروان ءَ اشرف سربازی ھم ھوار بیت ءُ بلوچی زبان ءِ بازیں پہناتانی سرا کار کنت ءُ وت ءَ ماں لبزانک ءَ نمیراں کنت۔ چہ امرِتسر ءَ گِچینیں ریل بیگاہ ءِ دو اَدار ءَ درآتک چہ ھشت ادار ءَ پد مُگل پورہ ءَ سر بوت، راہ ءَ گپڑے مردم جنگ بوت، بازینے ٹپّی ءُ لھتیں شِنگ ءُ شانگ بوت۔ ’’بلوچی یک ایرانی زبانے، اے رودراتکی ایرانی شاخءَ گوں سیادی داریت ءُ ا ے اوستا ءَ چہ گیش گوں کہنین فارسی ءَ سیادی داریت‘‘۔ بلوچی زبان ءِ تامداریں گالوار رخشانی گالوارءِہند ءُ دمگ خاران ءَ چے سیادی داروکیں نوکتریں شاعرگوں وتی سادگیں ءُ تچکیں ھیالاں اُلسی شاعری ءِ یک حاصیں بہرے جہتءَ پہ وتءَ جوانیں جاہ ءِ جوڑ کنوک قیوم سادگ 8 مارچ 1978 ءَ ماں خاران ءَ ودی بوتگ۔بنداتی وانگ شہ خاران ءِ وانگجاہ ءَدہمی تبک 1994ء ءِ سالءَ دربرتگ، ءُ پدا شہ خاران ءِ کالجءَبی۔اے چکاسءَ سوبیں بوتگ ھستیں زمانگ ءَماں قیوم سادگ بلوچی زبان ءِ نامداریں شاعرے زانگ بیت۔