Eastern Balochi Literature Corpus

License:

CC-BY-NC-4.0

Steward:

Balochi Academy

Task: NLP

Release Date: 2/13/2026

Format: TXT

Size: 949.67 KB

Description

The Eastern Balochi Literature Corpus by Balochi Academy is a curated and cleaned collection of literary texts written in Eastern Balochi. It covers a range of genres, including poetry, folklore, novels, short stories, translations, and cultural writings. Eastern Balochi is generally considered a group of closely related dialects rather than a single unified dialect, and these dialects are often associated with tribal identities such as Marrī, Bugṭī, Leghārī, Mazārī, and Buzdar. For historical, geographical, and sociolinguistic reasons, Eastern Balochi has received less scholarly attention than other Balochi varieties. The corpus reflects authentic language use across different regions of Eastern Balochistan, Sindh, and Punjab. All files have been cleaned, encoded as UTF-8 with Unicode normalization, and structured to support linguistic analysis, corpus linguistics, and natural language processing research.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

- Commercial redistribution of the dataset or its substantial parts without explicit permission from the rights holders is prohibited.
- The dataset must not be used to misrepresent, distort, or decontextualize Balochi culture, history, or literature.
- Attribution to the original authors, publishers, and Balochi Academy is required where applicable.
- Any derivative datasets must clearly state that they are derived from this corpus.

Forbidden Usage

The dataset must NOT be used for:
- Hate speech, harassment, or discriminatory content generation
- Political manipulation or targeted propaganda
- Misattribution of authorship or removal of original credits
- Training systems intended for surveillance, profiling, or repression of individuals or communities
- Generating misleading or fabricated literary or historical claims attributed to real authors or communities

Processes

Ethical Review

The corpus contains only literary, cultural, and educational texts from published or public sources, and it includes no sensitive personal data. The texts were reviewed for cultural respect and authenticity. Cleaning and normalization preserved meaning and style.

Intended Use

This dataset is intended for linguistic and corpus research on Eastern Balochi (including dialect variation), low-resource NLP tasks (e.g., language modeling, tokenization, morphological analysis), digital preservation of literary heritage, and educational/academic use, while respecting the cultural value of the texts.

Metadata

Language Information

Language: Eastern Balochi
Language Family: Iranian → Northwestern Iranian → Balochi
Dialect Group: Eastern Balochi
Approximate Size: ~1.9M tokens
Writing Direction: Right-to-left
Encoding: UTF-8 (Unicode normalized)

Domains of the Text

Literature (Creative Writing)
Poetry (Aesthetic and Cultural Expression)
Folklore and Oral Tradition (Textual Form)
Cultural Knowledge and Heritage
Novels, Stories, and Short Stories
Everyday Social Themes

Script Information

Eastern Balochi Script: آ ا ب پ ت ٹ ج چ د ڈ ر ز ژ س ش ک گ ل م ن و ۏ ھ ء ی ے ݔ
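
The character list above can be used as a quick coverage check on any text drawn from the corpus. The following is a minimal sketch, not part of the dataset's tooling: `INVENTORY` is built from the line above, and `letters_outside_inventory` is a hypothetical helper name. It reports every letter (Unicode general category starting with "L") that is not in the listed set, while ignoring digits, punctuation, and combining marks.

```python
import unicodedata

# Characters copied from the "Eastern Balochi Script" line of this card.
INVENTORY_LINE = "آ ا ب پ ت ٹ ج چ د ڈ ر ز ژ س ش ک گ ل م ن و ۏ ھ ء ی ے ݔ"
INVENTORY = {ch for ch in INVENTORY_LINE if not ch.isspace()}


def letters_outside_inventory(text: str) -> list[str]:
    """Return the sorted letters in `text` that are not in INVENTORY.

    Only code points whose Unicode category starts with "L" are checked,
    so digits, punctuation, whitespace, and diacritics are skipped.
    """
    found = {
        ch
        for ch in text
        if unicodedata.category(ch).startswith("L") and ch not in INVENTORY
    }
    return sorted(found)


if __name__ == "__main__":
    sample = "چڑا اکرث کہ"  # opening words of the sample text in this card
    for ch in letters_outside_inventory(sample):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")
```

Run against the sample text in this card, a check of this kind flags letters such as ڑ, ث, and ہ, which suggests the list above may not be exhaustive for every file.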

Dataset Structure

Total Files: 11
Each file represents a distinct genre or literary domain
File names correspond directly to content
Cleaned layer includes normalized UTF-8 text

File-Level Metadata

01-Articles on Balochi Poetry-Soorat Marri-147765.txt — Articles / Poetry — 147,765 words — TXT
02-Balochi FolkTales-01-132626.txt — Folklore — 132,626 words — TXT
03-Balochi FolkTales-02-271639.txt — Folklore — 271,639 words — TXT
04-Balochi FolkTales-02-Baloch History-43977.txt — History / Folklore — 43,977 words — TXT
05-Balochi FolkTales-03-719875.txt — Folklore — 719,875 words — TXT
06-Murwarid novel-110040.txt — Novel — 110,040 words — TXT
07-Novel-Haji Murad-Rakhshani Dialect-199382.txt — Novel — 199,382 words — TXT
08-ShortStories-8781.txt — Short Stories — 8,781 words — TXT
09-Short-Stories-Aziz Bugti-149625.txt — Short Stories — 149,625 words — TXT
10-Aesop's Fables by Ali Gohar - 41854.txt — Translations — 41,854 words — TXT
11-Qalat by Dr Beezan - 72326.txt — Literature / History — 72,326 words — TXT
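
Each file name above ends with its word count, so the totals can be recovered from the names alone. The sketch below assumes that the last run of digits before `.txt` is the word count, which holds for the 11 names listed here; `word_count_from_name` is a hypothetical helper, not something shipped with the corpus.

```python
import re

# File names copied from the File-Level Metadata list in this card.
FILE_NAMES = [
    "01-Articles on Balochi Poetry-Soorat Marri-147765.txt",
    "02-Balochi FolkTales-01-132626.txt",
    "03-Balochi FolkTales-02-271639.txt",
    "04-Balochi FolkTales-02-Baloch History-43977.txt",
    "05-Balochi FolkTales-03-719875.txt",
    "06-Murwarid novel-110040.txt",
    "07-Novel-Haji Murad-Rakhshani Dialect-199382.txt",
    "08-ShortStories-8781.txt",
    "09-Short-Stories-Aziz Bugti-149625.txt",
    "10-Aesop's Fables by Ali Gohar - 41854.txt",
    "11-Qalat by Dr Beezan - 72326.txt",
]

# Assumption: the word count is the final run of digits before ".txt".
COUNT_RE = re.compile(r"(\d+)\.txt$")


def word_count_from_name(name: str) -> int:
    """Extract the trailing word count embedded in a corpus file name."""
    match = COUNT_RE.search(name)
    if match is None:
        raise ValueError(f"no trailing word count in {name!r}")
    return int(match.group(1))


counts = {name: word_count_from_name(name) for name in FILE_NAMES}
total_words = sum(counts.values())

if __name__ == "__main__":
    for name, n in counts.items():
        print(f"{n:>9,}  {name}")
    print(f"{total_words:>9,}  TOTAL")
```

The per-file counts sum to 1,897,890 words, which is consistent with the approximate size of ~1.9M given under Language Information.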

Cleaning and Processing

UTF-8 encoding with Unicode normalization
Removal of stray symbols and markup
Standardized punctuation and whitespace
Genre-based file separation preserved
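
This card does not state which Unicode normalization form or which whitespace rules were applied. As a rough illustration only, the sketch below assumes NFC normalization and simple whitespace collapsing; `clean_text` is a hypothetical helper and should not be read as the pipeline that produced the corpus. Users who add their own Eastern Balochi text alongside the corpus may want a comparable step so that visually identical strings compare equal.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize a text to NFC and tidy its whitespace.

    Assumptions (not taken from the dataset card): NFC is the target
    normalization form, runs of spaces/tabs/no-break spaces become one
    space, and three or more consecutive newlines become one blank line.
    """
    text = unicodedata.normalize("NFC", raw)
    # Unify line endings before working line by line.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [
        re.sub(r"[ \t\u00a0]+", " ", line).strip()
        for line in text.split("\n")
    ]
    text = "\n".join(lines)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    # ALEF followed by combining MADDAH ABOVE composes to ALEF WITH MADDA ABOVE under NFC.
    decomposed = "\u0627\u0653"
    print(clean_text(decomposed) == "\u0622")
```

Zero-width joiners and non-joiners are deliberately left untouched here, since they can carry orthographic meaning in Arabic-script text.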

Other Information and Sample Text

The corpus preserves original orthography, stylistic variation, and dialectal features. Apart from the cleaning steps listed under Cleaning and Processing, no content was shortened, altered, or modernized during processing.

Sample Text

چڑا اکرث کہ نماش چتری سر ءَپلوا یک پٹ ءِ تارے اث۔بچ ءَ پولکثہ ’’ابا اے چشوئیں پٹے۔‘‘
آبے دنتانیں پیر مرد ءَ گوں بچکندگے پسّو دات۔
چونائی ءَرجانک کاری گرانیں کاریے۔گوں گِدارُک،گِدار نویس ءَ وتی زبان ءَ انصاف ہم بوت نہ کنت بلئے پدا ہم اے یک کسانیں جُہدے۔
گڈا بادشاہ ءَکُل بلوچانی سردار لوٹائینتھو، گؤشتی کہ ’’شا بلوچ بولک اے۔‘‘
بلوچی قصّہ دی بلوچ راج ءِ کہن تریں لوزانک اِنت۔ او ایشی باروا مڑدمے گوشت نخنت کہ اے کذی جوڑینغ بیثغنت۔