Eastern Balochi Literature Corpus

License: CC-BY-NC-4.0

Steward: Balochi Academy

Task: NLP

Release Date: 2/13/2026

Format: TXT

Size: 949.67 KB


Description

The Eastern Balochi Literature Corpus by Balochi Academy is a curated and cleaned collection of literary texts written in Eastern Balochi. The corpus represents a wide range of genres including poetry, folklore, novels, short stories, translations, and cultural writings. Eastern Balochi is considered a group of closely related dialects rather than a single unified dialect. These dialects are often associated with tribal identities such as Marrī, Bugṭī, Leghārī, Mazārī, and Buzdar. Due to historical, geographical, and sociolinguistic factors, Eastern Balochi has received limited scholarly attention compared to other Balochi varieties. The corpus reflects authentic language use across different regions of Eastern Balochistan, Sindh, and Punjab. All files have been cleaned, normalized to UTF-8 Unicode, and structured to support linguistic analysis, corpus linguistics, and natural language processing research.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

- Commercial redistribution of the dataset or its substantial parts without explicit permission from the rights holders is prohibited.
- The dataset must not be used to misrepresent, distort, or decontextualize Balochi culture, history, or literature.
- Attribution to the original authors, publishers, and Balochi Academy is required where applicable.
- Any derivative datasets must clearly state that they are derived from this corpus.

Forbidden Usage

The dataset must NOT be used for:
- Hate speech, harassment, or discriminatory content generation
- Political manipulation or targeted propaganda
- Misattribution of authorship or removal of original credits
- Training systems intended for surveillance, profiling, or repression of individuals or communities
- Generating misleading or fabricated literary or historical claims attributed to real authors or communities

Processes

Ethical Review

The corpus contains only literary, cultural, and educational texts from published/public sources, includes no sensitive personal data, and was reviewed for cultural respect and authenticity; cleaning and normalization preserved meaning and style.

Intended Use

This dataset is intended for linguistic and corpus research on Eastern Balochi (including dialect variation), low-resource NLP tasks (e.g., language modeling, tokenization, morphological analysis), digital preservation of literary heritage, and educational/academic use, while respecting the cultural value of the texts.

Metadata

Language Information

  • Language: Eastern Balochi

  • Language Family: Iranian → Northwestern Iranian → Balochi

  • Dialect Group: Eastern Balochi

  • Approximate Size: ~1.9M tokens

  • Writing Direction: Right-to-left

  • Encoding: UTF-8 (Unicode normalized)

Domains of the Text

  • Literature (Creative Writing)

  • Poetry (Aesthetic and Cultural Expression)

  • Folklore and Oral Tradition (Textual Form)

  • Cultural Knowledge and Heritage

  • Novels, Stories, and Short Stories

  • Everyday Social Themes

Script Information

Eastern Balochi Script: آ ا ب پ ت ٹ ج چ د ڈ ر ز ژ س ش ک گ ل م ن و ۏ ھ ء ی ے ݔ
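
The letter inventory above can serve as a quick sanity check on any text claimed to be in this orthography. The following Python sketch is illustrative only: the function name `unlisted_letters` is hypothetical, and applying NFC normalization before the comparison is an assumption, since the card does not name a normalization form. It counts letters that fall outside the listed inventory and ignores digits, punctuation, and combining marks.

```python
import unicodedata
from collections import Counter

# Letter inventory copied from the "Script Information" line above.
INVENTORY = "آ ا ب پ ت ٹ ج چ د ڈ ر ز ژ س ش ک گ ل م ن و ۏ ھ ء ی ے ݔ"

# Normalize first (NFC is an assumption) so composed and decomposed forms compare equal.
LISTED = {ch for ch in unicodedata.normalize("NFC", INVENTORY) if not ch.isspace()}


def unlisted_letters(text: str) -> Counter:
    """Count letters in `text` that are not part of the listed inventory.

    Only characters whose Unicode general category starts with "L" (letters)
    are considered, so digits, punctuation, and diacritics are skipped.
    """
    text = unicodedata.normalize("NFC", text)
    return Counter(
        ch
        for ch in text
        if unicodedata.category(ch).startswith("L") and ch not in LISTED
    )
```

Calling `unlisted_letters(open(path, encoding="utf-8").read()).most_common(20)` on a corpus file would surface the most frequent letters outside the list, such as loanword letters or dialect-specific spellings, which can then be reviewed by hand.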

Dataset Structure

  • Total Files: 11

  • Each file represents a distinct genre or literary domain

  • File names correspond directly to content

  • Cleaned layer includes normalized UTF-8 text

File-Level Metadata

  1. 01-Articles on Balochi Poetry-Soorat Marri-147765.txt — Articles / Poetry — 147,765 words — TXT

  2. 02-Balochi FolkTales-01-132626.txt — Folklore — 132,626 words — TXT

  3. 03-Balochi FolkTales-02-271639.txt — Folklore — 271,639 words — TXT

  4. 04-Balochi FolkTales-02-Baloch History-43977.txt — History / Folklore — 43,977 words — TXT

  5. 05-Balochi FolkTales-03-719875.txt — Folklore — 719,875 words — TXT

  6. 06-Murwarid novel-110040.txt — Novel — 110,040 words — TXT

  7. 07-Novel-Haji Murad-Rakhshani Dialect-199382.txt — Novel — 199,382 words — TXT

  8. 08-ShortStories-8781.txt — Short Stories — 8,781 words — TXT

  9. 09-Short-Stories-Aziz Bugti-149625.txt — Short Stories — 149,625 words — TXT

  10. 10-Aesop's Fables by Ali Gohar - 41854.txt — Translations — 41,854 words — TXT

  11. 11-Qalat by Dr Beezan - 72326.txt — Literature / History — 72,326 words — TXT
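
Because each file name ends with its stated word count, the listing above can be checked mechanically. The sketch below is a minimal example under stated assumptions: the names `parse_name`, `count_words`, and `check_corpus` are hypothetical, the regular expression is inferred from the eleven names listed here, and counting words by splitting on whitespace is an assumption, since the card does not say how the counts were produced.

```python
import re
from pathlib import Path
from typing import Iterator

# File names exactly as listed in the File-Level Metadata.
FILE_NAMES = [
    "01-Articles on Balochi Poetry-Soorat Marri-147765.txt",
    "02-Balochi FolkTales-01-132626.txt",
    "03-Balochi FolkTales-02-271639.txt",
    "04-Balochi FolkTales-02-Baloch History-43977.txt",
    "05-Balochi FolkTales-03-719875.txt",
    "06-Murwarid novel-110040.txt",
    "07-Novel-Haji Murad-Rakhshani Dialect-199382.txt",
    "08-ShortStories-8781.txt",
    "09-Short-Stories-Aziz Bugti-149625.txt",
    "10-Aesop's Fables by Ali Gohar - 41854.txt",
    "11-Qalat by Dr Beezan - 72326.txt",
]

# Inferred pattern: two-digit index, title, then a trailing word count.
NAME_RE = re.compile(r"^(?P<index>\d{2})-(?P<title>.+?)\s*-\s*(?P<words>\d+)\.txt$")


def parse_name(file_name: str) -> tuple[int, str, int]:
    """Split a file name into (index, title, stated word count)."""
    match = NAME_RE.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    return int(match["index"]), match["title"].strip(), int(match["words"])


def count_words(path: Path) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(path.read_text(encoding="utf-8").split())


def check_corpus(corpus_dir: str) -> Iterator[tuple[str, int, int]]:
    """Yield (file name, stated count, observed count) for each file present."""
    for name in FILE_NAMES:
        path = Path(corpus_dir) / name
        if path.is_file():
            yield name, parse_name(name)[2], count_words(path)


stated_total = sum(parse_name(name)[2] for name in FILE_NAMES)
print(f"{stated_total:,}")  # prints 1,897,890
```

The stated counts sum to 1,897,890, which agrees with the approximate size of ~1.9M given under Language Information. Observed counts from `check_corpus` may differ slightly from the stated ones if the original counts used a different definition of a word.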

Cleaning and Processing

  • UTF-8 encoding with Unicode normalization

  • Removal of stray symbols and markup

  • Standardized punctuation and whitespace

  • Genre-based file separation preserved
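
For readers who want to apply comparable cleaning to additional Eastern Balochi text, the steps above might look roughly like the following Python sketch. It is not the Balochi Academy's actual pipeline: the function name `clean_text` is hypothetical, NFC is an assumed normalization form (the card only says "Unicode normalized"), and the exact whitespace rules are illustrative choices.

```python
import re
import unicodedata


def clean_text(text: str, form: str = "NFC") -> str:
    """Apply Unicode normalization and basic whitespace cleanup.

    The normalization form and the whitespace rules are assumptions,
    not the documented procedure for this corpus.
    """
    # Unicode normalization, e.g. alef + combining madda becomes a single code point.
    text = unicodedata.normalize(form, text)
    # Drop byte-order marks left over from file concatenation.
    text = text.replace("\ufeff", "")
    # Remove control characters (category Cc) except newline and tab.
    # Format characters such as the zero-width non-joiner are kept on purpose,
    # because they can be meaningful in Perso-Arabic script.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces, tabs, and no-break spaces into one space.
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    # Trim each line, then limit consecutive blank lines to one.
    text = "\n".join(line.strip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

A typical call would be `clean_text(Path(name).read_text(encoding="utf-8"))`. Punctuation standardization is left out here because the card does not specify which punctuation marks were mapped to which.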

Other Information and Sample Text

The corpus preserves original orthography, stylistic variation, and dialectal features. Apart from the cleaning steps listed under Cleaning and Processing, no content was shortened, altered, or modernized during processing.

Sample Text

چڑا اکرث کہ نماش چتری سر ءَپلوا یک پٹ ءِ تارے اث۔بچ ءَ پولکثہ ’’ابا اے چشوئیں پٹے۔‘‘
آبے دنتانیں پیر مرد ءَ گوں بچکندگے پسّو دات۔
چونائی ءَرجانک کاری گرانیں کاریے۔گوں گِدارُک،گِدار نویس ءَ وتی زبان ءَ انصاف ہم بوت نہ کنت بلئے پدا ہم اے یک کسانیں جُہدے۔
گڈا بادشاہ ءَکُل بلوچانی سردار لوٹائینتھو، گؤشتی کہ ’’شا بلوچ بولک اے۔‘‘
بلوچی قصّہ دی بلوچ راج ءِ کہن تریں لوزانک اِنت۔ او ایشی باروا مڑدمے گوشت نخنت کہ اے کذی جوڑینغ بیثغنت۔