BECO Brahui Literature Corpus

License: CC-BY-NC-SA-4.0

Steward: Balochistan Educational and Cultural Organization

Task: NLP

Release Date: 3/17/2026

Format: TXT

Size: 1.19 MB


Description

This Brahui literary corpus contains short stories, novels, and other creative literary works, representing a broad range of narrative styles and themes within Brahui literature. The texts reflect both classical and contemporary writing, offering insight into cultural expression and linguistic variation in Brahui. The corpus comprises approximately 355,000 tokens, making it a valuable resource for linguistic research and natural language processing tasks involving an under-resourced language.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

The corpus is restricted to research and non-commercial use, and users must provide proper attribution when using or redistributing the data.

Forbidden Usage

The corpus must not be used for commercial exploitation, surveillance, or any activity that misrepresents, harms, or exploits Brahui-speaking communities.

Processes

Ethical Review

The data was curated and uploaded with the permission of all relevant parties. For more information, please reach out to the point of contact.

Intended Use

Metadata

Language

Brahui (Brahui: براہوئی; also romanized as Brahvi or Brohi) is a Dravidian language primarily spoken in the central and southern regions of Pakistan’s Balochistan province, with additional speaker communities in Iranian Baluchestan, Afghanistan, and Turkmenistan, particularly around Merv. Smaller expatriate Brahui communities are also found in Iraq, Qatar, and the United Arab Emirates. Geographically, Brahui is highly isolated from other Dravidian languages, with its nearest linguistic relatives located more than 1,500 kilometers away in South India. Within Balochistan, Brahui is predominantly spoken in the districts of Kalat, Khuzdar, Mastung, Quetta, Bolan, Nasirabad, Nushki, and Kharan.

Domains in Corpus

  • Literature and Fiction

  • Arts and Culture

  • Narrative and Storytelling

  • Creative Writing

  • Cultural Heritage and Expression

Required Processing

The texts in this corpus were carefully processed to ensure consistency and usability for linguistic and computational research. Processing steps included cleaning the data, normalizing orthography, and removing duplicate or corrupted entries while preserving original literary features. The corpus was then tokenized, resulting in a total of approximately 355,000 tokens. Quality checks were performed to maintain textual integrity and linguistic accuracy throughout the dataset.
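The steps named above (cleaning, orthographic normalization, duplicate removal, tokenization) can be illustrated with a minimal Python sketch. This is not the pipeline used to build the corpus, which is not documented here. The function names, the choice of Unicode NFC normalization, line-level deduplication, and the punctuation set are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumed punctuation set: Arabic comma, Urdu full stop, Arabic semicolon,
# Arabic question mark, plus common ASCII marks. Not taken from the corpus docs.
PUNCT = "\u060C\u06D4\u061B\u061F.,!?()\"'"
_P = re.escape(PUNCT)
# A token is either a run of non-space, non-punctuation characters
# (so combining diacritics stay attached to their word) or one punctuation mark.
TOKEN_RE = re.compile(rf"[^\s{_P}]+|[{_P}]")


def normalize_line(line: str) -> str:
    """Apply NFC normalization and collapse runs of whitespace (assumed steps)."""
    line = unicodedata.normalize("NFC", line)
    return re.sub(r"\s+", " ", line).strip()


def clean_corpus(lines):
    """Normalize lines, drop empty ones, and drop exact duplicates in order."""
    seen = set()
    cleaned = []
    for raw in lines:
        line = normalize_line(raw)
        if line and line not in seen:
            seen.add(line)
            cleaned.append(line)
    return cleaned


def tokenize(line: str):
    """Split a line into word tokens and single punctuation tokens."""
    return TOKEN_RE.findall(line)


if __name__ == "__main__":
    sample = ["عشق (افسانہ)۔", "عشق   (افسانہ)۔", ""]
    for line in clean_corpus(sample):
        print(tokenize(line))
```

Because the tokenization scheme behind the stated figure of approximately 355,000 tokens is not specified, counts produced by a sketch like this one may differ from that figure.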

Alphabet Set in Brahui

ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل ڷ م ن ں و ہ ھ ی ے
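One possible use of the alphabet set is a character-coverage check that lists letters appearing in a text but not in the set above. The sketch below is an illustration only. The names BRAHUI_LETTERS and unlisted_letters are hypothetical, and the check covers letters only, so diacritics, digits, and punctuation are ignored.

```python
# Letters copied from the alphabet set listed above.
BRAHUI_LETTERS = set(
    "ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل ڷ م ن ں و ہ ھ ی ے".split()
)


def unlisted_letters(text: str) -> set:
    """Return alphabetic characters in `text` that are not in the listed set.

    Combining marks (short-vowel diacritics), digits, whitespace, and
    punctuation are not alphabetic and are therefore skipped.
    """
    return {ch for ch in text if ch.isalpha() and ch not in BRAHUI_LETTERS}


if __name__ == "__main__":
    # Any text can be passed in; the result is the set of letters to review.
    print(sorted(unlisted_letters("کنا زند")))
```

Letters flagged by such a check are not necessarily errors. Written text may use additional forms, such as hamza carriers or alef with madda, that a base alphabet list does not enumerate.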

Sample Text

عشق (افسانہ)۔ شاہ رحمن اؤ او کنا زند انا اولیکو محبت اس، اونا ہر ہیت کنا اُست اٹ ہندُن پیوست ئس، امر کہ روح جان اٹ۔ اونا عشق کنے ہنداخہ در گم کریسس کہ کنا یات آن ہناسس کہ ہراتم روح جان آن جتا مریک۔ تو او اسہ مڑدہ ءُ لاش اس مریک، ہرانا جاگہ بیرہ مِشک ءُ۔۔۔