Talar (تلار) Brahui Magazine Corpus

License: CC-BY-NC-SA-4.0

Steward: Balochistan Educational and Cultural Organization

Task: NLP

Release Date: 2/10/2026

Format: TXT

Size: 317.22 KB


Description

The corpus consists of approximately 150,000 words collected from Talar, a monthly Brahui-language magazine. It covers a range of written genres, including editorials, essays, fiction, poetry, and socio-cultural commentary, and reflects contemporary Brahui usage. It provides a representative resource for linguistic, literary, and computational research on modern written Brahui.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

The corpus is restricted to research and non-commercial use, and users must provide proper attribution when using or redistributing the data.

Forbidden Usage

It is forbidden to use the dataset to identify individuals, to infer sensitive information, for surveillance, harassment, or hate, or to train chatbots or large language models.

Processes

Ethical Review

The data is shared with the permission of all parties involved in the process.

Metadata

Language

Brahui (Brahui: براہوئی; also romanized as Brahvi or Brohi) is a Dravidian language primarily spoken in the central and southern regions of Pakistan’s Balochistan province, with additional speaker communities in Iranian Baluchestan, Afghanistan, and Turkmenistan, particularly around Merv. Smaller expatriate Brahui communities are also found in Iraq, Qatar, and the United Arab Emirates. Geographically, Brahui is highly isolated from other Dravidian languages, with its nearest linguistic relatives located more than 1,500 kilometers away in South India. Within Balochistan, Brahui is predominantly spoken in the districts of Kalat, Khuzdar, Mastung, Quetta, Bolan, Nasirabad, Nushki, and Kharan.

Required Processing

The corpus requires text cleaning and normalization to handle non-Brahui text, basic segmentation, metadata annotation, and tokenization.

Alphabet Set in Brahui

ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل ڷ م ن ں و ہ ھ ی ے
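As an illustration of the cleaning and tokenization steps listed under Required Processing, the following Python sketch filters tokens against the alphabet above. It is a minimal example under stated assumptions, not the steward's own pipeline: the function names (`tokenize`, `is_brahui_token`, `keep_brahui`), the extra hamza and vowel marks, and the punctuation list are all assumptions introduced here.

```python
import re
import unicodedata

# Letters copied from the "Alphabet Set in Brahui" list in this card.
BRAHUI_LETTERS = set(
    "ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل ڷ م ن ں و ہ ھ ی ے"
    .replace(" ", "")
)

# Assumption (not from the card): hamza forms, alef madda, and the combining
# marks U+064B..U+0655 also count as Brahui script.
EXTRA_MARKS = set("\u0621\u0622\u0624\u0626") | {chr(c) for c in range(0x064B, 0x0656)}

ALLOWED = BRAHUI_LETTERS | EXTRA_MARKS

# Assumption: punctuation to split off as separate tokens
# (Arabic comma, semicolon, question mark, full stop, plus ASCII marks).
PUNCTUATION = "\u060C\u061B\u061F\u06D4.,;:!?()\"'"
_PUNCT_RE = re.compile("([" + re.escape(PUNCTUATION) + "])")


def tokenize(text):
    """Normalize to NFC, then split on whitespace with punctuation separated."""
    text = unicodedata.normalize("NFC", text)
    return _PUNCT_RE.sub(r" \1 ", text).split()


def is_brahui_token(token):
    """True if the token is non-empty and every character is in ALLOWED."""
    return bool(token) and all(ch in ALLOWED for ch in token)


def keep_brahui(text):
    """Return tokens written entirely in ALLOWED characters.

    Latin-script words, digits, and punctuation tokens are dropped.
    """
    return [tok for tok in tokenize(text) if is_brahui_token(tok)]
```

Applying `keep_brahui` to each line of the TXT file would drop Latin-script words, digits, and punctuation. The filter is deliberately strict: any token containing a character outside the listed alphabet and the assumed extra marks is rejected, so the allowed set may need widening for a given study.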

Sample Text

براہوئی زبان ،رسم الخط نا تاریخ عبدالرازق ابابکی براہوئی زبان نا متکنی تینا جاگہ غا ولے اونا نوشتہ ئی مڈی نا ہُرے آ بناء مننگ و ولدا اونا بھاز کم انگا کچ انا سوب آن دا بولی نا کیہی ویل آک داتم اسکان ایسر مننگ کتنو۔ دا ویل آتیان بنائی ویل اس رسم الخط نا ہم ارے۔ وخت اس کہ لکھوڑ نا بابت ہم امنائی خننگپک۔ دا بابت داتم اسکان مروک آ کوشست آک سرسہب مننگ کتنو۔