Talar (تلار) Barahui Magazine Corpus
License:
CC-BY-NC-SA-4.0
Steward:
Balochistan Educational and Cultural Organization
Task: NLP
Release Date: 2/10/2026
Format: TXT
Size: 317.22 KB
Description
The corpus consists of approximately 150,000 words collected from Talar, a monthly Brahui-language magazine. The corpus includes a range of written genres such as editorials, essays, fiction, poetry, and socio-cultural commentary, reflecting contemporary Brahui usage. It provides a representative resource for linguistic, literary, and computational research on modern written Brahui.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
The corpus is restricted to research and non-commercial use, and users must provide proper attribution when using or redistributing the data.
Forbidden Usage
It is forbidden to identify individuals, infer sensitive information, use the dataset for surveillance/harassment/hate, or train chatbots or large language models.
Processes
Ethical Review
The data is shared with the permission of all parties involved in the process.
Metadata
Language
Brahui (Brahui: براہوئی; also romanized as Brahvi or Brohi) is a Dravidian language primarily spoken in the central and southern regions of Pakistan’s Balochistan province, with additional speaker communities in Iranian Baluchestan, Afghanistan, and Turkmenistan, particularly around Merv. Smaller expatriate Brahui communities are also found in Iraq, Qatar, and the United Arab Emirates. Geographically, Brahui is highly isolated from other Dravidian languages, with its nearest linguistic relatives located more than 1,500 kilometers away in South India. Within Balochistan, Brahui is predominantly spoken in the districts of Kalat, Khuzdar, Mastung, Quetta, Bolan, Nasirabad, Nushki, and Kharan.
Required Processing
The corpus requires text cleaning and normalization to remove or standardize non-Brahui text, followed by basic segmentation, metadata annotation, and tokenization.
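The segmentation and tokenization steps could be sketched roughly as follows. This is a minimal illustration, not the steward's pipeline. The function names, the choice of NFC normalization, and the set of sentence-final marks (the Arabic full stop "۔" and question mark "؟", plus ASCII equivalents) are all assumptions made for the example.

```python
import re
import unicodedata

# Assumed sentence-final marks: Arabic full stop (U+06D4), Arabic question
# mark (U+061F), and the ASCII marks . ! ?
SENTENCE_END = re.compile(r"(?<=[\u06D4\u061F.!?])\s+")

# Punctuation stripped from token edges (includes Arabic comma U+060C
# and Arabic semicolon U+061B).
PUNCT = "\u06D4\u061F\u060C\u061B.,!?;:\"'()[]"


def normalize(text):
    """Apply Unicode NFC and collapse runs of whitespace to single spaces."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split on whitespace that follows a sentence-final mark."""
    return [s for s in SENTENCE_END.split(normalize(text)) if s]


def tokenize(sentence):
    """Whitespace tokenization with edge punctuation removed."""
    tokens = (t.strip(PUNCT) for t in sentence.split())
    return [t for t in tokens if t]
```

Whitespace tokenization is only a starting point. The sample text writes some suffixes as separate space-delimited units, so a real tokenizer may need language-specific rules that this sketch does not attempt.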
Alphabet Set in Brahui
ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل ڷ م ن ں و ہ ھ ی ے
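One possible use of this alphabet is to flag tokens that contain characters outside it, as a first pass at isolating non-Brahui text. The sketch below assumes that approach. The helper names are hypothetical, and `EXTRA_ALLOWED` is an assumption added because the sample text uses characters that are not in the list above (for example آ, ئ, ء, and short-vowel diacritics). Adjust it to match the orthography you want to accept.

```python
# Letters copied from the alphabet list in this card.
BRAHUI_LETTERS = set(
    "ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ "
    "ف ق ک گ ل ڷ م ن ں و ہ ھ ی ے".split()
)

# Assumption: characters seen in running text but absent from the list:
# alef with madda, yeh with hamza, hamza, waw with hamza,
# and the diacritics fatha, damma, kasra, shadda.
EXTRA_ALLOWED = set("\u0622\u0626\u0621\u0624\u064E\u064F\u0650\u0651")

ALLOWED = BRAHUI_LETTERS | EXTRA_ALLOWED


def is_brahui_token(token, allowed=ALLOWED):
    """True if the token is non-empty and every character is allowed."""
    return bool(token) and all(ch in allowed for ch in token)


def flag_non_brahui(tokens, allowed=ALLOWED):
    """Return the tokens that contain at least one disallowed character."""
    return [t for t in tokens if not is_brahui_token(t, allowed)]
```

A character-set check cannot distinguish Brahui from other languages written in the same script, such as Urdu or Balochi, so flagged tokens are candidates for review rather than confirmed foreign text.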
Sample Text
براہوئی زبان ،رسم الخط نا تاریخ عبدالرازق ابابکی براہوئی زبان نا متکنی تینا جاگہ غا ولے اونا نوشتہ ئی مڈی نا ہُرے آ بناء مننگ و ولدا اونا بھاز کم انگا کچ انا سوب آن دا بولی نا کیہی ویل آک داتم اسکان ایسر مننگ کتنو۔ دا ویل آتیان بنائی ویل اس رسم الخط نا ہم ارے۔ وخت اس کہ لکھوڑ نا بابت ہم امنائی خننگپک۔ دا بابت داتم اسکان مروک آ کوشست آک سرسہب مننگ کتنو۔