Brahui Research Work Corpus
License:
CC-BY-NC-SA-4.0
Steward:
Balochistan Educational and Cultural Organization
Task: NLP
Release Date: 2/10/2026
Format: TXT
Size: 1.13 MB
Description
This corpus consists of research works, academic theses, and scholarly papers, comprising approximately 185,000 tokens. It covers a range of academic topics and formal registers, reflecting standardized writing practices and disciplinary conventions. The corpus is intended to support linguistic analysis, discourse studies, and computational research on academic and research-oriented text.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
The corpus is restricted to research and non-commercial use, and users must provide proper attribution when using or redistributing the data.
Forbidden Usage
The corpus must not be used for commercial exploitation, surveillance, or any activity that misrepresents, harms, or exploits Brahui-speaking communities.
Processes
Ethical Review
The corpus is shared with permission from all of the authors and researchers whose work it contains.
Metadata
Language
Brahui (Brahui: براہوئی; also romanized as Brahvi or Brohi) is a Dravidian language primarily spoken in the central and southern regions of Pakistan’s Balochistan province, with additional speaker communities in Iranian Baluchestan, Afghanistan, and Turkmenistan, particularly around Merv. Smaller expatriate Brahui communities are also found in Iraq, Qatar, and the United Arab Emirates. Geographically, Brahui is highly isolated from other Dravidian languages, with its nearest linguistic relatives located more than 1,500 kilometers away in South India. Within Balochistan, Brahui is predominantly spoken in the districts of Kalat, Khuzdar, Mastung, Quetta, Bolan, Nasirabad, Nushki, and Kharan.
Required Processing
The required processing includes collecting and digitizing source materials, cleaning and normalizing the text to ensure consistency, and applying tokenization and sentence segmentation. Additional steps involve annotating metadata such as document type, author, and year, as well as conducting quality checks to validate accuracy and usability for research and computational purposes.
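The normalization, sentence segmentation, and tokenization steps described above could be sketched as follows. This is a minimal illustration, not the steward's actual pipeline: the function names, the choice of NFC normalization, and the punctuation set are all assumptions made for the example.

```python
import re
import unicodedata

# Assumed sentence-final marks: Arabic full stop (U+06D4),
# Arabic question mark (U+061F), and "!".
SENTENCE_END = "\u06D4\u061F!"

# Assumed punctuation that should become separate tokens
# (includes Arabic comma U+060C and Arabic semicolon U+061B).
PUNCT = "\u06D4\u061F\u060C\u061B!\"():.,"
_P = re.escape(PUNCT)
TOKEN_RE = re.compile("[" + _P + "]|[^\\s" + _P + "]+")


def normalize(text: str) -> str:
    """NFC-normalize, drop zero-width space and BOM, collapse whitespace.

    The zero-width non-joiner (U+200C) is deliberately kept, since it can
    be meaningful in Arabic-script orthographies.
    """
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u200b", "").replace("\ufeff", "")
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split after each assumed sentence-final mark."""
    parts = re.split("(?<=[" + SENTENCE_END + "])\\s*", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence: str) -> list[str]:
    """Return punctuation marks and runs of other non-space characters."""
    return TOKEN_RE.findall(sentence)


if __name__ == "__main__":
    raw = "نظم  اسہ صنف اے۔ ولے نظم اسے۔"
    for sent in split_sentences(normalize(raw)):
        print(tokenize(sent))
```

Tokens are defined here as runs of non-space, non-punctuation characters rather than with `\w+`, so that any combining diacritics stay attached to their base letters. Quoted passages and citation markers such as "(3)" may need additional rules.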
Alphabet Set in Brahui
ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل ڷ م ن ں و ہ ھ ی ے
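As one possible quality check, the alphabet above can be used to flag letters in a file that fall outside the listed set. The sketch below is an assumption about how such a check might look; the names `BRAHUI_ALPHABET` and `out_of_alphabet` are hypothetical and are not part of the corpus release.

```python
import unicodedata
from collections import Counter

# Letters copied from the alphabet line above.
BRAHUI_ALPHABET = set(
    "ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل ڷ م ن ں و ہ ھ ی ے".split()
)


def out_of_alphabet(text: str) -> Counter:
    """Count letters (Unicode category L*) that are not in the listed alphabet."""
    counts = Counter()
    for ch in text:
        if unicodedata.category(ch).startswith("L") and ch not in BRAHUI_ALPHABET:
            counts[ch] += 1
    return counts


if __name__ == "__main__":
    # Latin letters are reported; the listed Brahui letters are not.
    print(out_of_alphabet("نظم abc"))
```

A letter reported by this check is not necessarily an error. The language name براہوئی itself contains a hamza-bearing letter that does not appear in the list, so the output is better treated as a prompt for manual review than as a filter.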
Domains of the Text
The data is curated from research work produced under BECO on the promotion and history of the Brahui community. It covers the following domains:
Academic Research and Scholarship
Theses and Dissertations
Scientific and Scholarly Writing
Research Methodology and Analysis
Higher Education and Knowledge Production
Sample Text
نظم نا اصطلاحی معنی: کشف تنقیدی اصطلاحات تے ٹی نظم نا تعریف داوڑٹی اے۔ "نظم نا عام مفہو م نا مطابق ہر منظوم کلام نظم اے۔ولے نظم نا اسہ محدود او معنی اسے ۔اونا مطابق نظم اسہ صنف اے سخن اسے۔" (3)