Brahui Research Work Corpus

License:

CC-BY-NC-SA-4.0

Steward:

Balochistan Educational and Cultural Organization

Task: NLP

Release Date: 2/10/2026

Format: TXT

Size: 1.13 MB


Description

This corpus consists of research works, academic theses, and scholarly papers, comprising approximately 185,000 tokens. It covers a range of academic topics and formal registers, reflecting standardized writing practices and disciplinary conventions. The corpus is intended to support linguistic analysis, discourse studies, and computational research on academic and research-oriented text.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

The corpus is restricted to research and non-commercial use, and users must provide proper attribution when using or redistributing the data.

Forbidden Usage

The corpus must not be used for commercial exploitation, surveillance, or any activity that misrepresents, harms, or exploits Brahui-speaking communities.

Processes

Ethical Review

The corpus is shared with permission from all contributing authors and researchers.

Metadata

Language

Brahui (Brahui: براہوئی; also romanized as Brahvi or Brohi) is a Dravidian language primarily spoken in the central and southern regions of Pakistan’s Balochistan province, with additional speaker communities in Iranian Baluchestan, Afghanistan, and Turkmenistan, particularly around Merv. Smaller expatriate Brahui communities are also found in Iraq, Qatar, and the United Arab Emirates. Geographically, Brahui is highly isolated from other Dravidian languages, with its nearest linguistic relatives located more than 1,500 kilometers away in South India. Within Balochistan, Brahui is predominantly spoken in the districts of Kalat, Khuzdar, Mastung, Quetta, Bolan, Nasirabad, Nushki, and Kharan.

Required Processing

The required processing includes collecting and digitizing source materials, cleaning and normalizing the text to ensure consistency, and applying tokenization and sentence segmentation. Additional steps involve annotating metadata such as document type, author, and year, as well as conducting quality checks to validate accuracy and usability for research and computational purposes.
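The cleaning, sentence segmentation, and tokenization steps described above could be sketched roughly as follows. This is a minimal illustration, not the steward's actual pipeline: the function names are hypothetical, and it assumes that sentences end with the Arabic full stop (U+06D4), the Arabic question mark (U+061F), or an exclamation mark, and that a token is either a run of word characters (with optional Arabic diacritics) or a single punctuation mark.

```python
import re
import unicodedata

# Assumption: sentence-final marks are U+06D4 (Arabic full stop),
# U+061F (Arabic question mark), and "!".
SENTENCE_BREAK = re.compile(r"(?<=[\u06D4\u061F!])\s*")

# Assumption: a token is a run of word characters (optionally carrying
# Arabic diacritics U+064B-U+065F, U+0670) or one punctuation mark.
TOKEN = re.compile(r"[\w\u064B-\u065F\u0670]+|[^\w\s]")


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split after each sentence-final mark, even when no space follows it."""
    parts = SENTENCE_BREAK.split(text)
    return [part.strip() for part in parts if part.strip()]


def tokenize(sentence: str) -> list[str]:
    """Return word and punctuation tokens in their original order."""
    return TOKEN.findall(sentence)


def count_tokens(text: str) -> int:
    """Count tokens across all sentences of a normalized text."""
    return sum(len(tokenize(s)) for s in split_sentences(normalize(text)))
```

Running `count_tokens` over the TXT file would give one possible token count; it may differ from the reported figure of approximately 185,000, since the tokenization rules used for that figure are not stated here.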

Alphabet Set in Brahui

ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل ڷ م ن ں و ہ ھ ی ے
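As a quality check, the alphabet set above can be used to flag letters in the corpus that fall outside it. The sketch below is illustrative only: `BRAHUI_LETTERS` and `out_of_alphabet` are hypothetical names, and flagged characters are best treated as candidates for review rather than errors, since real text may legitimately contain code points not in the list (for example hamza-bearing forms, or Latin letters in citations).

```python
import unicodedata
from collections import Counter

# Letters copied from the alphabet set listed above.
BRAHUI_LETTERS = set(
    "ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل ڷ م ن ں و ہ ھ ی ے".split()
)


def out_of_alphabet(text: str) -> Counter:
    """Count letter characters that are not in the listed alphabet.

    Only characters whose Unicode general category starts with "L"
    (letters) are considered; digits, punctuation, whitespace, and
    combining marks are ignored.
    """
    counts = Counter()
    for ch in text:
        if unicodedata.category(ch).startswith("L") and ch not in BRAHUI_LETTERS:
            counts[ch] += 1
    return counts
```

Calling `out_of_alphabet(text).most_common(10)` on the corpus text would list the most frequent unexpected letters, which can help decide whether a normalization rule is needed.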

Domains of the Text

The data is curated from research work produced under the Balochistan Educational and Cultural Organization (BECO) on the promotion and history of the Brahui community.

  • Academic Research and Scholarship

  • Theses and Dissertations

  • Scientific and Scholarly Writing

  • Research Methodology and Analysis

  • Higher Education and Knowledge Production

Sample Text

نظم نا اصطلاحی معنی: کشف تنقیدی اصطلاحات تے ٹی نظم نا تعریف داوڑٹی اے۔ "نظم نا عام مفہو م نا مطابق ہر منظوم کلام نظم اے۔ولے نظم نا اسہ محدود او معنی اسے ۔اونا مطابق نظم اسہ صنف اے سخن اسے۔" (3)