Manggarai Language for NLP

License:

CC-BY-NC-SA-4.0

Steward:

Community

Task: TTS

Release Date: 2/13/2026

Format: WEBM, TSV

Size: 287.61 MB

Description

This dataset is a specialized linguistic collection designed to support the development of computational resources for Manggarai, a low-resource Austronesian language spoken primarily on the island of Flores, Indonesia. It bridges the gap between written and spoken language by pairing original textual prompts and responses with their corresponding high-quality audio recordings. The dataset contains approximately 50,000 words, providing a substantial baseline for linguistic analysis and model training. As a resource for a low-resource language, it is intended for use in Natural Language Processing (NLP) and speech technology.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is intended for research, education, and cultural preservation purposes.

Forbidden Usage

This dataset may not be used for commercial purposes, modified in format, or reproduced in any other form.

Processes

Ethical Review

The construction of this dataset followed a two-stage process. First, a series of prompts was curated and written directly in the Manggarai language. These prompts were designed to capture a wide range of linguistic nuances, semantic structures, and natural speech patterns. Second, the written responses were read aloud and recorded to create a speech-to-text corpus. To ensure linguistic accuracy and lexical consistency, the author used a publicly available online Indonesian-Manggarai dictionary as the primary reference, so that the dataset remains faithful to the grammatical and morphological standards of the language. The data was gathered between October 2025 and January 2026.

Intended Use

This dataset is intended for use in creating automatic speech recognition systems.

Metadata

Language:

The language represented in this dataset is Manggarai. Within the specific technical and academic context of this research, the focus is on the Manggarai dialect of Indonesian, which refers to the variety of the Indonesian language as spoken with local Manggarai linguistic influences. This dataset captures the linguistic characteristics of the Manggarai region, including its specific phonology, morphology, and sentence structures (such as the use of minor clauses), which distinguish it from standard Indonesian and from other regional dialects. The language is spoken in the western part of Flores Island, East Nusa Tenggara (NTT), Indonesia. Its geographical distribution spans three regencies: Manggarai (with Ruteng as its central hub), West Manggarai, and East Manggarai.

Source(s):

Created by the owners of the dataset, who are linguists and native speakers of the language. Vocabulary and linguistic accuracy were verified using a publicly available online Indonesian-Manggarai dictionary at https://anyflip.com/rdptn/xhwq

Domain(s):

General domain

Technical Datasheet:

5 hours

Size:

5 hours

Structure:

Audio file name, text
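
As an illustration of the structure above, the following is a minimal Python sketch of reading the TSV into (audio file name, text) pairs. It assumes two tab-separated columns in that order and no header row; neither detail is stated in this datasheet, and the file name clip_0001.webm and the function name read_manifest are hypothetical. If the actual TSV has a header row, it would appear as the first pair and should be dropped.

```python
import csv
import io
from typing import List, Tuple


def read_manifest(tsv_text: str) -> List[Tuple[str, str]]:
    """Parse tab-separated rows into (audio file name, text) pairs.

    Assumption: column 1 is the audio file name, column 2 is the text.
    """
    # QUOTE_NONE keeps apostrophes and quotes in the text (e.g. "ho'o") literal.
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    pairs = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs


# Hypothetical row built from one of the sample sentences in this datasheet.
example = (
    "clip_0001.webm\t"
    "Ai le manga surak kabar online hitu tara tambang do ata bacang ga.\n"
)
pairs = read_manifest(example)
```

For a file on disk, the same function can be applied to the result of open(path, encoding="utf-8").read(), where path points at the TSV shipped with the dataset.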

Sample:

Ata raja uwa ho'o ga toé danga bacang surak kabar ai isé céwé laséng baca surak kabar online.

Toé cama agu uwa danong ai toé di manga danong teknologi te pandé surak kabar online.

Toé de camas surak kabar agu surak kabar online, mosé de ata te ho'on ga do taka manga surak kabar online, lorong ngitu do kéta ata bacang surak kabar online hitu.

Ai le manga surak kabar online hitu tara tambang do ata bacang ga.

Landing ata pénong kéta surak kabar online sot laséng kéta bacang de ata raja hitu surak kabar sot manga oné hp.

Writing System:

Latin alphabet (A–Z), Arabic numerals (0–9)
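
The sample sentences above also contain the accented letter é and the apostrophe (as in ho'o), so a text front end may want to check which characters actually occur before fixing a symbol set. The following is a minimal Python sketch of such a check; the function name char_inventory is hypothetical, and the choice to lowercase and to normalize to NFC is an assumption rather than something specified by the dataset.

```python
import unicodedata
from collections import Counter


def char_inventory(lines):
    """Count letters and digits across lines, case-folded to lowercase."""
    counts = Counter()
    for line in lines:
        # NFC merges "e" + combining acute into the single code point "é".
        for ch in unicodedata.normalize("NFC", line):
            if ch.isalpha() or ch.isdigit():
                counts[ch.lower()] += 1
    return counts


# One of the sample sentences from this datasheet.
sample = [
    "Ata raja uwa ho'o ga toé danga bacang surak kabar ai isé céwé "
    "laséng baca surak kabar online."
]
inventory = char_inventory(sample)
# Letters outside plain ASCII that the symbol set must cover.
non_ascii = sorted(ch for ch in inventory if ord(ch) > 127)
```

Punctuation such as the apostrophe is deliberately left out of the counts here; it can be tallied separately if the front end treats it as a distinct symbol.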

Useful Link:

Indonesian-Manggarai online dictionary at https://anyflip.com/rdptn/xhwq