TTS Javanese - Ngapak Dialect
License:
CC-BY-SA-4.0
Steward:
Community
Task: TTS
Release Date: 2/10/2026
Format: WEBM, TSV
Size: 567.12 MB
Description
This dataset captures the linguistic variety spoken along the North Coast (Pantura) of Central Java Province, Indonesia. Unlike the inland varieties of Javanese, which are heavily stratified by speech level, this dialect reflects the spirit of coastal communities. The recordings feature the distinctive Ngapak phonology, specifically the heavy retention of the vowel /a/ (a-jejeg), which sets it apart from the standard Solo-Yogyakarta dialect. The dataset is a rich resource for analyzing the non-hierarchical registers often used in this region, and it supports the development of speech technologies that recognize and process regional accents outside the dominant standard Javanese.
Specifics
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Considerations
Restrictions/Special Constraints
This dataset is strictly reserved for academic research and the advancement of language technology for under-resourced dialects. Any commercial deployment requires a separate agreement with the creators, who accept any form of compensation for commercial use.
Forbidden Usage
Dataset users are prohibited from using this dataset to create realistic fake media, clone the voices of participants without consent, or train models intended for malicious impersonation.
Processes
Ethical Review
Data collection adhered to ethical guidelines, and every speaker provided verbal consent. Participants were fully aware that their contributions would be used to build open-source language resources.
Intended Use
This dataset is designed to aid in the preservation of the Ngapak dialect and to evaluate the performance of AI models on non-standard Javanese varieties commonly spoken in the Pantura region.
Metadata
Language:
This dataset uses the Ngapak dialect of Javanese, with code-mixing from English and Indonesian. It highlights the unique phonological characteristics, Ngoko speech level, and informal social vocabulary used by speakers in Tegal and Pemalang Regencies.
Source(s):
Created by the dataset owners, who are linguists and native speakers of the Ngapak dialect of Javanese from Tegal and Pemalang, with authentic accent and vocabulary usage.
Domain(s):
This dataset covers a broad range of everyday subjects, including typical daily activities, health, work, and education, as well as personal opinions on social media trends and travel experiences.
Size:
10 hours, 567.12 MB
Technical Datasheet:
10 hours
Structure:
Audio file name, text
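As a minimal sketch of how the two-column structure above might be read, the Python snippet below parses tab-separated rows into (audio file name, text) records. It is an illustration under assumptions, not the dataset's official loader: the field names `audio_file` and `text`, the absence of a header row, and the file names `clip_0001.webm` and `clip_0002.webm` are all hypothetical. The transcript strings are taken from the samples on this page.

```python
import csv
import io
from typing import List, NamedTuple


class Utterance(NamedTuple):
    audio_file: str  # hypothetical name for the first TSV column
    text: str        # hypothetical name for the second TSV column


def read_transcripts(tsv_stream) -> List[Utterance]:
    """Parse two-column, tab-separated rows into Utterance records."""
    # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
    reader = csv.reader(tsv_stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        rows.append(Utterance(audio_file=row[0], text=row[1].strip()))
    return rows


# Stand-in for the real TSV file; the file names are invented for illustration.
sample_tsv = (
    "clip_0001.webm\tNyong ora ngarti, kue beneran apa ora critane.\n"
    "clip_0002.webm\tBocah-bocah pancen angel dikandhani, ndableg banget.\n"
)

utterances = read_transcripts(io.StringIO(sample_tsv))
for u in utterances:
    print(u.audio_file, "->", u.text)
```

To read from disk, the same function can be given an open file handle, for example `open(path, encoding="utf-8", newline="")`. If the actual TSV has a header row, it would need to be skipped first.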
Sample:
"Nyong ora ngarti, kue beneran apa ora critane."
"Nyong mbayangna omah sing asri akeh wit-witan"
"belanda swedia dll wis pada mbagikena pamflet nganggo wargane"
"Bocah-bocah pancen angel dikandhani, ndableg banget."
"Mugane nang ndesa, nyong biasa ndeleng kewan urip playon-playonan nang alas"
Writing System:
The textual data uses the Latin alphabet. Special attention is given to the orthography of the vowel 'a' so that it faithfully represents the pronunciation of the Tegal-Pemalang (Ngapak) dialect, in contrast to standard Javanese orthography conventions, where a written 'a' might represent 'o'.