TTS Javanese - Ngapak Dialect

License:

CC-BY-SA-4.0

Steward:

Community

Task: TTS

Release Date: 2/10/2026

Format: WEBM, TSV

Size: 567.12 MB


Description

This dataset captures the linguistic variety spoken along the North Coast (Pantura) of Central Java Province, Indonesia. Unlike the inland varieties of Javanese, which are heavily stratified by speech level, this dialect reflects the less hierarchical speech of coastal communities. The recordings feature the distinctive Ngapak phonology, specifically the heavy retention of the vowel /a/ (a-jejeg), which sets it apart from the standard Solo-Yogyakarta dialect. The dataset is a resource for analyzing the non-hierarchical registers common in the Pantura region and for developing speech technologies that recognize and process regional accents outside the dominant standard Javanese.

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is reserved for academic research and the advancement of language technology for under-resourced dialects. Any commercial deployment requires a separate agreement with the creators, who are open to any form of compensation for commercial use.

Forbidden Usage

Dataset users are prohibited from using this dataset to create realistic fake media, clone the voices of participants without consent, or train models intended for malicious impersonation.

Processes

Ethical Review

Data collection followed ethical guidelines, and every speaker gave verbal consent. Participants were fully aware that their contributions would be used to build open-source language resources.

Intended Use

Designed to aid in the preservation of the Ngapak dialect and to evaluate the performance of AI models on non-standard Javanese varieties commonly spoken in the Pantura region.

Metadata

Language:

This dataset uses the Ngapak variety of Javanese, with code-mixing in English and Indonesian. It highlights the distinctive phonological characteristics, Ngoko speech level, and informal social vocabulary used by speakers in Tegal and Pemalang Regencies.

Source(s):

Created by the dataset owners, who are linguists and native speakers of the Ngapak dialect of Javanese from Tegal and Pemalang, with authentic accent and vocabulary usage.

Domain(s):

This dataset covers a broad range of everyday subjects, including daily activities, health, work, and education, as well as personal opinions on social media trends and travel experiences.

Size:

10 hours, 567.12 MB

Technical Datasheet:

10 hours

Structure:

Audio file name, text
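
A minimal sketch of how the two-field structure above might be read, assuming the TSV stores one recording per line as audio file name, a tab, then the transcript, with no header row. The function name `read_metadata`, the `audio` directory, and the `clip_0001.webm` style file names are hypothetical and are not taken from the dataset itself.

```python
import csv
import io
from pathlib import Path


def read_metadata(tsv_text, audio_dir="."):
    """Parse two-column TSV text (audio file name, transcript) into dicts.

    Assumes no header row; lines with fewer than two fields are skipped.
    """
    rows = []
    # QUOTE_NONE keeps any quotation marks in the transcript as literal text.
    reader = csv.reader(
        io.StringIO(tsv_text, newline=""),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    for fields in reader:
        if len(fields) < 2:
            continue  # blank or malformed line
        file_name, text = fields[0].strip(), fields[1].strip()
        rows.append({"audio_path": Path(audio_dir) / file_name, "text": text})
    return rows


# Hypothetical file names paired with two of the sample sentences.
sample = (
    "clip_0001.webm\tNyong ora ngarti, kue beneran apa ora critane.\n"
    "clip_0002.webm\tNyong mbayangna omah sing asri akeh wit-witan\n"
)
for row in read_metadata(sample, audio_dir="audio"):
    print(row["audio_path"], "|", row["text"])
```

If the actual file has a header row or additional columns, the field indices would need to be adjusted accordingly.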

Sample:

"Nyong ora ngarti, kue beneran apa ora critane."

"Nyong mbayangna omah sing asri akeh wit-witan"

"belanda swedia dll wis pada mbagikena pamflet nganggo wargane"

"Bocah-bocah pancen angel dikandhani, ndableg banget."

"Mugane nang ndesa, nyong biasa ndeleng kewan urip playon-playonan nang alas"

Writing System:

The text uses the Latin alphabet. Special attention is given to the spelling of the vowel 'a' so that it strictly represents the pronunciation of the Tegal-Pemalang (Ngapak) dialect, in contrast to standard Javanese orthographic conventions, where a written 'a' may be pronounced as 'o'.