Bojonegoro Javanese TTS

License:

CC-BY-SA-4.0

Steward:

Community

Task: TTS

Release Date: 2/19/2026

Format: .tar.gz, WEBM

Size: 469.50 MB

Description

The Bojonegoro Javanese TTS dataset contains more than 8 hours of audio recorded by native speakers of Javanese, specifically the Bojonegoro dialect of East Java, Indonesia. It represents the Bojonegoro dialect as well as variations of the Aneman dialect. The recordings cover a variety of everyday topics describing daily life in the Bojonegoro area. The speech includes code-mixing and code-switching between Javanese, Indonesian, and English, reflecting the multilingual nature of the community. This mixing occurs because certain words have no Javanese equivalent and the Indonesian or English forms are more commonly used in everyday conversation. The dataset is therefore suitable for non-commercial linguistic research and can be used to explore phonological and lexical variation in Javanese, particularly in the Bojonegoro dialect.

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Considerations

Restrictions/Special Constraints

This dataset may only be used for research and non-commercial purposes. All users are required to provide appropriate citation and comply with the terms of the CC-BY-SA-4.0 license.

Forbidden Usage

This dataset must not be used to identify speakers, to imitate voices, for dubbing or speech synthesis, or for any form of commercial use without the permission of the owner.

Processes

Ethical Review

This dataset was created by writing texts in the Bojonegoro dialect of Javanese, with code-mixing in Indonesian and English. Native speakers then read and recorded the texts through the hosting platform https://sabre-2.onrender.com/. The resulting audio recordings were compiled into a single dataset.

Intended Use

This dataset is intended for non-commercial linguistic research and supports the exploration of phonological and lexical variation in spoken Javanese used in Bojonegoro, East Java, Indonesia.

Metadata

Language:

This dataset uses Javanese at the Ngoko speech level. It consists of more than 8 hours of speech recorded by native speakers of the Bojonegoro Javanese dialect, including variations of the Aneman dialect. The recordings cover everyday topics in Bojonegoro, East Java, and feature code-mixing and code-switching between Javanese, Indonesian, and English. The dataset is suitable for non-commercial linguistic research, particularly for exploring phonological and lexical variation in the Bojonegoro Javanese dialect.

Source(s):

Created by the owners of the dataset, who are linguists and native speakers of the Bojonegoro dialect of Javanese.

Domain(s):

This collection showcases the use of the Javanese language across everyday topics, such as personal daily activities, community activities in Bojonegoro, East Java, Indonesia, opinions on education, views on arts and culture, vacation experiences, social media usage, and more.

Size:

More than 8 hours for TTS

Structure:

Audio file name, text
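
The structure above pairs each audio file name with its text, and the release is listed as WEBM audio in a .tar.gz archive. The following is a minimal Python sketch of how such a release might be read. It assumes, since the card does not say so, that the pairs are stored as tab-separated lines; the function names read_pairs and list_webm_members and any file names used with them are hypothetical.

```python
import csv
import io
import tarfile


def read_pairs(manifest_text, delimiter="\t"):
    """Parse two-column manifest text into (audio_file_name, text) pairs.

    Assumption: one pair per line, separated by `delimiter` (tab by default).
    The dataset card does not specify the actual delimiter or file name.
    """
    reader = csv.reader(
        io.StringIO(manifest_text),
        delimiter=delimiter,
        quoting=csv.QUOTE_NONE,  # keep quotation marks in the text as-is
    )
    pairs = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        audio_name = row[0].strip()
        # Rejoin in case the text itself contains the delimiter.
        text = delimiter.join(row[1:]).strip()
        pairs.append((audio_name, text))
    return pairs


def list_webm_members(archive_path):
    """Return the sorted names of .webm files inside a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.lower().endswith(".webm")
        )
```

Comparing the file names returned by read_pairs against those returned by list_webm_members is one way to check that every transcript line has a matching recording before any analysis.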

Sample:

“...dikon ngancani ndek e golek klambi...”

“...format reality show sing interaktif...”

Writing System:

Latin alphabet (A–Z), Arabic numerals (0–9)