Jember Javanese Spontaneous Speech Corpus

License:

CC-BY-NC-SA-4.0

Steward:

Universitas Gadjah Mada

Task: ASR

Release Date: 2/10/2026

Format: MP3, TSV

Size: 271.65 MB

Description

The Jember Javanese Spontaneous Speech Corpus is a spoken dataset of approximately 10 hours of audio collected from native Javanese speakers in Jember Regency, East Java, Indonesia. The corpus represents the Jember dialect of Javanese as well as the Pandhalungan variety. The recordings capture natural and spontaneous speech phenomena, including code-mixing and code-switching between Javanese, Madurese, Indonesian, and English, along with common features of spoken discourse such as reduplication and truncated word forms. These characteristics reflect authentic language use in informal and semi-formal contexts and contribute to phonological and lexical variation in the data. As such, the dataset is well suited for linguistic analysis and supports the non-commercial development and evaluation of Automatic Speech Recognition (ASR) systems for East Javanese varieties.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

Access is granted for research and non-commercial purposes only. Users must cite the dataset appropriately and comply with CC BY-NC-SA 4.0. Commercial use and dubbing applications are not allowed.

Forbidden Usage

Any attempt to identify speakers, clone or imitate their voices, use the data for dubbing or synthetic speech generation, or exploit the dataset for commercial purposes is strictly forbidden.

Processes

Ethical Review

This dataset was not reviewed by a formal institutional review board. All participants provided informed consent prior to recording. Participants were informed about the purpose of data collection, the intended non-commercial use of the dataset, and their right to withdraw from the dataset by contacting the dataset owner.

Intended Use

This dataset is intended for linguistic research and non-commercial ASR development, supporting the analysis of phonological and lexical variation in spoken Javanese from Jember, East Java, Indonesia.

Metadata

Language:

The recordings capture natural and spontaneous speech, including code-mixing and code-switching between Javanese, Madurese, Indonesian, and English, as well as common spoken language features such as reduplication and truncated word forms. The data were produced by adult speakers aged 17 to 50 years from diverse educational and socio-economic backgrounds.

Source(s):

A compilation of spontaneous speech recordings from native Javanese speakers in Jember, East Java, Indonesia (Pandhalungan dialect). The speakers are approximately 17 to 50 years old and come from diverse social backgrounds in terms of education, age, social status, gender, and culture.

Domain(s):

This dataset covers a broad range of everyday topics, including daily activities, family life, work, community, travel, education, and health.

Size:

271.65 MB, or approximately 10 hours of audio.

Structure:

The columns in the TSV file contain the following information:

"audio file": the name of the audio file

"start": time when speech begins

"end": time when speech ends

"text": speech transcriptions
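
The column layout above could be read with a short script along the following lines. This is a minimal sketch, not an official loader: it assumes the TSV has a header row with exactly these column names and that "start" and "end" are numeric offsets in seconds, neither of which is confirmed here. The file name example_001.mp3 and the timestamps in the sample rows are hypothetical; only the transcription text is taken from the samples in this card.

```python
import csv
import io


def read_segments(handle):
    """Yield one dict per TSV row, parsing start/end as floats.

    Assumption: the file has a header row with the columns
    "audio file", "start", "end", "text", and the timestamps
    are numeric (treated here as seconds).
    """
    # QUOTE_NONE keeps any quotation marks in transcriptions as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        start = float(row["start"])
        end = float(row["end"])
        if end < start:
            raise ValueError(f"segment ends before it starts: {row}")
        yield {
            "audio_file": row["audio file"],
            "start": start,
            "end": end,
            "duration": end - start,
            "text": row["text"],
        }


# Hypothetical rows for illustration; file name and timestamps are invented.
SAMPLE_TSV = (
    "audio file\tstart\tend\ttext\n"
    "example_001.mp3\t0.0\t2.5\tGak mbuwak pampers\n"
    "example_001.mp3\t2.5\t4.0\tmun jaremu\n"
)

segments = list(read_segments(io.StringIO(SAMPLE_TSV)))
total_duration = sum(segment["duration"] for segment in segments)
```

To read the released file instead of the in-memory sample, pass a handle opened with encoding="utf-8" and newline="" to read_segments, and adjust the timestamp parsing if the actual format differs.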

Sample(s):

"Těrus saiki sampeyan semester pira?"

"Semester pitu berarti kan wis ngerjakne skripsi ya, těrus skripsine sampeyan tentang apa?"

"Kegiatan sabên dinane sampeyan apa mas?"

"Nèk penelitianku iki intine tentang stek tanduran vanili Mbak, tanduran perkebunan."

"Gak mbuwak pampers"

"mun jaremu"

Writing System:

The textual component of this dataset (transcriptions and metadata) uses the Latin script and follows standard Javanese orthographic conventions, with reference to Bausastra Jawa. Transcriptions adhere to established rules for writing Javanese in Latin script and consistently apply relevant diacritics to represent phonological distinctions. The orthography is designed to accurately reflect spoken Javanese while maintaining standardization suitable for linguistic analysis. Code-mixed elements from Madurese, Indonesian, and English are transcribed using their respective standard Latin orthographies.
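
Because the transcriptions rely on diacritics (for example ě, ê, and è in the samples above), it may help to normalize text to a single Unicode form before comparing strings or building an ASR vocabulary. The sketch below uses NFC normalization from the Python standard library. Whether the released files use composed or decomposed characters is not stated in this card, so this is a precaution under that assumption rather than a documented requirement.

```python
import unicodedata


def normalize_transcript(text):
    """Return the text in Unicode NFC form (base letter and diacritic composed)."""
    return unicodedata.normalize("NFC", text)


# The same word written two ways: with a combining caron (decomposed)
# and with the precomposed character U+011B.
decomposed = "Te\u030crus"
composed = "T\u011brus"

same_before = decomposed == composed
same_after = normalize_transcript(decomposed) == normalize_transcript(composed)
```

Without normalization the two spellings compare as different strings even though they render identically, which can silently inflate a vocabulary or skew word error rate calculations.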

Useful Links:

https://www.sastra.org/

https://www.sastra.org/leksikon