Common Voice Spontaneous Speech 1.0 - Betawi

License:

CC0-1.0

Steward:

Common Voice

Task: ASR

Release Date: 9/15/2025

Format: MP3

Size: 213.90 MB


Description

A collection of spontaneous spoken phrases in Betawi.

Considerations

Restrictions/Special Constraints

You agree that you will not re-host or re-share this dataset

Forbidden Usage

You agree not to attempt to determine the identity of speakers in the Common Voice dataset

Processes

Intended Use

This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.

Metadata

Betawi — Betawi (bew)

This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Betawi (bew). The dataset contains 11 hours of recorded speech (11 hours validated) from 21 speakers.

Language

Betawi, in full Melayu-Betawi, belongs to the Austronesian language family. It is considered one of the Malay dialects, but historically it developed alongside other major languages, such as Arabic, Hokkien, Sundanese, Javanese, and the Malay of Sumatra, with a small contribution from Portuguese and Dutch. Its language vitality status is Endangered according to https://www.ethnologue.com/language/bew/. At present, standard Indonesian and, more generally, English influence native speakers, so code switching and code mixing occur in spontaneous speech. The variety in this dataset is Betawi Ora, or Betawi Pinggiran (Peripheral Betawi), collected from several locations in Bekasi District/City, West Java Province, Indonesia. This variety is unusual in geopolitical terms: the language is spoken only within the community and is not taught at school. Instead, the community is taught Sundanese, which is dominant in West Java Province in general.

Demographic information

The dataset includes the following distribution of age and gender.

Gender

Self-declared gender information. Frequency refers to the number of clips annotated with this gender.

Age

Self-declared age information. Frequency refers to the number of clips annotated with this age band.

Transcriptions

The transcription system uses the general Latin script, with three allophonic variants of /e/ distinguished in writing: /é/, /è/, and /e/.

Writing system

Historically, this language was written in Pegon (Arabic script), but the Latin script has now been adopted. The writing system in this dataset uses the general Latin script, with three allophonic variants of /e/ distinguished in writing: /é/, /è/, and /e/.

Symbol table
a b c d é è ȇ e f g h i j k l m n o p q r s t u v w y z
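
The symbol table above can be used to sanity-check transcriptions. The following is a minimal sketch, not part of the dataset tooling: the function name `out_of_table` is hypothetical, and it assumes that transcripts may mix precomposed and decomposed accented characters, so it normalises to Unicode NFC before comparing.

```python
import unicodedata

# Symbol table copied from this datasheet, normalised to NFC so that
# accented letters are stored as single code points.
SYMBOLS = set(
    unicodedata.normalize(
        "NFC", "a b c d é è ȇ e f g h i j k l m n o p q r s t u v w y z"
    ).split()
)


def out_of_table(text):
    """Return the set of letters in `text` that are not in the symbol table.

    Hypothetical helper for illustration. Case is folded to lower case;
    digits, whitespace, and punctuation are ignored.
    """
    # NFC composes a base letter plus combining accent into one code point,
    # so decomposed and precomposed spellings compare equal.
    normalised = unicodedata.normalize("NFC", text).lower()
    return {ch for ch in normalised if ch.isalpha() and ch not in SYMBOLS}
```

A transcript that uses only letters from the table yields an empty set; any other letter (for example "x", which is absent from the table) is returned so it can be reviewed by hand.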

Questions

There follows a randomly selected sample of questions from the corpus.

Begimané pendidikan keluargé di masyarakat sekitar Ente?
Pigimané masyarakat di lingkungan Ente ngejagé atow melestarikan alam di sekitar?
Menurut Ente déwék, seberapé besar peran tuh kesenian buat acaré khusus?
Elmu apé nyang bakalan penting buat dipelajarin di masé depan?
Begimané caré kité ngedidik generasi mudé biar lebih peduli ngejagé lingkungan?

Responses

There follows a randomly selected sample of transcribed responses from the corpus.

Kalo di sini mah pendidikan paling utama, ya minimal lulus SMA.
Jadinya di sini mah bagan emaknya pada kuli nandur yang penting anaknya sekolȇh.
Jadinya diusahain banget pendidikan di sini.
Bagan boleh utang kék, èmaknya boleh kuli nandur, kuli nyuci, yang penting anaknyé sekolȇh.

Recommended post-processing

(1) Pay attention to non-linguistic elements, such as fillers. (2) Make sure your machine learning model does not treat suprasegmental features, such as intonation, as distinctive when they do not change a word or its meaning.
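
As one possible way to handle fillers in step (1), the sketch below removes filler tokens from a transcript and reports how many were removed. It is an illustration under stated assumptions: `strip_fillers` is a hypothetical helper, and the tokens in `EXAMPLE_FILLERS` are placeholders, not a list taken from this dataset. Replace them with the fillers actually observed in the transcriptions.

```python
# Placeholder filler tokens for illustration only (an assumption, not
# drawn from the dataset); supply your own list after inspecting the data.
EXAMPLE_FILLERS = frozenset({"eh", "ehm", "hmm"})


def strip_fillers(text, fillers=EXAMPLE_FILLERS):
    """Return (cleaned_text, removed_count) for a single transcript.

    A token counts as a filler when, ignoring case and surrounding
    punctuation, it matches an entry in `fillers`.
    """
    kept = []
    removed = 0
    for token in text.split():
        # Compare on the bare word so "Eh," and "eh" are treated alike.
        bare = token.strip(".,!?;:").lower()
        if bare in fillers:
            removed += 1
        else:
            kept.append(token)
    return " ".join(kept), removed
```

Keeping the count alongside the cleaned text makes it possible to observe how often fillers occur before deciding whether to remove them or keep them as tokens for training.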

Community links

Contribute

Datasheet authors

  • Yacub Fahmilda

  • Riska Legistari Febri

Funding

This dataset was fully funded by the Open Multilingual Speech Fund.

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.