Common Voice Spontaneous Speech 1.0 - Betawi

Specifics

Licensing

CC0-1.0

https://creativecommons.org/publicdomain/zero/1.0/legalcode

Considerations

Restrictions/Special Constraints

You agree that you will not re-host or re-share this dataset.

Forbidden Usage

You agree not to attempt to determine the identity of speakers in the Common Voice dataset.

Processes

Intended Use

This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.

Metadata

Betawi — Betawi (`bew`)

This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Betawi (bew). The dataset contains 11 hours of recorded speech (11 hours validated) from 21 speakers.

Language

Betawi, in full Melayu-Betawi, belongs to the Austronesian language family. It is considered one of the Malay dialects, but historically it developed alongside other major languages, such as Arabic, Hokkien, Sundanese, Javanese, and the Malay of Sumatra, with a small contribution from Portuguese and Dutch. Its language vitality status is Endangered according to https://www.ethnologue.com/language/bew/. Today, Standard Indonesian and English influence native speakers, so code-switching and code-mixing occur in spontaneous speech. The specific variety in this dataset is Betawi Ora or Betawi Pinggiran (Peripheral Betawi), recorded in several locations in Bekasi District/City, West Java Province, Indonesia. This variety is unusual in geopolitical terms: the language is spoken within the community but is not taught at school. Instead, the community is taught Sundanese, which is the dominant language in West Java Province in general.

Demographic information

The dataset includes the following distribution of age and gender.

Gender

Self-declared gender information. Frequency refers to the number of clips annotated with this gender.

Age

Self-declared age information. Frequency refers to the number of clips annotated with this age band.

Transcriptions

The transcription system uses the general Latin script, but distinguishes three allophonic variants of /e/: /é/, /è/, and /e/.
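Some ASR pipelines prefer a smaller output alphabet. The following is a minimal sketch of one optional normalisation that folds the accented e letters found in the transcriptions (é, è, ȇ) into plain e. The datasheet does not prescribe this step, the function name `fold_e_variants` is hypothetical, and folding discards the distinction between the variants, so it should only be applied when that distinction is not needed.

```python
import unicodedata

# Map each accented e (lower and upper case) to its plain counterpart.
# Assumption: these are the only diacritic letters to fold.
_E_FOLD = str.maketrans({
    "\u00e9": "e",  # é
    "\u00e8": "e",  # è
    "\u0207": "e",  # ȇ
    "\u00c9": "E",  # É
    "\u00c8": "E",  # È
    "\u0206": "E",  # Ȇ
})

def fold_e_variants(text: str) -> str:
    """Return `text` with é, è and ȇ replaced by plain e."""
    # NFC composes "e" + combining accent into a single code point first,
    # so decomposed input is folded as well.
    return unicodedata.normalize("NFC", text).translate(_E_FOLD)

print(fold_e_variants("yang penting anaknyé sekolȇh"))
```

Keeping the original transcript alongside the folded one makes it possible to evaluate a model with or without the distinction.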

Writing system

Historically, this language was written in Pegon, an Arabic-based script, but the Latin script has now been adopted. The writing system in this dataset uses the general Latin script, but distinguishes three allophonic variants of /e/: /é/, /è/, and /e/.

Symbol table

a b c d é è ȇ e f g h i j k l m n o p q r s t u v w y z
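The symbol table can be used to check transcripts for unexpected letters before training. The following is a minimal sketch, assuming transcripts are available as plain Python strings. The function name `out_of_table` is hypothetical, and the check ignores case, digits, punctuation, and whitespace.

```python
import unicodedata

# Symbol table copied from the datasheet, normalised to composed (NFC) form.
SYMBOLS = set(
    unicodedata.normalize(
        "NFC",
        "a b c d é è ȇ e f g h i j k l m n o p q r s t u v w y z",
    ).split()
)

def out_of_table(text: str) -> set:
    """Return the letters in `text` that are not in the symbol table."""
    normalized = unicodedata.normalize("NFC", text).lower()
    return {ch for ch in normalized if ch.isalpha() and ch not in SYMBOLS}

sample = "Begimané pendidikan keluargé di masyarakat sekitar Ente?"
print(out_of_table(sample))  # an empty set means every letter is in the table
```

A non-empty result may indicate code-switched words or a transcription error, so it is a prompt for manual review rather than a reason to discard the clip.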

Questions

There follows a randomly selected sample of questions from the corpus.

Begimané pendidikan keluargé di masyarakat sekitar Ente?
Pigimané masyarakat di lingkungan Ente ngejagé atow melestarikan alam di sekitar?
Menurut Ente déwék, seberapé besar peran tuh kesenian buat acaré khusus?
Elmu apé nyang bakalan penting buat dipelajarin di masé depan?
Begimané caré kité ngedidik generasi mudé biar lebih peduli ngejagé lingkungan?

Responses

There follows a randomly selected sample of transcribed responses from the corpus.

Kalo di sini mah pendidikan paling utama, ya minimal lulus SMA.
Jadinya di sini mah bagan emaknya pada kuli nandur yang penting anaknya sekolȇh.
Jadinya diusahain banget pendidikan di sini.
Bagan boleh utang kék, èmaknya boleh kuli nandur, kuli nyuci, yang penting anaknyé sekolȇh.

Community links

https://referensi.data.kemendikdasmen.go.id/budayakita/wbtb/objek/AA000491
https://petabahasa.kemdikbud.go.id/ (The Peta Bahasa website does not treat the Betawi language as part of Indonesia, particularly in Jakarta and West Java Province.)

Contribute

Common Voice: Spontaneous Speech

Datasheet authors

Yacub Fahmilda
Riska Legistari Febri

Funding

This dataset was fully funded by the Open Multilingual Speech Fund.

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to attempt to determine the identity of speakers in the dataset.
