Mandar Spontaneous Speech

License:

CC-BY-NC-4.0

Steward:

Community

Task: ASR

Release Date: 2/2/2026

Format: MP3, TSV

Size: 534.45 MB


Description

Mandar Spontaneous Speech is a representative dataset of the Mandar language and contains a variety of dialects, particularly those used in Majene and Polewali Mandar. It also includes Mandar–Indonesian code-switching varieties that reflect both traditional and modern forms of speech. This dataset can be used for linguistic research, cultural documentation, urban sociolinguistic studies, and the development of language technologies based on regional languages with Indonesian code-mixing.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

Use of this dataset is limited to academic and industrial research. Commercial implementation is forbidden. For industrial research, we accept compensation of any kind.

Forbidden Usage

Commercial use of this dataset is forbidden.

Processes

Ethical Review

Each participant was informed about the data collection process and voluntarily provided informed consent to participate. Participants retained the right to withdraw from the dataset at any time by contacting the authors.

Metadata

Language:

This dataset represents the Majene and Polewali Mandar dialects with Indonesian code-switching, spoken by male and female speakers in their 20s and 30s.

Source(s):

The dataset consists of spontaneous speech that the dataset owners collected by inviting participants to respond orally to a series of questions.

Domain(s):

This dataset covers general topics, including daily life activities, education, cultural ceremonies, language and identity, physical and mental states, flora and fauna, weather and seasons, holidays, folklore, local horror stories, social media, and local and national current news.

Size:

10 hours, 603 MB

Structure:

The columns in the .tsv file contain the following information:

"audio file": the name of the audio file

"text": speech transcriptions

Sample:

“Nah satu-satunna tania satu-satunna salah satu dari maiqdi cara mala dipake toh malestarikan bahasa daerah muaq iyau simata maqbahasa mandaraq diniq di oroangu karena memang siqita tau parattaq, siqita tau meskipun indangi sittengang daerah tapi pole mai tau di siola paqbanuai tau maqbahasa mandari tau. Meskipun malai tau maqbahasa Indonesia toh, malai tau mengikut bahasana to diniq dioroi diteqe di banuanna tau e tapi andangi.”

“Terus berita-berita atau kabar-kabar diteqe yang paling berkesan, indang tobandi disanga berkesan sih cuma ya simata uingarang diqo.”

“Mua peristiwa horor yang terjadi diteqe ya indnagaq rua maqalaami iyau sih cuma yang uirrangi diqo solau maqalami nacaritangangaq.”

“Terus diqe hobi malai mejadi sumber atau ladang pendapatanna tau laeng ya menurut u iyau setelah maqita diqo tau-tau naposting diqo di media sosial maiqdi diqo ma apresiasi i.”

“Momen diqo wattu muaq sadarma hobi diqo mulaimaq bosan manjalani diqo hobi o ya diang satu wattu ya jenuhmaq diqo jalani diqo hobi toh, indammi mario usaqding manjalani diqo hobi o.”

Writing System:

Latin alphabet (A–Z), Arabic numerals (0–9)