Podcast Homostoria (Indonesia)

License:

CC-BY-SA-4.0

Steward:

Community

Task: ASR

Release Date: 11/25/2025

Format: mp3

Size: 302.97 MB


Description

This dataset features discussions of modern media (including film, podcasts, and social media) and its connection to local customs and traditions. The conversations are primarily in Indonesian, with frequent code-switching into English and Javanese.

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Metadata

This dataset is derived from the Homostoria podcast. The conversations are conducted primarily in Indonesian, with frequent code-switching into English and Javanese.

Language

Bahasa Indonesia - Indonesian (id)

Domain

Discussions of global and local modern media.

Size

This dataset contains 11 hours of spontaneous speech across 16 audio files.

Process

This dataset was transcribed with an automatic transcription tool (Transkriptor) and then manually reviewed by linguists who are native speakers.

Fields

The columns in the .tsv file contain the following information:

  • "audio file": the name of the audio file

  • "start": time when speech begins

  • "end": time when speech ends

  • "text": speech transcriptions
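
The field layout above can be read with a few lines of Python. This is a minimal sketch, not an official loader: it assumes the .tsv file has a header row using exactly the four column names listed above, and it keeps "start" and "end" as raw strings because the timestamp format is not stated here. The function name read_segments, the file name episode_01.mp3, and the timestamp values in the demo are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, Iterator

# Column names as listed in the Fields section (assumed to match the header row).
EXPECTED_COLUMNS = ["audio file", "start", "end", "text"]


def read_segments(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per transcript segment from tab-separated lines."""
    # QUOTE_NONE keeps quotation marks inside transcriptions as literal text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for row in reader:
        # "start" and "end" stay as strings; their format is not documented here.
        yield {name: row[name] for name in EXPECTED_COLUMNS}


# Hypothetical in-memory example; file name and timestamps are made up.
demo_tsv = (
    "audio file\tstart\tend\ttext\n"
    "episode_01.mp3\t0.00\t1.52\tYa, secure lah.\n"
)
for segment in read_segments(io.StringIO(demo_tsv)):
    print(segment["audio file"], segment["start"], segment["end"], segment["text"])
```

To read a real file, pass an open file handle instead of the in-memory string, for example open(path, encoding="utf-8", newline=""), where path points to the dataset's .tsv file.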

Sample

Ya, secure lah. 
Ya, at least secure misal kayak gitu. 
Jadi mungkin pemaknaan gitu ya. 
Mungkin yang kita bawa itu pemaknaan bahwa self-help ini nggak hanya hal-hal yang seperti itu gitu. 
Tapi mungkin lebih luas gak sih? Kalau menurutmu gimana nih, Hans?