CV Korean Test 25.0 - Noise-Augmented (SCAI)

Description

This dataset is a noise-augmented version of the test split (test.csv) of the Mozilla Common Voice Scripted Speech 25.0 – Korean dataset. It is designed to support research in automatic speech recognition (ASR), particularly in noisy environments. The original clean speech dataset is already hosted on Mozilla Data Collective. Building on that resource, we generate additional versions of the evaluation audio by applying various types of environmental noise while leaving the original transcriptions unchanged. The goal is to enable systematic evaluation of ASR systems under realistic acoustic conditions, such as speech with background noise, so that researchers can assess model robustness.

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

None provided.

Forbidden Usage

It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.

Processes

Ethical Review

This dataset is derived from the Mozilla Common Voice dataset, where all contributors have provided informed consent for their recordings to be publicly shared and used for research purposes. No new data collection involving human participants was conducted for this dataset. The dataset consists of noise-augmented versions of the original audio while preserving the original transcriptions. No personally identifiable information (PII) has been added or modified. We adhere to the terms and conditions of the Mozilla Data Collective and ensure that the dataset is used solely for research and evaluation purposes. This dataset is intended for evaluating ASR systems under noisy conditions and does not introduce additional ethical risks beyond those of the original dataset.

Intended Use

This dataset is intended for evaluating ASR systems in noisy conditions, with a focus on robustness benchmarking rather than model training.
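
As an illustration of robustness benchmarking, the sketch below computes a character error rate (CER) between a reference transcription and an ASR hypothesis. This card does not prescribe a metric; CER, the whitespace stripping, and the function names `edit_distance` and `cer` are assumptions made for this example only. Running the same metric on the clean test split and on this noisy version would give one possible measure of degradation under noise.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative strings only, not taken from the dataset.
perfect = cer("가나다라", "가나다라")
one_missing = cer("가나다라", "가나라")
```

Here each Hangul syllable block counts as one character; a different normalization (for example, decomposing syllables into jamo or keeping spaces) would yield different scores.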

Metadata

Dataset Structure

This dataset is a noise-augmented derivative of the Mozilla Common Voice Scripted Speech 25.0 – Korean dataset and uses only its test split, for evaluation purposes. For the full data structure and metadata specifications of the source data, refer to the original dataset documentation.

The dataset is provided in JSONL format, where each line corresponds to a single audio sample and its associated metadata.

JSONL Fields

index: A unique identifier for each data instance.
raw: The relative file path to the corresponding audio file.
question_ko: The ground-truth transcription of the audio in Korean.
etc: Additional metadata fields that may be included in each entry.

Example

{
  "index": "000001",
  "raw": "commonVoice_noise/common_voice_ko_39813694_noisy.mp3",
  "question_ko": "그러면 땅도 파보고 농부들과 함께 아무것이라도 배워 가면서 할 것 같았다.",
  "etc": { ... }
}
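
Reading entries in this format can be sketched as follows. This is a minimal sketch, assuming each line is a standalone JSON object with the fields described above. The function name `iter_samples` and the file name `manifest.jsonl` in the comment are hypothetical, and the elided `etc` object is replaced by an empty placeholder in the inline sample.

```python
import json
from typing import Iterable, Iterator


def iter_samples(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Inline sample mirroring the example entry; "etc" is an empty placeholder here.
sample_lines = [
    '{"index": "000001", '
    '"raw": "commonVoice_noise/common_voice_ko_39813694_noisy.mp3", '
    '"question_ko": "그러면 땅도 파보고 농부들과 함께 아무것이라도 배워 가면서 할 것 같았다.", '
    '"etc": {}}',
    "",
]

samples = list(iter_samples(sample_lines))
for sample in samples:
    # "raw" is a relative path; "question_ko" is the reference transcription.
    print(sample["index"], sample["raw"], sample["question_ko"])

# With a real file (hypothetical name), the same function applies:
# with open("manifest.jsonl", encoding="utf-8") as f:
#     samples = list(iter_samples(f))
```

Opening the file with `encoding="utf-8"` matters because the transcriptions contain Hangul text.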