Mozilla Common Voice Spontaneous Speech ASR Shared Task Test Data
License:
CC0-1.0
Steward:
Common Voice
Task: ASR
Release Date: 12/1/2025
Format: MP3, TSV
Size: 784.80 MB
Description
A bundle of the held-out test data for the Mozilla Common Voice Spontaneous Speech ASR shared task.
Specifics
Considerations
Restrictions/Special Constraints
You agree that you will not re-host or re-share this dataset.
Forbidden Usage
You agree not to attempt to determine the identity of speakers in the Common Voice dataset.
Processes
Intended Use
This dataset is intended to be used for generating test predictions using automatic speech recognition (ASR) models.
Metadata
This dataset contains the held-out test data for the Mozilla Common Voice Spontaneous Speech ASR shared task.
Languages
The following languages' test data is included in this dataset: aln, bew, bxk, cgg, el-CY, hch, kcn, koo, led, lke, lth, meh, mmc, pne, ruc, rwm, sco, tob, top, ttj, ukv, ady, bas, kbd, qxp, ush.
Structure
The data is laid out as follows, with one directory containing all of the test audio files, and three directories corresponding to tasks 1 (multilingual general), 3 (best improvement with a small model), and 4 (unseen languages).
audios/ # all audio files for all languages/tasks
spontaneous-speech-ady-67085.mp3
spontaneous-speech-ady-67086.mp3
...
multilingual-general/ # audio paths for the 21 languages of task 1
aln.tsv
bew.tsv
...
small-model/ # audio paths for every language
ady.tsv
aln.tsv
...
unseen-langs/ # audio paths for unseen languages
ady.tsv
...
Each .tsv file contains an "audio_file" column giving the name of the sample's audio file (found in the audios directory), and an empty "sentence" column, which you should populate with your predicted transcriptions. If you don't wish to submit predictions for a language, just leave the file as is (with an empty "sentence" column).
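As a minimal sketch, the fill-in step can be done with Python's csv module. Here transcribe() is a hypothetical placeholder for your model's inference call, and the example row is synthetic:

```python
import csv
import io

def transcribe(audio_file):
    # Hypothetical placeholder: replace with your ASR model's inference call.
    return "predicted text for " + audio_file

def fill_predictions(tsv_text):
    """Read a task TSV and return it with the "sentence" column populated."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    for row in rows:
        row["sentence"] = transcribe(row["audio_file"])
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["audio_file", "sentence"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# A TSV as shipped in the bundle: filled "audio_file" column, empty "sentence" column.
example = "audio_file\tsentence\nspontaneous-speech-ady-67085.mp3\t\n"
print(fill_predictions(example))
```

The same loop works for real files if you swap the in-memory strings for open() handles on each per-language .tsv.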
Example dataset loading:
import os
from datasets import Audio, load_dataset

audio_dir = "/path/to/test/data/audios/"

def expand_audio_path(p):
    p["audio_path"] = os.path.join(audio_dir, p["audio_file"])
    return p

dataset = (
    load_dataset("csv", data_files=tsv_file_path, delimiter="\t")  # tsv_file_path: one of the task .tsv files
    .map(expand_audio_path)
    .cast_column("audio_path", Audio(sampling_rate=16000))  # resamples to the rate your model expects
    .rename_column("audio_path", "audio")
)
Submitting your predictions
To submit your predictions on Codabench, package the three task directories included here into a single zip file (excluding the audios directory).
Fill in the second ("sentence") column of each .tsv with the transcription corresponding to the audio file in the first column.
If you choose not to submit predictions for some languages, you can either remove the corresponding .tsv files or leave them as is (for the aggregate metric in task 1, missing languages will be given a WER of 1.0).
Compress the three directories (multilingual-general, small-model, and unseen-langs) into a single zip file (with, e.g., zip -r my_submission.zip *), and submit via the "My Submissions" tab on the competition page.
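As an alternative sketch using Python's standard-library zipfile instead of the zip CLI, the snippet below builds an archive and checks that its top-level layout matches the three expected directories. The file contents here are placeholders:

```python
import io
import zipfile

REQUIRED_DIRS = ["multilingual-general", "small-model", "unseen-langs"]

def build_submission(files):
    """files: mapping of archive path -> TSV text. Returns the zip as bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path, text in files.items():
            zf.writestr(path, text)
    return buf.getvalue()

def top_level_dirs(zip_bytes):
    """Return the sorted set of top-level directory names in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return sorted({name.split("/")[0] for name in zf.namelist()})

# Placeholder contents: in practice each path maps to a filled-in prediction TSV.
files = {
    "multilingual-general/aln.tsv": "audio_file\tsentence\n",
    "small-model/ady.tsv": "audio_file\tsentence\n",
    "unseen-langs/ady.tsv": "audio_file\tsentence\n",
}
submission = build_submission(files)
print(top_level_dirs(submission))
```

A quick check like top_level_dirs(...) == REQUIRED_DIRS before uploading can catch a common mistake: zipping a parent folder so the archive's top level is one extra directory deep.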
Evaluation
Your submission will receive one score per task (4 in total), reported on the Codabench leaderboard:
Multilingual General: The average WER for all 21 languages for which spontaneous speech training data was provided.
Biggest Improvement over Baseline: For every language, the difference between our baseline WER and your WER for that language will be calculated. The language with the biggest improvement over our baseline will be selected, and the difference in WER will be reported here.
Best Small Model: Same as the above, but only considering results in the "small-model" directory, where your results should have been generated with a model under 500 MB in size.
Unseen Languages: The average WER for the 5 "unseen" languages (languages for which no spontaneous speech training data was released).
More information (like individual language scores) will be visible in the submission logs.
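For reference, word error rate (WER) is word-level edit distance divided by the number of reference words. The organizers' exact text normalization is not specified here, so the sketch below is illustrative only:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between word prefixes.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat"))  # perfect match -> 0.0
print(wer("the cat sat", "the cat"))      # one deletion out of 3 reference words -> 1/3
```

Note that a missing language in task 1 is scored as WER 1.0, i.e., as if every reference word were wrong.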