Mozilla Common Voice Spontaneous Speech ASR Shared Task Test Data


License: CC0-1.0


Steward: Common Voice

Task: ASR

Release Date: 12/1/2025

Format: MP3, TSV

Size: 784.80 MB



Description

A bundle of the held-out test data for the Mozilla Common Voice Spontaneous Speech ASR shared task.

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

You agree that you will not re-host or re-share this dataset.

Forbidden Usage

You agree not to attempt to determine the identity of speakers in the Common Voice dataset.

Processes

Intended Use

This dataset is intended to be used for generating test predictions using automatic speech recognition (ASR) models.

Metadata

This dataset contains the held-out test data for the Mozilla Common Voice Spontaneous Speech ASR shared task.

Languages

Test data for the following languages is included in this dataset: aln, bew, bxk, cgg, el-CY, hch, kcn, koo, led, lke, lth, meh, mmc, pne, ruc, rwm, sco, tob, top, ttj, ukv, ady, bas, kbd, qxp, ush.

Structure

The data is laid out as follows: one directory contains all of the test audio files, and three directories correspond to tasks 1 (multilingual general), 3 (best improvement with a small model), and 4 (unseen languages).

  • audios/ # all audio files for all languages/tasks

    • spontaneous-speech-ady-67085.mp3

    • spontaneous-speech-ady-67086.mp3

    • ...

  • multilingual-general/ # audio paths for the 21 languages of task1

    • aln.tsv

    • bew.tsv

    • ...

  • small-model/ # audio paths for every language

    • ady.tsv

    • aln.tsv

    • ...

  • unseen-langs/ # audio paths for unseen languages

    • ady.tsv

    • ...

Each .tsv file contains an "audio_file" column, with the name of the audio file for the sample (contained in the audios directory), and an empty "sentence" column, which you should populate with your predicted transcriptions. If you don't wish to submit predictions for a language, just leave the file as is (with an empty "sentence" column).

Example dataset loading:

import os

from datasets import Audio, load_dataset

audio_dir = "/path/to/test/data/audios/"
tsv_file_path = "/path/to/test/data/multilingual-general/aln.tsv"  # one of the per-language tsvs

def expand_audio_path(p):
    # Prepend the audios/ directory to the bare file name from the tsv.
    p["audio_path"] = os.path.join(audio_dir, p["audio_file"])
    return p

dataset = (
    load_dataset("csv", data_files=tsv_file_path, delimiter="\t")
    .map(expand_audio_path)
    .cast_column("audio_path", Audio(sampling_rate=16000))  # resamples on decode; only needed if your model expects something other than the source 32 kHz
    .rename_column("audio_path", "audio")
)
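
Example of writing predictions back into a tsv (a minimal sketch, assuming pandas; the transcribe function below is a hypothetical placeholder for your own ASR model's inference and is not provided with this dataset):

import os

import pandas as pd

audio_dir = "/path/to/test/data/audios/"
tsv_path = "/path/to/test/data/multilingual-general/aln.tsv"

def transcribe(audio_path):
    # Hypothetical placeholder: replace with your ASR model's inference call.
    return ""

df = pd.read_csv(tsv_path, sep="\t")
# Fill the empty "sentence" column with one prediction per audio file.
df["sentence"] = [transcribe(os.path.join(audio_dir, f)) for f in df["audio_file"]]
df.to_csv(tsv_path, sep="\t", index=False)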

Submitting your predictions

To submit your predictions on Codabench, your submission should be a zip file of the three task directories included here (excluding the audios directory). Populate the "sentence" column of each tsv with the transcription corresponding to the audio file in the first column. If you choose not to submit predictions for a language, you can either remove its tsv file or leave it as is (for the aggregate metric in task 1, missing languages will be given a WER of 1.0). Compress the three directories (multilingual-general, small-model, and unseen-langs) into a single zip file (with, e.g., zip -r my_submission.zip *), and submit it via the "My Submissions" tab on the competition page.
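
If you prefer to build the archive from Python rather than with the zip command above, a rough sketch (using only the standard library and the directory names from this dataset's layout) is:

import os
import zipfile

task_dirs = ["multilingual-general", "small-model", "unseen-langs"]

with zipfile.ZipFile("my_submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for task_dir in task_dirs:
        for root, _, files in os.walk(task_dir):
            for name in files:
                path = os.path.join(root, name)
                # Store each tsv with its directory prefix so the layout inside the zip matches the layout described above.
                zf.write(path, arcname=path)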

Evaluation

Your submission will receive one score per task (4 in total), reported on the Codabench leaderboard:

  • Multilingual General: The average WER for all 21 languages for which spontaneous speech training data was provided.

  • Biggest Improvement over Baseline: For every language, the difference between our baseline WER and your WER for that language will be calculated. The language with the biggest improvement over our baseline will be selected, and the difference in WER will be reported here.

  • Best Small Model: Same as the above, but only considers results in the "small-model" directory, where your predictions should have been generated with a model under 500 MB in size.

  • Unseen Languages: The average WER for the 5 "unseen" languages (languages for which no spontaneous speech training data was released).

More information (such as individual language scores) will be visible in the submission logs.
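
If you want a rough local estimate of these scores before submitting, the sketch below shows how a per-language WER, the task-1 style average, and the improvement-over-baseline metric can be computed. It assumes the jiwer package and reference transcriptions you hold for your own validation split (the official test references are held out); the baseline value is illustrative, not the organizers' number.

import jiwer

# Per-language WER from parallel lists of reference and predicted transcriptions.
references = ["a reference transcript", "another reference transcript"]
hypotheses = ["a reference transcript", "another referenced transcript"]
lang_wer = jiwer.wer(references, hypotheses)

# Task 1 style aggregate: average the per-language WERs over the 21 languages,
# using 1.0 for any language with no submitted predictions.
per_language_wer = {"aln": lang_wer, "bew": 1.0}  # shortened to two languages for illustration
task1_score = sum(per_language_wer.values()) / len(per_language_wer)

# Task 2/3 style metric: improvement of your WER over a baseline WER.
baseline_wer = 0.85  # illustrative value only, not the official baseline
improvement = baseline_wer - lang_wer
print(lang_wer, task1_score, improvement)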