Mozilla Common Voice Spontaneous Speech ASR Shared Task Train/Dev Data

Specifics

Licensing

CC0-1.0

https://creativecommons.org/publicdomain/zero/1.0/legalcode

Considerations

Restrictions/Special Constraints

You agree that you will not re-host or re-share this dataset.

Forbidden Usage

You agree not to attempt to determine the identity of speakers in the Common Voice dataset.

Processes

Intended Use

This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.

Metadata

Data for the Mozilla Common Voice Spontaneous Speech ASR Shared Task

This datasheet is for the set of Mozilla Common Voice spontaneous speech datasets to be used in the Mozilla Data Collective Shared Task on Spontaneous Speech. It contains a train and a validation set for each of 21 languages. The dataset for each included language can be found individually on Mozilla Data Collective as well.

The test sets will be released separately at a later date.

The following is the list of languages included, with links to each language's corresponding datasheet on Mozilla Data Collective (MDC):

Gegnisht — Gheg Albanian (aln): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Gheg Albanian (aln). The dataset contains 11 hours of recorded speech (11 hours validated) from 14 speakers.
Betawi — Betawi (bew): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Betawi (bew). The dataset contains 11 hours of recorded speech (11 hours validated) from 21 speakers.
Bukusu — Bukusu (bxk): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Bukusu (bxk). The dataset contains 2934 clips representing 15 hours of recorded speech (11 hours validated) from 27 speakers.
Cypriot Greek — Cypriot Greek (el-CY): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Cypriot Greek (el-CY). The dataset contains 1284 clips representing 11 hours of recorded speech (11 hours validated) from 10 speakers.
Wixárika — Wixárika (hch): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Wixárika (hch). The dataset contains 1553 clips representing 11 hours of recorded speech (11 hours validated) from 10 speakers.
Nubi — Nubi (kcn): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Nubi (kcn). The dataset contains 2719 clips representing 15 hours of recorded speech (10 hours validated) from 26 speakers.
Konzo — Konzo (koo): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Konzo (koo). The dataset contains 3255 clips representing 15 hours of recorded speech (11 hours validated) from 28 speakers.
Lendu — Lendu (led): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Lendu (led). The dataset contains 2882 clips representing 16 hours of recorded speech (11 hours validated) from 26 speakers.
Kenyi — Kenyi (lke): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Kenyi (lke). The dataset contains 2791 clips representing 13 hours of recorded speech (11 hours validated) from 26 speakers.
Thur — Thur (lth): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Thur (lth). The dataset contains 3238 clips representing 34 hours of recorded speech (11 hours validated) from 29 speakers.
Mixteco Yucuhiti — Southwestern Tlaxiaco Mixtec (meh): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Southwestern Tlaxiaco Mixtec (meh). The dataset contains 1057 clips representing 11 hours of recorded speech (11 hours validated) from 16 speakers.
Jñatjo — Michoacán Mazahua (mmc): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Michoacán Mazahua (mmc). The dataset contains 12 hours of recorded speech (12 hours validated) from 12 speakers.
Western Penan — Western Penan (pne): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Western Penan (pne). The dataset contains 2630 clips representing 13 hours of recorded speech (13 hours validated) from 24 speakers.
Ruuli — Ruuli (ruc): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Ruuli (ruc). The dataset contains 2868 clips representing 18 hours of recorded speech (11 hours validated) from 26 speakers.
Amba — Amba (rwm): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Amba (rwm). The dataset contains 2443 clips representing 14 hours of recorded speech (11 hours validated) from 21 speakers.
Scots — Scots (sco): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Scots (sco). The dataset contains 715 clips representing 12 hours of recorded speech (11 hours validated) from 21 speakers.
Toba Qom — Toba Qom (tob): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Toba Qom (tob). The dataset contains 1611 clips representing 11 hours of recorded speech (11 hours validated) from 25 speakers.
Papantla Totonac — Papantla Totonac (top): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Papantla Totonac (top). The dataset contains 411 clips representing 11 hours of recorded speech (11 hours validated) from 10 speakers.
Rutoro — Rutoro (ttj): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Rutoro (ttj). The dataset contains 3113 clips representing 17 hours of recorded speech (11 hours validated) from 26 speakers.
Kuku — Kuku (ukv): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Kuku (ukv). The dataset contains 2586 clips representing 12 hours of recorded speech (11 hours validated) from 22 speakers.

Stats

The following table gives the amount of data (in hours and minutes) in the train and dev splits for each language. Note that not all of this data has validated transcriptions.

Language	Train	Dev
aln	6h 40m	2h 14m
bew	5h 47m	3h 4m
bxk	10h 14m	1h 58m
cgg	8h 2m	1h 58m
el-CY	6h 47m	2h 2m
hch	6h 17m	1h 43m
kcn	10h 29m	1h 47m
koo	11h 26m	1h 51m
led	11h 56m	1h 35m
lke	8h 30m	2h 7m
lth	12h	1h 36m
meh	6h 43m	1h 45m
mmc	7h 11m	2h 16m
pne	8h 20m	2h 2m
ruc	14h 13m	1h 50m
rwm	10h	1h 53m
sco	6h 50m	1h 44m
tob	4h 27m	2h 43m
top	5h 37m	2h 21m
ttj	12h 25m	2h 12m
ukv	8h 29m	1h 35m
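
The durations in the table can be totalled programmatically. The following is a minimal Python sketch, not part of the official task tooling: the helper names `parse_duration`, `format_duration` and `split_totals` are hypothetical, and only three rows from the table are included for illustration. It assumes every duration is written as `Nh Mm` or `Nh`, as in the table above.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a string such as '6h 40m' or '12h' into whole minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


def format_duration(total_minutes: int) -> str:
    """Render minutes in the table's style, omitting a zero minutes part."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m" if minutes else f"{hours}h"


# Three rows copied from the table (language code, train, dev).
ROWS = [
    ("aln", "6h 40m", "2h 14m"),
    ("lth", "12h", "1h 36m"),
    ("rwm", "10h", "1h 53m"),
]


def split_totals(rows):
    """Return (train_minutes, dev_minutes) summed over the given rows."""
    train = sum(parse_duration(t) for _, t, _ in rows)
    dev = sum(parse_duration(d) for _, _, d in rows)
    return train, dev


train_total, dev_total = split_totals(ROWS)
print("train:", format_duration(train_total))
print("dev:", format_duration(dev_total))
```

To total all languages, extend `ROWS` with the remaining rows of the table. Because the table reports all data rather than only validated transcriptions, such totals are upper bounds on the usable training material.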

License

This dataset is released under the Creative Commons Zero (CC0 1.0) licence. By downloading this data you agree:

not to attempt to determine the identity of speakers in the dataset
not to re-host or re-share this dataset