Mozilla Common Voice Spontaneous Speech ASR Shared Task Train/Dev Data
Locale: mul
Size: 4.30 GB
Task: ASR
Format: mp3
License: CC-0
Data for the Mozilla Common Voice Spontaneous Speech ASR Shared Task
This datasheet is for the set of Mozilla Common Voice spontaneous speech datasets to be used in the Mozilla Data Collective Shared Task on Spontaneous Speech. It contains a train and a validation set for each of 21 languages. The dataset for each included language can be found individually on Mozilla Data Collective as well.
The test sets will be released separately at a later date.
The following is the list of languages included, with links to each language's corresponding datasheet on Mozilla Data Collective (MDC):
Gegnisht - Gheg Albanian (aln): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Gheg Albanian (aln). The dataset contains 11 hours of recorded speech (11 hours validated) from 14 speakers.
Betawi - Betawi (bew): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Betawi (bew). The dataset contains 11 hours of recorded speech (11 hours validated) from 21 speakers.
Bukusu - Bukusu (bxk): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Bukusu (bxk). The dataset contains 2934 clips representing 15 hours of recorded speech (11 hours validated) from 27 speakers.
Cypriot Greek - Cypriot Greek (el-CY): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Cypriot Greek (el-CY). The dataset contains 1284 clips representing 11 hours of recorded speech (11 hours validated) from 10 speakers.
Wixárika - Wixárika (hch): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Wixárika (hch). The dataset contains 1553 clips representing 11 hours of recorded speech (11 hours validated) from 10 speakers.
Nubi - Nubi (kcn): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Nubi (kcn). The dataset contains 2719 clips representing 15 hours of recorded speech (10 hours validated) from 26 speakers.
Konzo - Konzo (koo): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Konzo (koo). The dataset contains 3255 clips representing 15 hours of recorded speech (11 hours validated) from 28 speakers.
Lendu - Lendu (led): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Lendu (led). The dataset contains 2882 clips representing 16 hours of recorded speech (11 hours validated) from 26 speakers.
Kenyi - Kenyi (lke): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Kenyi (lke). The dataset contains 2791 clips representing 13 hours of recorded speech (11 hours validated) from 26 speakers.
Thur - Thur (lth): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Thur (lth). The dataset contains 3238 clips representing 34 hours of recorded speech (11 hours validated) from 29 speakers.
Mixteco Yucuhiti - Southwestern Tlaxiaco Mixtec (meh): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Southwestern Tlaxiaco Mixtec (meh). The dataset contains 1057 clips representing 11 hours of recorded speech (11 hours validated) from 16 speakers.
Jñatjo - Michoacán Mazahua (mmc): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Michoacán Mazahua (mmc). The dataset contains 12 hours of recorded speech (12 hours validated) from 12 speakers.
Western Penan - Western Penan (pne): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Western Penan (pne). The dataset contains 2630 clips representing 13 hours of recorded speech (13 hours validated) from 24 speakers.
Ruuli - Ruuli (ruc): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Ruuli (ruc). The dataset contains 2868 clips representing 18 hours of recorded speech (11 hours validated) from 26 speakers.
Amba - Amba (rwm): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Amba (rwm). The dataset contains 2443 clips representing 14 hours of recorded speech (11 hours validated) from 21 speakers.
Scots - Scots (sco): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Scots (sco). The dataset contains 715 clips representing 12 hours of recorded speech (11 hours validated) from 21 speakers.
Toba Qom - Toba Qom (tob): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Toba Qom (tob). The dataset contains 1611 clips representing 11 hours of recorded speech (11 hours validated) from 25 speakers.
Papantla Totonac - Papantla Totonac (top): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Papantla Totonac (top). The dataset contains 411 clips representing 11 hours of recorded speech (11 hours validated) from 10 speakers.
Rutoro - Rutoro (ttj): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Rutoro (ttj). The dataset contains 3113 clips representing 17 hours of recorded speech (11 hours validated) from 26 speakers.
Kuku - Kuku (ukv): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Kuku (ukv). The dataset contains 2586 clips representing 12 hours of recorded speech (11 hours validated) from 22 speakers.
Stats
The following table gives the amount of data (in hours and minutes) in the train and dev splits for each language. Note that not all of this data corresponds to validated transcriptions.
Language | Train | Dev |
---|---|---|
aln | 6h 40m | 2h 14m |
bew | 5h 47m | 3h 4m |
bxk | 10h 14m | 1h 58m |
cgg | 8h 2m | 1h 58m |
el-CY | 6h 47m | 2h 2m |
hch | 6h 17m | 1h 43m |
kcn | 10h 29m | 1h 47m |
koo | 11h 26m | 1h 51m |
led | 11h 56m | 1h 35m |
lke | 8h 30m | 2h 7m |
lth | 12h | 1h 36m |
meh | 6h 43m | 1h 45m |
mmc | 7h 11m | 2h 16m |
pne | 8h 20m | 2h 2m |
ruc | 14h 13m | 1h 50m |
rwm | 10h | 1h 53m |
sco | 6h 50m | 1h 44m |
tob | 4h 27m | 2h 43m |
top | 5h 37m | 2h 21m |
ttj | 12h 25m | 2h 12m |
ukv | 8h 29m | 1h 35m |
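To compare splits numerically, the duration cells can be converted to minutes. The following is a minimal sketch rather than part of the official tooling: the helper name `to_minutes` is hypothetical, and the three rows are copied from the table above purely for illustration.

```python
import re


def to_minutes(cell: str) -> int:
    """Convert a table cell such as '6h 40m' or '12h' into whole minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*", cell)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {cell!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


# A few rows copied from the table above: (language, train, dev).
rows = [
    ("aln", "6h 40m", "2h 14m"),
    ("lth", "12h", "1h 36m"),
    ("tob", "4h 27m", "2h 43m"),
]

for lang, train, dev in rows:
    total = to_minutes(train) + to_minutes(dev)
    dev_share = to_minutes(dev) / total
    print(f"{lang}: {total} min in total, dev share {dev_share:.2f}")
```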
Recommended Post-Processing
There are a few details about this data that are worth paying attention to:
The "votes" column indicates whether a given transcription has been approved/validated. Not all of the transcriptions in this dataset have been validated, and in such cases the transcription is not guaranteed to be correct or accurate. Nonetheless, it is possible that such transcriptions (and corresponding audio) are useful.
Check the "duration" column. Some audio clips are listed with a duration of 0; these should be excluded.
You may want to remove symbols representing disfluencies (e.g. "...") or unintelligible speech (e.g. "(???)"). We will remove such symbols from the test data.
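The three recommendations above could be applied roughly as follows. This is a sketch under stated assumptions, not the organisers' pipeline: the column names "votes" and "duration" come from this datasheet, but the "transcription" column name, the rule that a row counts as validated when its vote count is at least `min_votes`, and the exact marker patterns are all assumptions that should be checked against the actual files.

```python
import re

# Assumed marker patterns, based on the examples "..." and "(???)" given above.
DISFLUENCY = re.compile(r"\.{3,}")
UNINTELLIGIBLE = re.compile(r"\(\?+\)")


def clean_transcription(text: str) -> str:
    """Remove disfluency and unintelligible-speech markers, then tidy spacing."""
    text = DISFLUENCY.sub(" ", text)
    text = UNINTELLIGIBLE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def postprocess(rows, validated_only=False, min_votes=1):
    """Yield cleaned copies of rows, skipping clips with a duration of 0.

    rows: iterable of dicts with "votes", "duration" and "transcription" keys
          ("transcription" is a hypothetical column name).
    validated_only: if True, also skip rows with fewer than min_votes votes
          (an assumed reading of the "votes" column).
    """
    for row in rows:
        if float(row["duration"] or 0) <= 0:
            continue  # zero-duration clips should be excluded
        if validated_only and float(row["votes"] or 0) < min_votes:
            continue  # unvalidated transcription, dropped only on request
        cleaned = dict(row)
        cleaned["transcription"] = clean_transcription(row["transcription"])
        yield cleaned


# Made-up rows for illustration only; they are not taken from the dataset.
sample_rows = [
    {"votes": "2", "duration": "5120", "transcription": "hello ... there (???) friend"},
    {"votes": "0", "duration": "0", "transcription": "zero-length clip"},
    {"votes": "0", "duration": "3000", "transcription": "unvalidated text"},
]

for row in postprocess(sample_rows):
    print(row["transcription"])
```

Unvalidated rows are kept by default because, as noted above, they may still be useful; pass `validated_only=True` to keep only rows that meet the assumed vote threshold. If the metadata ships as a tab-separated file, rows in this shape could be read with `csv.DictReader(f, delimiter="\t")`.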
License
This dataset is released under the Creative Commons Zero (CC-0) license. By downloading this data you agree:
not to determine the identity of speakers in the dataset
not to re-host or re-share this dataset.