Mozilla Common Voice Spontaneous Speech ASR Shared Task Train/Dev Data

Locale: mul

Size: 4.30 GB

Task: ASR

Format: mp3

License: CC-0


Data for the Mozilla Common Voice Spontaneous Speech ASR Shared Task

This datasheet covers the set of Mozilla Common Voice spontaneous speech datasets to be used in the Mozilla Data Collective Shared Task on Spontaneous Speech. It contains a train set and a validation (dev) set for each of 21 languages. The dataset for each included language is also available individually on Mozilla Data Collective.

The test sets will be released separately at a later date.

The following is the list of languages included, with links to each language's corresponding datasheet on Mozilla Data Collective (MDC):

  • Gegnisht — Gheg Albanian (aln): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Gheg Albanian (aln). The dataset contains 11 hours of recorded speech (11 hours validated) from 14 speakers.

  • Betawi — Betawi (bew): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Betawi (bew). The dataset contains 11 hours of recorded speech (11 hours validated) from 21 speakers.

  • Bukusu — Bukusu (bxk): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Bukusu (bxk). The dataset contains 2934 clips representing 15 hours of recorded speech (11 hours validated) from 27 speakers.

  • Cypriot Greek — Cypriot Greek (el-CY): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Cypriot Greek (el-CY). The dataset contains 1284 clips representing 11 hours of recorded speech (11 hours validated) from 10 speakers.

  • Wixárika — Wixárika (hch): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Wixárika (hch). The dataset contains 1553 clips representing 11 hours of recorded speech (11 hours validated) from 10 speakers.

  • Nubi — Nubi (kcn): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Nubi (kcn). The dataset contains 2719 clips representing 15 hours of recorded speech (10 hours validated) from 26 speakers.

  • Konzo — Konzo (koo): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Konzo (koo). The dataset contains 3255 clips representing 15 hours of recorded speech (11 hours validated) from 28 speakers.

  • Lendu — Lendu (led): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Lendu (led). The dataset contains 2882 clips representing 16 hours of recorded speech (11 hours validated) from 26 speakers.

  • Kenyi — Kenyi (lke): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Kenyi (lke). The dataset contains 2791 clips representing 13 hours of recorded speech (11 hours validated) from 26 speakers.

  • Thur — Thur (lth): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Thur (lth). The dataset contains 3238 clips representing 34 hours of recorded speech (11 hours validated) from 29 speakers.

  • Mixteco Yucuhiti — Southwestern Tlaxiaco Mixtec (meh): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Southwestern Tlaxiaco Mixtec (meh). The dataset contains 1057 clips representing 11 hours of recorded speech (11 hours validated) from 16 speakers.

  • Jñatjo — Michoacán Mazahua (mmc): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Michoacán Mazahua (mmc). The dataset contains 12 hours of recorded speech (12 hours validated) from 12 speakers.

  • Western Penan — Western Penan (pne): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Western Penan (pne). The dataset contains 2630 clips representing 13 hours of recorded speech (13 hours validated) from 24 speakers.

  • Ruuli — Ruuli (ruc): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Ruuli (ruc). The dataset contains 2868 clips representing 18 hours of recorded speech (11 hours validated) from 26 speakers.

  • Amba — Amba (rwm): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Amba (rwm). The dataset contains 2443 clips representing 14 hours of recorded speech (11 hours validated) from 21 speakers.

  • Scots — Scots (sco): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Scots (sco). The dataset contains 715 clips representing 12 hours of recorded speech (11 hours validated) from 21 speakers.

  • Toba Qom — Toba Qom (tob): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Toba Qom (tob). The dataset contains 1611 clips representing 11 hours of recorded speech (11 hours validated) from 25 speakers.

  • Papantla Totonac — Papantla Totonac (top): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Papantla Totonac (top). The dataset contains 411 clips representing 11 hours of recorded speech (11 hours validated) from 10 speakers.

  • Rutoro — Rutoro (ttj): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Rutoro (ttj). The dataset contains 3113 clips representing 17 hours of recorded speech (11 hours validated) from 26 speakers.

  • Kuku — Kuku (ukv): This datasheet is for version 1.0 of the Mozilla Common Voice Spontaneous Speech dataset for Kuku (ukv). The dataset contains 2586 clips representing 12 hours of recorded speech (11 hours validated) from 22 speakers.

Stats

The following table provides the amount of data (in hours and minutes) in the train and dev splits for each language. Note that not all of this data corresponds to validated transcriptions.

Language   Train      Dev
aln        6h 40m     2h 14m
bew        5h 47m     3h 4m
bxk        10h 14m    1h 58m
cgg        8h 2m      1h 58m
el-CY      6h 47m     2h 2m
hch        6h 17m     1h 43m
kcn        10h 29m    1h 47m
koo        11h 26m    1h 51m
led        11h 56m    1h 35m
lke        8h 30m     2h 7m
lth        12h        1h 36m
meh        6h 43m     1h 45m
mmc        7h 11m     2h 16m
pne        8h 20m     2h 2m
ruc        14h 13m    1h 50m
rwm        10h        1h 53m
sco        6h 50m     1h 44m
tob        4h 27m     2h 43m
top        5h 37m     2h 21m
ttj        12h 25m    2h 12m
ukv        8h 29m     1h 35m
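
To compare split sizes across languages, it can help to convert the durations in the table to minutes. The following is a minimal sketch, assuming the duration strings follow the "Xh Ym" pattern used in the table (with either part optional, as in "12h"). The helper names to_minutes and dev_share are illustrative and not part of the dataset or any official tooling.

```python
import re

# Matches strings such as "6h 40m", "12h", or "45m" (pattern assumed from the table).
_DURATION = re.compile(r"^\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*$")


def to_minutes(text):
    """Convert a duration string like "6h 40m" to a whole number of minutes."""
    match = _DURATION.match(text)
    if match is None or not (match.group(1) or match.group(2)):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


def dev_share(train, dev):
    """Fraction of the combined train + dev audio that falls in the dev split."""
    train_min = to_minutes(train)
    dev_min = to_minutes(dev)
    return dev_min / (train_min + dev_min)


if __name__ == "__main__":
    # Values for aln taken from the table above.
    print(to_minutes("6h 40m"), to_minutes("2h 14m"))
    print(round(dev_share("6h 40m", "2h 14m"), 3))
```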

Recommended Post-Processing

There are a few details about this data that are worth paying attention to:

  • The "votes" column indicates whether a given transcription has been approved/validated. Not all of the transcriptions in this dataset have been validated, and unvalidated transcriptions are not guaranteed to be correct or accurate. Nonetheless, such transcriptions (and their corresponding audio) may still be useful.

  • Check the "duration" column. Some audio clips are listed with a duration of 0. These should be excluded.

  • You may want to remove symbols representing disfluencies (e.g. "...") or unintelligible speech (e.g. "(???)"). We will remove such symbols from the test data.
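
The three recommendations above can be sketched as a small filtering pass. This is only an illustration under stated assumptions: the "votes" and "duration" column names come from this datasheet, but the name of the transcription column ("transcription" here) is hypothetical, and so is the rule that a positive number in "votes" means a validated transcription. Check both against the actual metadata files before relying on it.

```python
import re

# Hypothetical name for the text column; adjust to match the real metadata.
TEXT_COLUMN = "transcription"

# Markers named in the datasheet: "..." (disfluency) and "(???)" (unintelligible).
_MARKERS = re.compile(r"\(\?\?\?\)|\.\.\.")


def clean_text(text):
    """Remove the disfluency/unintelligible markers and collapse whitespace."""
    return " ".join(_MARKERS.sub(" ", text).split())


def _positive(value):
    """True when value parses to a number greater than zero."""
    try:
        return float(value) > 0
    except (TypeError, ValueError):
        return False


def has_audio(row):
    """Rows listed with a duration of 0 (or no usable duration) are excluded."""
    return _positive(row.get("duration"))


def is_validated(row):
    """Assumption: a positive "votes" value marks a validated transcription."""
    return _positive(row.get("votes"))


def postprocess(rows, validated_only=False):
    """Return cleaned copies of the rows that pass the duration (and votes) checks."""
    kept = []
    for row in rows:
        if not has_audio(row):
            continue
        if validated_only and not is_validated(row):
            continue
        cleaned = dict(row)
        cleaned[TEXT_COLUMN] = clean_text(row.get(TEXT_COLUMN) or "")
        kept.append(cleaned)
    return kept
```

Here rows can be any iterable of dictionaries, for example the output of csv.DictReader over a language's metadata file. Keeping validated_only off by default reflects the note above that unvalidated transcriptions may still be useful for training.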

License

This dataset is released under the Creative Commons Zero (CC-0) license. By downloading this data you agree:

  • not to determine the identity of speakers in the dataset

  • not to re-host or re-share this dataset.
