Common Voice Spontaneous Speech 2.0 - Toba Qom

Specifics

Licensing

CC0 1.0 Universal

https://creativecommons.org/publicdomain/zero/1.0/legalcode

Considerations

Forbidden Usage

It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.

Processes

Intended Use

This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.

Metadata

Toba Qom — Toba Qom (`tob`)

This datasheet is for version 2.0 of the Mozilla Common Voice Spontaneous Speech dataset for Toba Qom (tob). The dataset contains 1573 clips representing 12 hours of recorded speech (11 hours validated) from 25 speakers.

Language

The Toba Qom language is an endangered language spoken in the Gran Chaco, a region spanning Argentina, Paraguay and Bolivia. According to official demographic data provided by the Argentinian state, the Qom population is estimated at 80,000, of whom approximately 49% are speakers of the oral form of the language. The term "qom" describes a population that has traditionally been organised into multiple extended families or groups. These groups, which are traditionally hunter-gatherers, share the language and the sociocultural traits that are essential to Qom culture.

Variants

The contributors to this corpus originate from Chaco and Formosa provinces in Argentina. This area encompasses four ethnodialectal subregions with distinct self-identification terms (Messineo, 1991) [1].

Area	Province	Locations	Variant (self-identification)
Northwest	Chaco	El Colchón, El Espinillo and the Bermejo river’s surroundings	dapigemlʔek
Northcenter	Chaco	Pampa del Indio	noʔolgaGanaq
Southcenter	Chaco	Sáenz Peña, Machagai, Quitilipi	lʔañaGashek
Southeast	Chaco, Eastern Formosa	Las Palmas, Clorinda	takshek

For further information, see [1], [2] and [3].

Data splits for modelling

Split	Count
Train	946
Test	367
Dev	341
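
As a minimal sketch, rows could be grouped per split using the `split` field described under Fields. It assumes the split values are spelled like the labels in the table above; the exact spelling and casing in the released files is not confirmed here, so the comparison is case-insensitive. The function name `group_by_split` and the inline sample rows are hypothetical and not taken from the corpus.

```python
import csv
import io
from collections import defaultdict


def group_by_split(tsv_text):
    """Group TSV rows by their `split` column (lower-cased, whitespace-stripped)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    groups = defaultdict(list)
    for row in reader:
        groups[row["split"].strip().lower()].append(row)
    return dict(groups)


# Hypothetical rows for illustration only; real files carry more columns.
sample = (
    "audio_file\tsplit\n"
    "clip_a.mp3\ttrain\n"
    "clip_b.mp3\tTest\n"
    "clip_c.mp3\tdev\n"
    "clip_d.mp3\ttrain\n"
)
groups = group_by_split(sample)
print({name: len(rows) for name, rows in groups.items()})
```

For a real file, the same function could be fed the text read from the TSV on disk.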

Transcriptions

Prompts: 136
Duration: 11:02:08 [h:m:s]
Avg. Transcription Len: 133
Avg. Duration: 25.26 [s]
Valid Duration: 38546.136 [s]
Total hours: 11.04 [h]
Valid hours: 10.71 [h]
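
The figures above can be cross-checked with a little arithmetic. The sketch below converts the reported duration to seconds and hours, converts the valid duration to hours, and estimates the clip count. The clip estimate assumes that the average duration is simply the total duration divided by the number of clips, which the datasheet does not state explicitly. The helper name `hms_to_seconds` is illustrative.

```python
def hms_to_seconds(hms):
    """Convert an 'h:m:s' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


total_seconds = hms_to_seconds("11:02:08")  # reported Duration
total_hours = total_seconds / 3600          # compare with Total hours
valid_hours = 38546.136 / 3600              # Valid Duration, compare with Valid hours
approx_clips = total_seconds / 25.26        # assumption: Duration / Avg. Duration

print(round(total_hours, 2), round(valid_hours, 2), round(approx_clips))
# prints: 11.04 10.71 1573
```

Under that assumption the estimate agrees with the 1573 clips reported in the dataset summary.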

This corpus consists of approximately 1350 utterances, totalling 10 hours of transcribed speech. The dataset does not focus on any particular domain or topic. The question set consists of 150 instances, covering a wide range of general topics about lifestyles and culture (hobbies, education, traditions, nature, food, society, technology, relationships, art, etc.). Speakers responded to each question based on their personal beliefs, experiences, and knowledge, mainly to describe their culture or share their personal opinions about how they interact within society (e.g. how they would find a lawyer, how they make a medical appointment, etc.).

To the best of the authors' knowledge, the dataset does not contain any data that might be considered sensitive for others.

The data collection involved a coordinator (a PhD student), a linguist known by the Qom contributors (researcher), and three field-work assistants (linguists). The data was collected mainly by the Qom contributors using their own phones at home, after receiving technical training. A small proportion of data was recorded in an academic setting (e.g. research institute) during the training phase.

Writing system

The transcriptions follow the orthographic system proposed by Buckwalter (2001) [2].

Symbol table

a c ch d e g hu i j l ll m n ñ o p q qu r s sh t u v x y ỹ ’
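
Because the table lists several multi-letter symbols (ch, hu, ll, qu, sh), splitting a transcription character by character does not yield these units. The following is a minimal sketch of a greedy longest-match segmenter over the symbols above. Greedy matching is an assumption made for illustration, not a rule stated by the orthography, and the names `segment`, `INVENTORY` and `MAX_LEN` are hypothetical. Input is lower-cased and NFC-normalised so that a `y` followed by a combining tilde is treated as `ỹ`.

```python
import unicodedata

# Symbols copied from the table above; multi-letter entries count as one unit.
SYMBOLS = "a c ch d e g hu i j l ll m n ñ o p q qu r s sh t u v x y ỹ ’".split()
INVENTORY = {unicodedata.normalize("NFC", s) for s in SYMBOLS}
MAX_LEN = max(len(s) for s in INVENTORY)


def segment(word):
    """Split a word into symbols by greedy longest match.

    Returns (symbols, unknown), where `unknown` lists characters
    that are not in the symbol table.
    """
    text = unicodedata.normalize("NFC", word.lower())
    symbols, unknown = [], []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(MAX_LEN, 0, -1):
            chunk = text[i:i + size]
            if chunk in INVENTORY:
                symbols.append(chunk)
                i += len(chunk)
                break
        else:
            # No symbol starts here: record the character and move on.
            unknown.append(text[i])
            i += 1
    return symbols, unknown


print(segment("shiỹaxaupi"))
print(segment("huo'o"))
```

Characters outside the table, such as the ASCII apostrophe in the second call, are returned in the `unknown` list rather than silently dropped, which can help spot text that departs from the symbol table.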

Samples

Questions

There follows a randomly selected sample of questions used in the corpus.

¿Negue’t ca ỹataqta ’anqopita can ñaq edaxaatac?
¿Negue’t na dalaxaic no’onatac ’auaỹaten da ’au’ot nagui?
¿’Eetec ca n’qochenaxac na shiỹaxaupi yi ’adma’ nquicapiguic da qalota?
Eetec na napo yi arma?
¿Negue’t taxa ca ‘auauotaique da lmenec ca qoueta’a ’adma’?

Responses

There follows a randomly selected sample of transcribed responses from the corpus.

Aýem ýataqta yoqopita can ñaq sedaxaatac satapoigui asa'aso paxaguenaxaqui dam ýataqta yoqopita naxa da qai'axaia asa'aso huo'o da l-lamaxa ichoxot da ne'enapi napaxaguetacpi ñaqpiolec cad'ac nmatec cada'ac da'ashe nache da qai'axaia asa'aso natoina nache huo'o da lpe'e na ñaqpiolec naxa eso so ýoqta yoqopita
Aýem na saýaten nagui saýaten da ño'oxosheguem na noýic
Da huo'o ña nqui'c shiyaxaua cha'aye nachena na jec ilotaique cam l'onatac dam yanatac cam ilotaique nache ishet da machiguiñi
yi imaa' chaco onaxaic shinatap .cahioloqta da nopo
Yi 'ima' qaiuen aca menaxanaxaqui da ýotta'a't na lmenec nallec

Fields

Each row of a tsv file represents a single audio clip, and contains the following information:

client_id - hashed UUID of a given user
audio_id - numeric id for audio file
audio_file - audio file name
duration_ms - duration of audio in milliseconds
prompt_id - numeric id for prompt
prompt - question for user
transcription - transcription of the audio response
votes - number of people who approved a given transcript
age - age of the speaker [4]
gender - gender of the speaker [4]
language - language name
split - for data modelling, which subset of the data this clip belongs to
char_per_sec - how many characters of transcription per second of audio
quality_tags - automated assessments of the transcription-audio pair, separated by |
- transcription-length - fewer than 3 characters per second
- speech-rate - more than 30 characters per second
- short-audio - audio length under 2 seconds
- long-audio - audio length over 30 seconds
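
The thresholds above can be sketched as a small function that recomputes the tags from a transcription and its duration. This is an illustration under stated assumptions, not the pipeline used to produce the released files: it assumes that characters per second counts every character of the transcription (spaces included), that tags are emitted in the order listed above, and that a clip with no tags has an empty field. The function names `quality_tags` and `parse_tags` are hypothetical.

```python
def quality_tags(transcription, duration_ms):
    """Recompute the four quality tags from the thresholds listed above."""
    seconds = duration_ms / 1000
    # Assumption: every character of the transcription counts, spaces included.
    chars_per_sec = len(transcription) / seconds if seconds > 0 else 0.0
    tags = []
    if chars_per_sec < 3:
        tags.append("transcription-length")
    if chars_per_sec > 30:
        tags.append("speech-rate")
    if seconds < 2:
        tags.append("short-audio")
    if seconds > 30:
        tags.append("long-audio")
    return "|".join(tags)


def parse_tags(field):
    """Split a `quality_tags` cell on '|' and drop empty entries."""
    return [tag for tag in field.split("|") if tag]


print(quality_tags("a" * 50, 40_000))  # slow and long clip
print(parse_tags("speech-rate|short-audio"))
```

Filtering out rows whose parsed tag list is non-empty is one possible way to keep only clips that pass all four checks.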

Get involved!

Community links

Common Voice translators on Pontoon

Contribute

Acknowledgements

Datasheet authors

Belu Ticona <mticonao@gmu.edu>
Paola Cúneo
Antonios Anastasopoulos

Citation guidelines

B. Ticona, P. Cúneo, A. Anastasopoulos. “Datasheet of Spontaneous Speech Corpus for Qom - Mozilla Common Voice”. Revised on Aug 29th, 2025. [Publication Date].

Funding

The speaker collaborators were funded by Mozilla Common Voice. The project coordinator was partially funded by the US NSF grants 2346334 and 2439202.

This dataset was partially funded by the Open Multilingual Speech Fund managed by Mozilla Common Voice.

Licence

This dataset is released under the Creative Commons Zero (CC0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.

Footnotes

1. Messineo, Cristina. 1991. Variantes dialectales del complejo lingüístico toba. Hacia una nueva carta étnica del Gran Chaco II: 12-22. Las Lomitas: Centro del Hombre Antiguo Chaqueño.
2. Buckwalter, Alberto. 2001 [1980]. Vocabulario toba. Formosa / Indiana: Equipo Menonita / Mennonite Board of Missions. Revised ed.
3. Messineo, Cristina. 2003. Lengua Toba (guaycurú). Aspectos gramaticales y discursivos. Lincom Studies in Native American Linguistics 48. Munich: Lincom Europa.
4. For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
