Common Voice Spontaneous Speech 2.0 - Toba Qom
License:
CC0-1.0
Steward:
Common Voice
Task: ASR
Release Date: 12/5/2025
Format: MP3
Size: 172.41 MB
Description
A collection of spontaneous spoken phrases in Toba Qom.
Specifics
Considerations
Forbidden Usage
It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.
Processes
Intended Use
This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.
Metadata
Toba Qom — Toba Qom (tob)
This datasheet is for version 2.0 of the Mozilla Common Voice Spontaneous Speech dataset
for Toba Qom (tob). The dataset contains 1573 clips representing 12 hours of recorded
speech (11 hours validated) from 25 speakers.
Language
The Toba Qom language is an endangered language spoken in the Gran Chaco, a region spanning Argentina, Paraguay and Bolivia. According to official demographic data provided by the Argentinian state, the Qom population is estimated at 80,000, of whom approximately 49% are speakers of the oral form of the language. The term "qom" describes a population that has traditionally been organised into multiple extended families or groups. These groups, traditionally hunter-gatherers, share the language and the sociocultural traits that are essential to Qom culture.
Variants
The contributors to this corpus originate from Chaco and Formosa provinces in Argentina. This area encompasses four ethnodialectal subregions with distinct self-identification terms (Messineo, 1991) [1].
| Area | Province | Locations | Variant (self-identification) |
|---|---|---|---|
| Northwest | Chaco | El Colchón, El Espinillo and the surroundings of the Bermejo River | dapigemlʔek |
| Northcenter | Chaco | Pampa del Indio | noʔolgaGanaq |
| Southcenter | Chaco | Sáenz Peña, Machahay, Quitilipi | lʔañaGashek |
| Southeast | Chaco, Eastern Formosa | Las Palmas, Clorinda | takshek |
For further information, see Buckwalter (2001) [2], Messineo (1991) [1] and Messineo (2003) [3].
Data splits for modelling
| Split | Count |
|---|---|
| Train | 946 |
| Test | 367 |
| Dev | 341 |
Transcriptions
Prompts: 136
Duration: 11:02:08 [h:m:s]
Avg. Transcription Len: 133
Avg. Duration: 25.26 [s]
Valid Duration: 38546.136 [s]
Total hours: 11.04 [h]
Valid hours: 10.71 [h]
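The duration figures above can be cross-checked with a little arithmetic. The following is a minimal sketch for illustration only; the helper name `hms_to_hours` is hypothetical and not part of any Common Voice tooling.

```python
def hms_to_hours(hms: str) -> float:
    """Convert an 'h:m:s' string to fractional hours."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours + minutes / 60 + seconds / 3600


# Values copied from the statistics above.
total_hours = hms_to_hours("11:02:08")  # total duration
valid_hours = 38546.136 / 3600          # valid duration, seconds to hours

print(round(total_hours, 2))  # 11.04
print(round(valid_hours, 2))  # 10.71
```

Both results agree with the reported total and valid hours.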
This corpus consists of approximately 1350 utterances, totalling about 10 hours of transcribed speech. The dataset does not focus on any particular domain or topic. The question set consists of 150 instances, covering a wide range of general topics about lifestyles and culture (hobbies, education, traditions, nature, food, society, technology, relationships, art, etc.). Speakers responded to each question based on their personal beliefs, experiences, and knowledge, mainly to describe their culture or to share their personal opinion about how they interact within society (e.g. how they would find a lawyer, how they make a medical appointment, etc.).
To the best of the authors' knowledge, the dataset does not contain any data that might be considered sensitive.
The data collection involved a coordinator (a PhD student), a linguist known by the Qom contributors (researcher), and three field-work assistants (linguists). The data was collected mainly by the Qom contributors using their own phones at home, after receiving technical training. A small proportion of data was recorded in an academic setting (e.g. research institute) during the training phase.
Writing system
The transcriptions follow the orthographic system proposed by Buckwalter (2001) [2].
Symbol table
a c ch d e g hu i j l ll m n ñ o p q qu r s sh t u v x y ỹ ’
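When preparing transcriptions for ASR training, it can help to check text against the symbol table. The sketch below splits text into table symbols by greedy longest match and reports any character not in the table (for example, an ASCII apostrophe where the table lists ’). Treating the multi-letter entries (`ch`, `hu`, `ll`, `qu`, `sh`) as single units, lower-casing the text, and the function name `tokenize` are assumptions made for this illustration, not part of the dataset's tooling.

```python
import unicodedata

# Symbols copied from the symbol table above. Treating multi-letter
# entries as single units is an assumption of this sketch.
RAW_SYMBOLS = "a c ch d e g hu i j l ll m n ñ o p q qu r s sh t u v x y ỹ ’".split()
# NFC-normalise so that precomposed and combining forms compare equal.
SYMBOL_SET = {unicodedata.normalize("NFC", s) for s in RAW_SYMBOLS}
MAX_LEN = max(len(symbol) for symbol in SYMBOL_SET)


def tokenize(text: str) -> tuple[list[str], list[str]]:
    """Split text into table symbols by greedy longest match.

    Returns (tokens, unknown); unknown lists characters not in the table.
    Whitespace is skipped and the text is lower-cased first.
    """
    text = unicodedata.normalize("NFC", text.lower())
    tokens, unknown = [], []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try the longest candidate first, then shorter ones.
        for length in range(MAX_LEN, 0, -1):
            candidate = text[i:i + length]
            if candidate in SYMBOL_SET:
                tokens.append(candidate)
                i += len(candidate)
                break
        else:
            # No symbol starts here: record the character and move on.
            unknown.append(text[i])
            i += 1
    return tokens, unknown


print(tokenize("shiỹaxaua"))  # (['sh', 'i', 'ỹ', 'a', 'x', 'a', 'u', 'a'], [])
print(tokenize("huo'o"))      # (['hu', 'o', 'o'], ["'"])
```

A non-empty second list signals characters that may need normalising before the text is used to build a model vocabulary.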
Samples
Questions
There follows a randomly selected sample of questions used in the corpus.
¿Negue’t ca ỹataqta ’anqopita can ñaq edaxaatac?
¿Negue’t na dalaxaic no’onatac ’auaỹaten da ’au’ot nagui?
¿’Eetec ca n’qochenaxac na shiỹaxaupi yi ’adma’ nquicapiguic da qalota?
Eetec na napo yi arma?
¿Negue’t taxa ca ‘auauotaique da lmenec ca qoueta’a ’adma’?
Responses
There follows a randomly selected sample of transcribed responses from the corpus.
Aýem ýataqta yoqopita can ñaq sedaxaatac satapoigui asa'aso paxaguenaxaqui dam ýataqta yoqopita naxa da qai'axaia asa'aso huo'o da l-lamaxa ichoxot da ne'enapi napaxaguetacpi ñaqpiolec cad'ac nmatec cada'ac da'ashe nache da qai'axaia asa'aso natoina nache huo'o da lpe'e na ñaqpiolec naxa eso so ýoqta yoqopita
Aýem na saýaten nagui saýaten da ño'oxosheguem na noýic
Da huo'o ña nqui'c shiyaxaua cha'aye nachena na jec ilotaique cam l'onatac dam yanatac cam ilotaique nache ishet da machiguiñi
yi imaa' chaco onaxaic shinatap .cahioloqta da nopo
Yi 'ima' qaiuen aca menaxanaxaqui da ýotta'a't na lmenec nallec
Recommended post-processing
To be updated in the next release. Contact the author for details.
Fields
Each row of a tsv file represents a single audio clip, and contains the following information:
- `client_id` - hashed UUID of a given user
- `audio_id` - numeric id for audio file
- `audio_file` - audio file name
- `duration_ms` - duration of audio in milliseconds
- `prompt_id` - numeric id for prompt
- `prompt` - question for user
- `transcription` - transcription of the audio response
- `votes` - number of people who approved a given transcript
- `age` - age of the speaker [4]
- `gender` - gender of the speaker [4]
- `language` - language name
- `split` - for data modelling, which subset of the data this clip belongs to
- `char_per_sec` - how many characters of transcription per second of audio
- `quality_tags` - some automated assessment of the transcription-audio pair, separated by `|`
  - `transcription-length` - under 3 characters per second
  - `speech-rate` - over 30 characters per second
  - `short-audio` - audio length under 2 seconds
  - `long-audio` - audio length over 30 seconds
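The field descriptions above can be turned into a small reading routine. The following is a minimal sketch using only the Python standard library. The two sample rows are made up for illustration, and the helper names `read_rows` and `derive_tags` are hypothetical. `derive_tags` reconstructs the tags from the thresholds listed above, counting every character of the transcription, including spaces; that counting rule is an assumption, since how the dataset's own pipeline computes the tags is not documented here.

```python
import csv
import io

# Two made-up rows for illustration only; real files contain all the
# columns listed above and real values.
SAMPLE_TSV = (
    "audio_file\tduration_ms\ttranscription\tsplit\tquality_tags\n"
    "clip_a.mp3\t1500\tna\tTrain\ttranscription-length|short-audio\n"
    "clip_b.mp3\t10000\t" + "a" * 100 + "\tTest\t\n"
)


def derive_tags(transcription: str, duration_ms: int) -> list[str]:
    """Recompute quality tags from the thresholds in the field list."""
    seconds = duration_ms / 1000
    # Assumption: every character counts, including spaces.
    chars_per_sec = len(transcription) / seconds if seconds > 0 else 0.0
    tags = []
    if chars_per_sec < 3:
        tags.append("transcription-length")
    if chars_per_sec > 30:
        tags.append("speech-rate")
    if seconds < 2:
        tags.append("short-audio")
    if seconds > 30:
        tags.append("long-audio")
    return tags


def read_rows(handle) -> list[dict]:
    """Parse tab-separated rows and split quality_tags on '|'."""
    rows = []
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["duration_ms"] = int(row["duration_ms"])
        row["quality_tags"] = [t for t in row["quality_tags"].split("|") if t]
        rows.append(row)
    return rows


rows = read_rows(io.StringIO(SAMPLE_TSV))
for row in rows:
    recomputed = derive_tags(row["transcription"], row["duration_ms"])
    print(row["audio_file"], row["quality_tags"], recomputed)
```

To read a real file, pass an open file handle (opened with `encoding="utf-8"` and `newline=""`) to `read_rows` instead of the in-memory sample.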
Get involved!
Community links
Contribute
Acknowledgements
Datasheet authors
Belu Ticona <mticonao@gmu.edu>
Paola Cúneo
Antonios Anastasopoulos
Citation guidelines
B. Ticona, P. Cúneo, A. Anastasopoulos. “Datasheet of Spontaneous Speech Corpus for Qom - Mozilla Common Voice”. Revised on Aug 29th, 2025. [Publication Date].
Funding
The speaker collaborators were funded by Mozilla Common Voice. The project coordinator was partially funded by the US NSF grants 2346334 and 2439202.
This dataset was partially funded by the Open Multilingual Speech Fund managed by Mozilla Common Voice.
Licence
This dataset is released under the Creative Commons Zero (CC0-1.0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.
Footnotes
[1] Messineo, Cristina. 1991. Variantes dialectales del complejo lingüístico toba. Hacia una nueva carta étnica del Gran Chaco II: 12-22. Las Lomitas: Centro del Hombre Antiguo Chaqueño.
[2] Buckwalter, Alberto. 2001 [1980]. Vocabulario toba. Formosa / Indiana: Equipo Menonita / Mennonite Board of Missions. Ed. Revisada.
[3] Messineo, Cristina. 2003. Lengua Toba (guaycurú). Aspectos gramaticales y discursivos. Lincom Studies in Native American Linguistics 48. Munich: Lincom Europa.
[4] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
