Common Voice Spontaneous Speech 2.0 - Toba Qom

License:

CC0-1.0

Steward:

Common Voice

Task: ASR

Release Date: 12/5/2025

Format: MP3

Size: 172.41 MB


Description

A collection of spontaneous spoken phrases in Toba Qom.

Considerations

Forbidden Usage

It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.

Processes

Intended Use

This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.

Metadata

Toba Qom — Toba Qom (tob)

This datasheet is for version 2.0 of the Mozilla Common Voice Spontaneous Speech dataset for Toba Qom (tob). The dataset contains 1573 clips representing 12 hours of recorded speech (11 hours validated) from 25 speakers.

Language

Toba Qom is an endangered language spoken in the Gran Chaco, a region spanning Argentina, Paraguay and Bolivia. According to official demographic data from the Argentinian state, the Qom population is estimated at 80,000, of whom approximately 49% speak the language in its oral form. The term "qom" describes a population that has traditionally been organised into multiple extended families or groups. These groups, traditionally hunter-gatherers, share the language and the sociocultural traits that are essential to Qom culture.

Variants

The contributors to this corpus originate from Chaco and Formosa provinces in Argentina. This area encompasses four ethnodialectal subregions with distinct self-identification terms (Messineo, 1991; see footnote 1).

| Area | Province | Locations | Variant (self-identification) |
|---|---|---|---|
| Northwest | Chaco | El Colchón, El Espinillo and the Bermejo river’s surroundings | dapigemlʔek |
| Northcenter | Chaco | Pampa del Indio | noʔolgaGanaq |
| Southcenter | Chaco | Sáenz Peña, Machahay, Quitilipi | lʔañaGashek |
| Southeast | Chaco, Eastern Formosa | Las Palmas, Clorinda | takshek |

For further information, see footnotes 1, 2 and 3.

Data splits for modelling

| Split | Count |
|---|---|
| Train | 946 |
| Test | 367 |
| Dev | 341 |

Transcriptions

  • Prompts: 136

  • Duration: 11:02:08 [h:m:s]

  • Avg. Transcription Len: 133

  • Avg. Duration: 25.26[s]

  • Valid Duration: 38546.136[s]

  • Total hours: 11.04[h]

  • Valid hours: 10.71[h]
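The hour figures above follow from the second-level figures by plain unit conversion. The sketch below reproduces that arithmetic; the helper names `hms_to_seconds` and `seconds_to_hours` are made up for this illustration and are not part of any Common Voice tooling.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'h:m:s' string such as '11:02:08' to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hours(seconds: float) -> float:
    """Convert seconds to hours, rounded to two decimals as in the list above."""
    return round(seconds / 3600, 2)


# Figures taken from the statistics list above.
total_seconds = hms_to_seconds("11:02:08")
total_hours = seconds_to_hours(total_seconds)
valid_hours = seconds_to_hours(38546.136)

print(total_seconds, total_hours, valid_hours)
```

With the values listed, the total duration of 11:02:08 corresponds to 11.04 hours and the valid duration of 38546.136 seconds to 10.71 hours.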

This corpus consists of approximately 1350 utterances, totalling about 10 hours of transcribed speech. The dataset does not focus on any particular domain or topic. The question set consists of 150 instances, covering a wide range of general topics about lifestyles and culture (hobbies, education, traditions, nature, food, society, technology, relationships, art, etc.). Speakers responded to each question based on their personal beliefs, experiences, and knowledge, mainly to describe their culture or to share their personal opinion about how they interact within society (e.g. how they would find a lawyer or how they make a medical appointment).

The dataset does not contain any data that might be considered sensitive for others, to the best of the author's knowledge.

The data collection involved a coordinator (a PhD student), a linguist known by the Qom contributors (researcher), and three field-work assistants (linguists). The data was collected mainly by the Qom contributors using their own phones at home, after receiving technical training. A small proportion of data was recorded in an academic setting (e.g. research institute) during the training phase.

Writing system

The transcriptions follow the orthographic system proposed by Buckwalter (2001; see footnote 2).

Symbol table

a c ch d e g hu i j l ll m n ñ o p q qu r s sh t u v x y ỹ ’
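Because the symbol table contains two-letter symbols (ch, hu, ll, qu, sh), splitting a transcription into symbols requires more than iterating over characters. The following is a minimal sketch of a greedy longest-match splitter that also reports characters outside the table. It rests on several assumptions that the datasheet does not confirm: that a two-letter sequence from the table always counts as a single symbol, that text should be lower-cased and NFC-normalised first, and that the function name `tokenize` and its return shape are suitable (they are invented here).

```python
import unicodedata

# Symbols copied from the symbol table above.
SYMBOLS = [
    "a", "c", "ch", "d", "e", "g", "hu", "i", "j", "l", "ll", "m", "n", "ñ",
    "o", "p", "q", "qu", "r", "s", "sh", "t", "u", "v", "x", "y", "ỹ", "’",
]

# Normalise to NFC and try longer symbols first so that "ch" wins over "c".
_BY_LENGTH = sorted(
    (unicodedata.normalize("NFC", symbol) for symbol in SYMBOLS),
    key=len,
    reverse=True,
)


def tokenize(word: str) -> tuple[list[str], list[str]]:
    """Split one word into table symbols; return (symbols, unknown characters).

    Assumption: greedy longest match is an acceptable segmentation rule.
    """
    text = unicodedata.normalize("NFC", word.lower())
    tokens: list[str] = []
    unknown: list[str] = []
    position = 0
    while position < len(text):
        for symbol in _BY_LENGTH:
            if text.startswith(symbol, position):
                tokens.append(symbol)
                position += len(symbol)
                break
        else:
            # No symbol matched: record the character and move on.
            unknown.append(text[position])
            position += 1
    return tokens, unknown


print(tokenize("nache"))
```

A check of this kind could be used to flag characters that fall outside the table, such as a straight apostrophe used in place of ’.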

Samples

Questions

There follows a randomly selected sample of questions used in the corpus.

¿Negue’t ca ỹataqta ’anqopita can ñaq edaxaatac?
¿Negue’t na dalaxaic no’onatac ’auaỹaten da ’au’ot nagui?
¿’Eetec ca n’qochenaxac na shiỹaxaupi yi ’adma’ nquicapiguic da qalota?
Eetec na napo yi arma?
¿Negue’t taxa ca ‘auauotaique da lmenec ca qoueta’a ’adma’?

Responses

There follows a randomly selected sample of transcribed responses from the corpus.

Aýem ýataqta yoqopita can ñaq sedaxaatac satapoigui asa'aso paxaguenaxaqui dam ýataqta yoqopita naxa da qai'axaia asa'aso huo'o da l-lamaxa ichoxot da ne'enapi napaxaguetacpi ñaqpiolec cad'ac nmatec cada'ac da'ashe nache da qai'axaia asa'aso natoina nache huo'o da lpe'e na ñaqpiolec naxa eso so ýoqta yoqopita
Aýem na saýaten nagui saýaten da ño'oxosheguem na noýic
Da huo'o ña nqui'c shiyaxaua cha'aye nachena na jec ilotaique cam l'onatac dam yanatac cam ilotaique nache ishet da machiguiñi
yi imaa' chaco onaxaic shinatap .cahioloqta da nopo
Yi 'ima' qaiuen aca menaxanaxaqui da ýotta'a't na lmenec nallec

Recommended post-processing

To be updated in the next release. Contact the author for details.

Fields

Each row of a TSV file represents a single audio clip and contains the following information:

  • client_id - hashed UUID of a given user

  • audio_id - numeric id for audio file

  • audio_file - audio file name

  • duration_ms - duration of audio in milliseconds

  • prompt_id - numeric id for prompt

  • prompt - question for user

  • transcription - transcription of the audio response

  • votes - number of people who approved a given transcript

  • age - age of the speaker (see footnote 4)

  • gender - gender of the speaker (see footnote 4)

  • language - language name

  • split - the subset of the data (train, test or dev) that this clip belongs to, for modelling purposes

  • char_per_sec - how many characters of transcription per second of audio

  • quality_tags - some automated assessment of the transcription--audio pair, separated by |

    • transcription-length - fewer than 3 characters of transcription per second of audio

    • speech-rate - more than 30 characters of transcription per second of audio

    • short-audio - audio length under 2 seconds

    • long-audio - audio length over 30 seconds
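The four quality tags can be read as simple threshold checks on the `duration_ms` and `transcription` fields. The sketch below recomputes them for a single clip using the thresholds listed above. It is an illustration rather than the pipeline that produced the released tags: the function name `quality_tags` is hypothetical, and counting characters with `len(transcription)` (spaces included) is an assumption.

```python
def quality_tags(transcription: str, duration_ms: int) -> str:
    """Return a '|'-separated tag string for one clip.

    Assumption: characters per second is len(transcription) divided by the
    clip duration in seconds; the released data may count differently.
    """
    seconds = duration_ms / 1000
    tags = []
    if seconds > 0:
        chars_per_sec = len(transcription) / seconds
        if chars_per_sec < 3:
            tags.append("transcription-length")
        if chars_per_sec > 30:
            tags.append("speech-rate")
    if seconds < 2:
        tags.append("short-audio")
    if seconds > 30:
        tags.append("long-audio")
    return "|".join(tags)


# A 40-second clip with a 200-character transcription (made-up values).
print(quality_tags("a" * 200, 40_000))
```

A clip with the average transcription length and duration reported in this datasheet (133 characters, 25.26 seconds) would receive no tags under these rules.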

Get involved!

Community links

Contribute

Acknowledgements

Datasheet authors

Citation guidelines

B. Ticona, P. Cuneo, A. Anastasopoulos. “Datasheet of Spontaneous Speech Corpus for Qom - Mozilla Common Voice”. Revised on Aug 29th, 2025. [Publication Date].

Funding

The speaker collaborators were funded by Mozilla Common Voice. The project coordinator was partially funded by the US NSF grants 2346334 and 2439202.

This dataset was partially funded by the Open Multilingual Speech Fund managed by Mozilla Common Voice.

Licence

This dataset is released under the Creative Commons Zero (CC0-1.0) licence. By downloading this data you agree not to attempt to determine the identity of speakers in the dataset.

Footnotes

  1. Messineo, Cristina. 1991. Variantes dialectales del complejo lingüístico toba. Hacia una nueva carta étnica del Gran Chaco II: 12-22. Las Lomitas: Centro del Hombre Antiguo Chaqueño.

  2. Buckwalter, Alberto. 2001 [1980]. Vocabulario toba. Formosa / Indiana: Equipo Menonita / Mennonite Board of Missions. Ed. Revisada.

  3. Messineo, Cristina. 2003. Lengua Toba (guaycurú). Aspectos gramaticales y discursivos. Lincom Studies in Native American Linguistics 48. Munich: Lincom Europa.

  4. For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.