Common Voice Spontaneous Speech 3.0 - Ligurian

License: CC0-1.0

Steward: Common Voice

Task: ASR

Release Date: 3/22/2026

Format: MP3

Size: 48.36 MB

Description

A collection of spontaneous responses to questions in Ligurian (Ligure).

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

None provided.

Forbidden Usage

It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.

Processes

Intended Use

This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.

Metadata

Ligure — Ligurian (`lij`)

This datasheet is for sps-corpus-3.0-2026-03-09 of the Mozilla Common Voice Spontaneous Speech dataset for Ligurian [Ligure - lij]. The dataset contains 294 clips representing 2.36 hours of recorded speech (1.65 hours validated) from 5 speakers.

Data splits for modelling

The dataset clips are categorised by transcription status and training-set assignment. The following tables summarise the distribution.

Audio clips

Bucket	Clips	%
Transcribed & Validated	223	75.9%
Transcribed & Pending	13	4.4%
Not transcribed	58	19.7%
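
The percentages in the table above are each bucket's share of all clips in the release. A minimal sketch of that arithmetic, using only the counts reported in this datasheet (the variable names are illustrative, not part of the dataset):

```python
# Clip counts copied from the "Audio clips" table in this datasheet.
audio_clips = {
    "Transcribed & Validated": 223,
    "Transcribed & Pending": 13,
    "Not transcribed": 58,
}

total = sum(audio_clips.values())  # all clips in the release

# Share of each bucket, rounded to one decimal place as in the table.
shares = {
    bucket: f"{100 * count / total:.1f}%"
    for bucket, count in audio_clips.items()
}

for bucket, share in shares.items():
    print(f"{bucket}\t{audio_clips[bucket]}\t{share}")
```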

Training splits

Bucket	Clips	%
Train	0	0.0%
Dev	0	0.0%
Test	0	0.0%
Unassigned	294	100.0%

Training split coverage: 0 of 223 transcribed & validated clips (0.0%)
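
Because no clips are assigned to a split in this release, anyone training or evaluating a model has to partition the data themselves. The sketch below shows one possible approach, not a method prescribed by Common Voice: it keeps every speaker in exactly one split so that test speakers are never seen in training. The function name `speaker_disjoint_splits` and the "fewest clips go to test" rule are assumptions made for illustration. Only the `client_id` field comes from this datasheet.

```python
from collections import Counter


def speaker_disjoint_splits(rows):
    """Assign each speaker to one split (hypothetical helper).

    rows: iterable of dicts that have at least a "client_id" key.
    Returns a dict mapping client_id -> "train" | "dev" | "test".
    """
    counts = Counter(row["client_id"] for row in rows)
    if len(counts) < 3:
        raise ValueError("need at least three speakers for train/dev/test")

    # Rank speakers from fewest to most clips; ties are broken by ID
    # so the result is deterministic.
    ranked = sorted(counts, key=lambda cid: (counts[cid], cid))

    # Assumption: the two speakers with the fewest clips are held out,
    # which leaves the most audio available for training.
    assignment = {ranked[0]: "test", ranked[1]: "dev"}
    for cid in ranked[2:]:
        assignment[cid] = "train"
    return assignment


# Toy rows with made-up speaker IDs, for demonstration only.
toy_rows = (
    [{"client_id": "a"}] * 5
    + [{"client_id": "b"}] * 2
    + [{"client_id": "c"}] * 3
    + [{"client_id": "d"}] * 1
)
toy_assignment = speaker_disjoint_splits(toy_rows)
print(toy_assignment)
```

With only 5 speakers in the corpus, any held-out split rests on one or two voices, so evaluation figures from such a split should be read with caution.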

Transcriptions

Transcription status

Bucket	Clips	%
Validated	223	94.5%
Pending	13	5.5%
Edited	57	24.1%

Samples

Questions

There follows a randomly selected sample of questions used in the corpus.

Inta teu coltua, chi o l’é ch’o stà derê a-e persoñe ançiañe pe aggiuttâle?
Fanni unna lista di çinque spòrt ciù pratticæ inta teu çittæ ò region.
Quæ son i pesci che ti mangi ciù de spesso?
Te vëgne in cheu quarcheduña de ciante che se coltivan inta teu region?
Segondo ti, perché i turisti vëgnan à vixitâ a teu region?

Responses

There follows a randomly selected sample of transcribed responses from the corpus.

Aloa, beseugna vedde pe primma cösa cöse voemmo dî con a mæ region. Perché se parlemmo da Liguria, aloa lì à mi me pâ che che e gente eh se mettan de cöse eh dignitose, nette, ma magara sensa voei attrae tròppo l’attençion di atri. Incangio se pensemmo à de atre regioin in Italia eh pe exempio no sò Milan eh magara inte quelli pòsti lie e gente eh dan un pittin ciù a mente a-a mòdda. Se incangio parlemmo da Califòrnia, aloa chì se vedde se vedde pròpio de tutto. Eh ti væ a-o supermercou e e gh’é gh’é e gente che se mettan, che che son lie co-o co-o pigiama e e pantofoe.
No saviæ, penso che seggian mâ de scheña, mâ de steumago, problemi do figæto...
Aloa, eh, ghe n’é un muggio de röba da ammiâ à Zena. Ghe saieiva da anâ, tanto in into çentro stòrico ghe saieiva da giâ pe coscì pe-i pe-i carroggi. Eh ba- bastieiva pe passâ, ma an- ma ascì ciù de unna giornâ euh solo che ammiâ eh in via Garibaldi, Palasso Rosso, Palasso Gianco, pöi anæ anâ un pö ciù in là e vedde Palasso Spinola, eh solo che quello eh se gh’é brutto tempo euh se peu anâ à vedde quelli e o l’é ascì eh eh importante o Galata, che anche lì gh’é un muggio de röba bella. Dapeu eh abbasta ascì à fâ un gio, perché o i carroggi son un un museo à çê averto. Dapeu se un o gh’à coæ, peu piggiâ ascì a a funicolare, e an-, se s’a fonçioña, e anâ in sciô Righi e pe pe vedde Zena da l’erto, ma ascì se peu anâ in Castelletto, piggiâ o l’ascensore e che se vedde in Zena inte un mòddo belliscimo. Euh, ghe n’é da giâ pe pe coscì.
Aloa eh inta region ghe n’é tante ciante velenose. Se... dimmoghe quelle che son inti giardin eh son quelle ciù, che se vedde ciù de spesso son i oliandri, ma dapeu gh’é ascì o tascio e o ricin. E de de fonzi... fonzi... fonzi ghe n’é tanti, ghe n’é veram- gh’é l’amanita fallòide, ch’a l’é quella che te fa pròpio moî, ma dapeu eh gh’é ascì e diette, che oua no se veddan quæxi ciù, ma ean eh te favan anâ a-o leugo e... ma, ma tanto eh? E dapeu gh’ean ghe n’é un muggio tra e combette ghe n’é de quelle che son che fan pròpio mâ. E scì ma gh’é... l’elenco o saieiva longo.
Mah chì à dî a veitæ gh’é fiña difficoltæ à dî quæ seggian e tradiçioin tipiche da region, perché tante son inventæ oua. No saviæ. Son son in scî pissi de de scentâ areo e tradiçioin.

Fields

Each row of a TSV file represents a single audio clip and contains the following information:

client_id - hashed UUID of a given user
audio_id - numeric id for audio file
audio_file - audio file name
duration_ms - duration of audio in milliseconds
prompt_id - numeric id for prompt
prompt - question for user
transcription - transcription of the audio response
votes - number of people who approved a given transcript
age - age of the speaker (see footnote)
gender - gender of the speaker (see footnote)
language - language name
split - for data modelling, the subset of the data this clip belongs to
char_per_sec - how many characters of transcription per second of audio
quality_tags - automated quality flags for the transcription-audio pair, separated by |
- transcription-length - fewer than 3 characters of transcription per second of audio
- speech-rate - more than 30 characters of transcription per second of audio
- short-audio - audio length under 2 seconds
- long-audio - audio length over 5 minutes
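
The fields and tag thresholds above can be exercised with a short script. This is a minimal sketch under stated assumptions: the helper names `read_clips` and `quality_tags` are hypothetical, the sample rows are invented, the thresholds are treated as strict inequalities, and the character count includes spaces and punctuation. The datasheet does not specify how the published tags are computed, so this may differ from the pipeline that produced them.

```python
import csv
import io


def read_clips(tsv_file):
    """Read tab-separated rows into dicts keyed by the header names."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def quality_tags(duration_ms, transcription):
    """Recompute tags from the thresholds listed in the datasheet.

    Assumption: rate-based tags are skipped when there is no transcription.
    """
    seconds = duration_ms / 1000
    tags = []
    if seconds > 0 and transcription:
        chars_per_sec = len(transcription) / seconds
        if chars_per_sec < 3:
            tags.append("transcription-length")
        if chars_per_sec > 30:
            tags.append("speech-rate")
    if seconds < 2:
        tags.append("short-audio")
    if seconds > 5 * 60:
        tags.append("long-audio")
    return tags


# Invented sample using a subset of the documented column names.
sample = io.StringIO(
    "audio_file\tduration_ms\ttranscription\n"
    "clip-a.mp3\t1500\tScì.\n"
    "clip-b.mp3\t20000\t" + "a" * 100 + "\n"
)

for row in read_clips(sample):
    tags = quality_tags(int(row["duration_ms"]), row["transcription"])
    # Join with | to mirror the separator used in the quality_tags field.
    print(row["audio_file"], "|".join(tags))
```

To run this against the real data, pass an open file handle for the dataset's TSV file (opened with `encoding="utf-8"` and `newline=""`) to `read_clips` in place of the in-memory sample.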

Licence

This dataset is released under the Creative Commons Zero v1.0 Universal (CC0-1.0) licence. By downloading this data, you agree not to attempt to determine the identity of speakers in the dataset.

Footnotes

For a full list of age, gender, and accent options, see the demographics spec. These are reported only if the speaker opted in to provide that information.