Common Voice Spontaneous Speech 3.0 - Ligurian
License:
CC0-1.0
Steward:
Common Voice
Task: ASR
Release Date: 3/22/2026
Format: MP3
Size: 48.36 MB
Description
A collection of spontaneous responses to questions in Ligurian (Ligure).
Specifics
Considerations
Restrictions/Special Constraints
None provided.
Forbidden Usage
It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.
Processes
Intended Use
This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.
Metadata
Ligure — Ligurian (lij)
This datasheet is for sps-corpus-3.0-2026-03-09 of the Mozilla Common Voice Spontaneous Speech dataset for Ligurian [Ligure - lij]. The dataset contains 294 clips representing 2.36 hours of recorded speech (1.65 hours validated) from 5 speakers.
Data splits for modelling
The dataset clips are categorised by transcription status and training-set assignment. The following tables summarise the distribution.
Audio clips
| Bucket | Clips | % |
|---|---|---|
| Transcribed & Validated | 223 | 75.9% |
| Transcribed & Pending | 13 | 4.4% |
| Not transcribed | 58 | 19.7% |
Training splits
| Bucket | Clips | % |
|---|---|---|
| Train | 0 | 0.0% |
| Dev | 0 | 0.0% |
| Test | 0 | 0.0% |
| Unassigned | 294 | 100.0% |
Training split coverage: 0 of 223 transcribed & validated clips (0.0%)
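The percentages above follow directly from the clip counts. As a minimal sketch, using only the counts reported in this datasheet, they can be recomputed as shown below. The helper name `percent` and the dictionary names are illustrative assumptions, not part of any Common Voice tooling.

```python
# Clip counts copied from the tables in this datasheet.
audio_buckets = {
    "Transcribed & Validated": 223,
    "Transcribed & Pending": 13,
    "Not transcribed": 58,
}
split_buckets = {"Train": 0, "Dev": 0, "Test": 0, "Unassigned": 294}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1) if whole else 0.0


total_clips = sum(audio_buckets.values())
audio_pct = {name: percent(n, total_clips) for name, n in audio_buckets.items()}

# Split coverage: clips assigned to train/dev/test, relative to the
# transcribed & validated clips.
assigned = sum(n for name, n in split_buckets.items() if name != "Unassigned")
coverage = percent(assigned, audio_buckets["Transcribed & Validated"])

print(total_clips, audio_pct, coverage)
```

Because no clips are assigned to a split in this release, anyone training or evaluating a model on this data would presumably need to define their own train/dev/test partition.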
Transcriptions
Transcription status
Percentages are relative to the 236 transcribed clips (223 validated plus 13 pending). The Edited bucket appears to overlap with the other two, so the column does not sum to 100%.
| Bucket | Clips | % |
|---|---|---|
| Validated | 223 | 94.5% |
| Pending | 13 | 5.5% |
| Edited | 57 | 24.1% |
Samples
Questions
There follows a randomly selected sample of questions used in the corpus.
Inta teu coltua, chi o l’é ch’o stà derê a-e persoñe ançiañe pe aggiuttâle?
Fanni unna lista di çinque spòrt ciù pratticæ inta teu çittæ ò region.
Quæ son i pesci che ti mangi ciù de spesso?
Te vëgne in cheu quarcheduña de ciante che se coltivan inta teu region?
Segondo ti, perché i turisti vëgnan à vixitâ a teu region?
Responses
There follows a randomly selected sample of transcribed responses from the corpus.
Aloa, beseugna vedde pe primma cösa cöse voemmo dî con a mæ region. Perché se parlemmo da Liguria, aloa lì à mi me pâ che che e gente eh se mettan de cöse eh dignitose, nette, ma magara sensa voei attrae tròppo l’attençion di atri. Incangio se pensemmo à de atre regioin in Italia eh pe exempio no sò Milan eh magara inte quelli pòsti lie e gente eh dan un pittin ciù a mente a-a mòdda. Se incangio parlemmo da Califòrnia, aloa chì se vedde se vedde pròpio de tutto. Eh ti væ a-o supermercou e e gh’é gh’é e gente che se mettan, che che son lie co-o co-o pigiama e e pantofoe.
No saviæ, penso che seggian mâ de scheña, mâ de steumago, problemi do figæto...
Aloa, eh, ghe n’é un muggio de röba da ammiâ à Zena. Ghe saieiva da anâ, tanto in into çentro stòrico ghe saieiva da giâ pe coscì pe-i pe-i carroggi. Eh ba- bastieiva pe passâ, ma an- ma ascì ciù de unna giornâ euh solo che ammiâ eh in via Garibaldi, Palasso Rosso, Palasso Gianco, pöi anæ anâ un pö ciù in là e vedde Palasso Spinola, eh solo che quello eh se gh’é brutto tempo euh se peu anâ à vedde quelli e o l’é ascì eh eh importante o Galata, che anche lì gh’é un muggio de röba bella. Dapeu eh abbasta ascì à fâ un gio, perché o i carroggi son un un museo à çê averto. Dapeu se un o gh’à coæ, peu piggiâ ascì a a funicolare, e an-, se s’a fonçioña, e anâ in sciô Righi e pe pe vedde Zena da l’erto, ma ascì se peu anâ in Castelletto, piggiâ o l’ascensore e che se vedde in Zena inte un mòddo belliscimo. Euh, ghe n’é da giâ pe pe coscì.
Aloa eh inta region ghe n’é tante ciante velenose. Se... dimmoghe quelle che son inti giardin eh son quelle ciù, che se vedde ciù de spesso son i oliandri, ma dapeu gh’é ascì o tascio e o ricin. E de de fonzi... fonzi... fonzi ghe n’é tanti, ghe n’é veram- gh’é l’amanita fallòide, ch’a l’é quella che te fa pròpio moî, ma dapeu eh gh’é ascì e diette, che oua no se veddan quæxi ciù, ma ean eh te favan anâ a-o leugo e... ma, ma tanto eh? E dapeu gh’ean ghe n’é un muggio tra e combette ghe n’é de quelle che son che fan pròpio mâ. E scì ma gh’é... l’elenco o saieiva longo.
Mah chì à dî a veitæ gh’é fiña difficoltæ à dî quæ seggian e tradiçioin tipiche da region, perché tante son inventæ oua. No saviæ. Son son in scî pissi de de scentâ areo e tradiçioin.
Fields
Each row of a tsv file represents a single audio clip, and contains the following information:
- `client_id` - hashed UUID of a given user
- `audio_id` - numeric id for audio file
- `audio_file` - audio file name
- `duration_ms` - duration of audio in milliseconds
- `prompt_id` - numeric id for prompt
- `prompt` - question for user
- `transcription` - transcription of the audio response
- `votes` - number of people who approved a given transcript
- `age` - age of the speaker [1]
- `gender` - gender of the speaker [1]
- `language` - language name
- `split` - for data modelling, the subset of the data this clip belongs to
- `char_per_sec` - number of characters of transcription per second of audio
- `quality_tags` - automated assessments of the transcription-audio pair, separated by `|`:
  - `transcription-length` - under 3 characters per second
  - `speech-rate` - over 30 characters per second
  - `short-audio` - audio length under 2 seconds
  - `long-audio` - audio length over 5 minutes
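The field descriptions above can be illustrated with a short parsing sketch. It reads a tab-separated sample, splits `quality_tags` on `|`, and recomputes the tags from the stated thresholds. The sample rows and file names are invented for illustration, and the helper `derive_tags` is hypothetical. How the dataset itself counts characters (for example, whether spaces are included) and how it represents an empty tag list are assumptions here.

```python
import csv
import io

# Invented sample rows using a subset of the column names listed above.
# Real files contain more columns and real values.
SAMPLE_TSV = (
    "audio_file\tduration_ms\ttranscription\tquality_tags\n"
    "clip_a.mp3\t1500\taaaaaaaaaa\tshort-audio\n"
    "clip_b.mp3\t20000\t" + "a" * 40 + "\ttranscription-length\n"
    "clip_c.mp3\t10000\t" + "a" * 100 + "\t\n"
)


def derive_tags(duration_ms: int, transcription: str) -> list[str]:
    """Recompute quality tags from the thresholds in the field list.

    Assumption: characters per second is len(transcription) / seconds.
    """
    seconds = duration_ms / 1000
    cps = len(transcription) / seconds if seconds else 0.0
    tags = []
    if cps < 3:
        tags.append("transcription-length")
    if cps > 30:
        tags.append("speech-rate")
    if seconds < 2:
        tags.append("short-audio")
    if seconds > 5 * 60:
        tags.append("long-audio")
    return tags


# QUOTE_NONE keeps any quotation marks in transcriptions as literal text.
reader = csv.DictReader(
    io.StringIO(SAMPLE_TSV), delimiter="\t", quoting=csv.QUOTE_NONE
)
rows = list(reader)
for row in rows:
    # An empty quality_tags cell is treated as "no tags" (assumption).
    row["tags"] = [t for t in row["quality_tags"].split("|") if t]
    row["derived"] = derive_tags(int(row["duration_ms"]), row["transcription"])

for row in rows:
    print(row["audio_file"], row["tags"], row["derived"])
```

For a real file, the `io.StringIO` object would be replaced by an open file handle with `encoding="utf-8"`, since the transcriptions contain non-ASCII Ligurian characters.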
Get involved
Community links
Discussions
Contribute
Licence
This dataset is released under the Creative Commons Zero (CC0 1.0) licence. By downloading this data you agree not to attempt to determine the identity of speakers in the dataset.
Footnotes
[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.