Common Voice Scripted Speech 23.0 - Cornish

Locale: kw

Size: 260.70 MB

Task: ASR

Format: MP3

License: CC-0


Kernowek — Cornish (kw)

This datasheet is for version 23.0 of the Mozilla Common Voice Scripted Speech dataset for Cornish (kw). The dataset contains 13 hours of recorded speech (13 hours validated) from 10 speakers.

Language

Cornish, or Kernewek, is a Brythonic language, alongside Breton and Welsh, and belongs to the Celtic branch of the Indo-European language family. It is an indigenous language of the United Kingdom, with most speakers located in Cornwall. In the 2021 UK Census, 567 people reported Cornish as their main language. UNESCO has classified its status as "severely endangered".

Demographic information

The dataset includes the following distribution of age and gender.

Gender

Self-declared gender information. Percentages refer to the share of clips annotated with each gender.

Gender               Percentage
Undefined            66.0%
Female / Feminine    34.0%

Age

Self-declared age information. Percentages refer to the share of clips annotated with each age band.

Age Band     Percentage
Undefined    12.0%
Forties      34.0%
Fifties      47.0%
Sixties      2.0%
Seventies    5.0%

Text corpus

The dataset contains 10.8 validated hours of speech from 10 unique contributors.

Type                 Count    Hours
Validated Clips      9,357    10.8
Invalidated Clips    0        0.00
Total Clips          9,357    10.8
  • Average sentence length (tokens): 6.4

  • Average sentence length (characters): 31
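The two averages above can be recomputed from a list of sentences. The following is a minimal sketch, assuming tokens are whitespace-separated words and that punctuation counts towards the character length; the datasheet does not state how its figures were computed, and the function name is illustrative.

```python
def sentence_length_stats(sentences):
    """Return (mean tokens, mean characters) over the non-empty sentences."""
    cleaned = [s.strip() for s in sentences if s.strip()]
    if not cleaned:
        raise ValueError("no non-empty sentences given")
    # Assumption: a token is any whitespace-separated chunk.
    mean_tokens = sum(len(s.split()) for s in cleaned) / len(cleaned)
    # Assumption: spaces and punctuation count as characters.
    mean_chars = sum(len(s) for s in cleaned) / len(cleaned)
    return mean_tokens, mean_chars


# Two sentences taken from the sample in this datasheet.
example = ["A yllyn ni redya hemma?", "Marthys ens i."]
mean_tokens, mean_chars = sentence_length_stats(example)
print(f"tokens: {mean_tokens:.1f}, characters: {mean_chars:.1f}")
```

Running the same function over the full sentence list gives figures to compare with those reported here. Small differences may come from a different tokenization.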

Writing system

Cornish has several writing systems in use. The majority of this dataset uses the Standard Written Form, established in 2008.

Symbol table

The dataset uses the following characters: ' - ! , . ? a b c d e f g h i j k l m n o p r s t u v w x y z
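The symbol table can be used to flag sentences that contain unexpected characters. The following is a minimal sketch, assuming that a space is also allowed and that uppercase letters should be compared in lowercase (the sample sentences contain capitals, while the table lists lowercase letters only). The names `ALLOWED` and `unexpected_characters` are illustrative.

```python
# Characters from the symbol table, plus a space (assumed, since the
# sentences consist of space-separated words). Note there is no "q".
ALLOWED = set("'-!,.?abcdefghijklmnoprstuvwxyz ")


def unexpected_characters(sentence):
    """Return the sorted characters of `sentence` that are not in ALLOWED."""
    # Assumption: compare case-insensitively by lowercasing first.
    return sorted({ch for ch in sentence.lower() if ch not in ALLOWED})


# A sentence from the sample passes; a curly apostrophe is flagged.
print(unexpected_characters("Esos. Yth esos ta ena y'n kornel."))
print(unexpected_characters("y\u2019n kornel"))
```

Any character returned by the function is worth inspecting by hand before deciding whether to repair or discard the sentence.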

Sample

There follows a randomly selected sample of five sentences from the corpus.

A yllyn ni redya hemma?
Marthys ens i.
Esos. Yth esos ta ena y'n kornel.
A wrussyn ni diwrosa yn uskis?
Dha leveryans yw nebes da.

Automatic random samples

Yma agan lyvrow genen.
Yma agan lyvrow genowgh.
Gwrussys. Ty a wrug kana fest yn teg.
Yth esons i ryb an wariva.
Gwell via genev a pe yeynna.

Sources

The text for this dataset comes from the following sources:

  • IndyLan Cornish course. Author: Cornish Language Office. Standard Written Form.

  • Individual sentences submitted by users through the Mozilla Common Voice interface (public domain)

Text domains

  • General — The majority of this dataset focuses on conversational phrases with the intention to cover a broad range of grammatical points.

Recommended post-processing

  • Check the Cornish text for Unicode errors affecting the apostrophe. Every apostrophe should be the character '.
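One way to apply this recommendation is to map look-alike characters to the plain apostrophe. The following is a minimal sketch; the list of variants is an assumption, since the datasheet does not say which characters actually occur in the data, and should be adjusted after inspecting the corpus.

```python
# Hypothetical list of look-alike characters (not taken from the datasheet):
# left/right single quotation marks, modifier letter apostrophe,
# grave accent and acute accent.
APOSTROPHE_VARIANTS = "\u2018\u2019\u02bc\u0060\u00b4"
_TO_PLAIN_APOSTROPHE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})


def normalize_apostrophes(text):
    """Replace every assumed apostrophe variant in `text` with U+0027."""
    return text.translate(_TO_PLAIN_APOSTROPHE)


print(normalize_apostrophes("Yth esos ta ena y\u2019n kornel."))
```

Text that already uses the plain apostrophe passes through unchanged, so the function can safely be applied to every sentence.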

Community links

Contribute

Datasheet authors

Funding

This dataset was partially funded by the Open Multilingual Speech Fund.

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.