Common Voice Scripted Speech 25.0 - French

License: CC0-1.0

Steward: Common Voice

Task: ASR

Release Date: 3/25/2026

Format: MP3

Size: 28.39 GB

Description

A collection of read speech recordings in French (Français).

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

None provided.

Forbidden Usage

It is forbidden to attempt to determine the identity of speakers in the Common Voice datasets. It is forbidden to re-host or re-share this dataset.

Processes

Intended Use

This dataset is intended to be used for training and evaluating automatic speech recognition (ASR) models. It may also be used for applications relating to computer-aided language learning (CALL) and language or heritage revitalisation.

Metadata

Français — French (`fr`)

This datasheet is for cv-corpus-25.0-2026-03-09 of the Mozilla Common Voice Scripted Speech dataset for French [Français - fr]. The dataset contains 864,728 clips representing 1,209.71 hours of recorded speech (1,095.87 hours validated) from 21,003 speakers, recorded from a text corpus of 1,692,862 sentences.

Language

French is a Romance language. It is an official language in 26 countries and is spoken in around 50 countries.

Variants

Code	Variant	Clips	Speakers
fr-metro	Français de métropole	542,845 (62.8%)	4,639 (22.1%)
fr-europe	Français d'Europe	26,543 (3.1%)	442 (2.1%)
fr-namerica	Français d'Amérique du Nord	14,424 (1.7%)	309 (1.5%)
fr-safrica	Français d'Afrique subsaharienne et des îles africaines	2,066 (0.2%)	87 (0.4%)
fr-droum	Français des départements et régions d'outre-mer	1,936 (0.2%)	40 (0.2%)
fr-nafrica	Français du nord de l'Afrique	1,342 (0.2%)	61 (0.3%)
fr-samerica	Français d'Amérique du Sud et des Caraïbes	90 (0.0%)	6 (0.0%)
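
The count cells in the tables of this datasheet share one shape: a count with thousands separators followed by a percentage in parentheses, or a lone dash for zero. A minimal sketch of a parser for that shape follows. The helper name `parse_cell` is hypothetical and not part of any Common Voice tooling, and the two rows are copied from the Variants table above.

```python
import re

# Matches cells such as "542,845 (62.8%)"; a lone "-" is treated as zero.
CELL = re.compile(r"^([\d,]+) \(([\d.]+)%\)$")

def parse_cell(cell: str) -> tuple[int, float]:
    """Return (count, percent) for a table cell in this datasheet's format."""
    cell = cell.strip()
    if cell == "-":
        return 0, 0.0
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    return int(match.group(1).replace(",", "")), float(match.group(2))

# Two rows copied from the Variants table (tab-separated).
rows = [
    "fr-metro\tFrançais de métropole\t542,845 (62.8%)\t4,639 (22.1%)",
    "fr-samerica\tFrançais d'Amérique du Sud et des Caraïbes\t90 (0.0%)\t6 (0.0%)",
]

parsed = {}
for row in rows:
    code, _name, clips, speakers = row.split("\t")
    parsed[code] = {"clips": parse_cell(clips), "speakers": parse_cell(speakers)}

print(parsed["fr-metro"]["clips"])
```

The same helper should apply to the Accents, Gender, Age and Text domains tables, since they use the same cell layout.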

Accents

Code	Accent	Clips	Speakers
canada	Français du Canada	12,869 (1.5%)	275 (1.3%)
belgium	Français de Belgique	11,381 (1.3%)	226 (1.1%)
switzerland	Français de Suisse	5,916 (0.7%)	141 (0.7%)
united_states	Français des États-Unis	1,610 (0.2%)	41 (0.2%)
reunion	Français de La Réunion	1,307 (0.2%)	16 (0.1%)
benin	Français du Bénin	1,073 (0.1%)	7 (0.0%)
algeria	Français d’Algérie	1,070 (0.1%)	26 (0.1%)
germany	Français d’Allemagne	552 (0.1%)	26 (0.1%)
fr-metro-north	Français du nord de la France	535 (0.1%)	2 (0.0%)
united_kingdom	Français du Royaume-Uni	502 (0.1%)	25 (0.1%)
haiti	Français d’Haïti	498 (0.1%)	7 (0.0%)
madagascar	Français de Madagascar	283 (0.0%)	12 (0.1%)
fr-metro-south	Français du sud de la France	229 (0.0%)	9 (0.0%)
morocco	Français du Maroc	211 (0.0%)	30 (0.1%)
fr-metro-east	Français de l'est de la France	209 (0.0%)	3 (0.0%)
cote_d_ivoire	Français de Côte d’Ivoire	201 (0.0%)	18 (0.1%)
senegal	Français du Sénégal	197 (0.0%)	16 (0.1%)
french_guiana	Français de Guyane	188 (0.0%)	3 (0.0%)
guadeloupe	Français de Guadeloupe	175 (0.0%)	13 (0.1%)
italy	Français d’Italie	171 (0.0%)	9 (0.0%)
fr-metro-west	Français de l'ouest de la France	166 (0.0%)	7 (0.0%)
cameroon	Français du Cameroun	163 (0.0%)	16 (0.1%)
new_caledonia	Français de Nouvelle-Calédonie	159 (0.0%)	3 (0.0%)
romania	Français de Roumanie	150 (0.0%)	6 (0.0%)
tunisia	Français de Tunisie	121 (0.0%)	16 (0.1%)
monaco	Français de Monaco	111 (0.0%)	3 (0.0%)
netherlands	Français des Pays-Bas	101 (0.0%)	4 (0.0%)
martinique	Français de Martinique	100 (0.0%)	7 (0.0%)
congo_kinshasa	Français du Congo (Kinshasa)	45 (0.0%)	5 (0.0%)
mali	Français du Mali	39 (0.0%)	4 (0.0%)
luxembourg	Français du Luxembourg	20 (0.0%)	3 (0.0%)
st_pierre_et_miquelon	Français de Saint-Pierre-et-Miquelon	15 (0.0%)	1 (0.0%)
mayotte	Français de Mayotte	12 (0.0%)	1 (0.0%)
mauritius	Français de l’Île Maurice	10 (0.0%)	2 (0.0%)
-	Other	7,511 (0.9%)	244 (1.2%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	491,443 (56.8%)	3,878 (18.5%)
female_feminine	Female, feminine	92,369 (10.7%)	1,010 (4.8%)
transgender	Transgender	5 (0.0%)	1 (0.0%)
non-binary	Non-binary	249 (0.0%)	3 (0.0%)
do_not_wish_to_say	Prefer not to say	302 (0.0%)	4 (0.0%)
-	Unspecified	280,360 (32.4%)	16,786 (79.9%)

Gender declared: 584,368 of 864,728 clips (67.6%), 4,217 of 21,003 speakers (20.1%)

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	24,515 (2.8%)	444 (2.1%)
twenties	Twenties	147,928 (17.1%)	1,732 (8.2%)
thirties	Thirties	125,467 (14.5%)	1,156 (5.5%)
fourties	Forties	122,937 (14.2%)	855 (4.1%)
fifties	Fifties	81,444 (9.4%)	496 (2.4%)
sixties	Sixties	28,974 (3.4%)	326 (1.6%)
seventies	Seventies	9,224 (1.1%)	121 (0.6%)
eighties	Eighties	212 (0.0%)	7 (0.0%)
nineties	Nineties	5 (0.0%)	1 (0.0%)
-	Unspecified	324,022 (37.5%)	16,619 (79.1%)

Age declared: 540,706 of 864,728 clips (62.5%), 4,384 of 21,003 speakers (20.9%)
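
The clip-level coverage lines under the Gender and Age tables can be reproduced by summing every row except Unspecified and dividing by the total clip count. The sketch below does that with the clip counts copied from the two tables. The function name `coverage` is hypothetical.

```python
TOTAL_CLIPS = 864_728

# Clip counts copied from the Gender table, excluding "Unspecified".
gender_clips = {
    "male_masculine": 491_443,
    "female_feminine": 92_369,
    "transgender": 5,
    "non-binary": 249,
    "do_not_wish_to_say": 302,
}

# Clip counts copied from the Age table, excluding "Unspecified".
age_clips = {
    "teens": 24_515,
    "twenties": 147_928,
    "thirties": 125_467,
    "fourties": 122_937,
    "fifties": 81_444,
    "sixties": 28_974,
    "seventies": 9_224,
    "eighties": 212,
    "nineties": 5,
}

def coverage(declared: dict[str, int], total: int) -> tuple[int, float]:
    """Return the declared clip count and its share of the total, in percent."""
    declared_total = sum(declared.values())
    return declared_total, round(100 * declared_total / total, 1)

gender_declared = coverage(gender_clips, TOTAL_CLIPS)
age_declared = coverage(age_clips, TOTAL_CLIPS)
print(gender_declared, age_declared)
```

Only the clip columns are checked here. The speaker columns are not summed, because the datasheet does not say whether a speaker can appear in more than one row.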

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	783,357 (90.6%)
Invalidated	68,142 (7.9%)
Other	13,229 (1.5%)

Training splits

Split	Clips
Train	613,431 (78.3%)
Dev	16,201 (2.1%)
Test	16,201 (2.1%)

Training split coverage: 645,833 of 783,357 validated clips (82.4%)
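
The percentages in the Training splits table appear to be relative to the validated bucket rather than to all clips. The short check below tests that reading, which is an assumption, using the counts given above.

```python
VALIDATED = 783_357

# Clip counts copied from the Training splits table.
splits = {"train": 613_431, "dev": 16_201, "test": 16_201}

# Share of validated clips per split, rounded as in the table.
shares = {name: round(100 * count / VALIDATED, 1) for name, count in splits.items()}

# Clips assigned to any split, and their share of validated clips.
in_splits = sum(splits.values())
split_coverage = round(100 * in_splits / VALIDATED, 1)

# Validated clips that are in none of the three splits.
not_in_splits = VALIDATED - in_splits

print(shares, in_splits, split_coverage, not_in_splits)
```

Under that reading, 137,524 validated clips belong to none of the three splits.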

The dataset contains 783,357 validated, 68,142 invalidated, and 13,229 unresolved clips. The average clip duration is 5.036 seconds.
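
The average clip duration is consistent with dividing the total recorded hours by the total number of clips. That derivation is an assumption, not something the datasheet states. The sketch below also confirms that the three buckets add up to the full clip count.

```python
# Figures copied from this datasheet.
TOTAL_HOURS = 1209.71
TOTAL_CLIPS = 864_728
buckets = {"validated": 783_357, "invalidated": 68_142, "other": 13_229}

# The three buckets should partition the full set of clips.
bucket_total = sum(buckets.values())

# Average duration in seconds, assuming it is total duration / total clips.
avg_seconds = TOTAL_HOURS * 3600 / TOTAL_CLIPS

print(bucket_total, round(avg_seconds, 3))
```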

Text corpus

Validated sentences: 1,649,097

Category	Count
Unvalidated sentences	43,765
Pending sentences	43,638
Rejected sentences	127
Reported sentences	7,562

The corpus contains 1,692,862 sentences: 1,649,097 validated and 43,765 unvalidated (43,638 pending review, 127 rejected), with 7,562 reported for review.

Writing system

The French language uses the 26 letters of the Latin alphabet with the addition of two ligatures (æ, œ) and five diacritics.

Symbol table

a à â æ b c ç d e é è ê ë f g h i î ï j k l m n o ô œ p q r s t u ù û ü v w x y ÿ z
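
One use of the symbol table is to flag sentences containing letters outside the expected French inventory. The sketch below builds the inventory from the 26 base Latin letters plus the accented letters and ligatures listed in the table. The function name `unexpected_letters` is hypothetical, and the choice to ignore digits, spaces and punctuation is an assumption made for this illustration.

```python
import string
import unicodedata

# Accented letters and ligatures from the symbol table.
EXTRA = "à â æ ç é è ê ë î ï ô œ ù û ü ÿ".split()

# The 26 base Latin letters plus the extras above.
SYMBOLS = set(string.ascii_lowercase) | set(EXTRA)

def unexpected_letters(sentence: str) -> set[str]:
    """Return lower-cased letters in the sentence that are not in SYMBOLS."""
    # NFC keeps accented letters as single code points so they match the table.
    normalised = unicodedata.normalize("NFC", sentence).lower()
    return {ch for ch in normalised if ch.isalpha() and ch not in SYMBOLS}

print(unexpected_letters("Église mononef, elle ne présente aucun caractère particulier."))
print(unexpected_letters("Señor Müller"))
```

The first sentence comes from the sample in this datasheet and yields an empty set. The second is an invented string that contains `ñ`, which is not in the table.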

Sample

There follows a randomly selected sample of five sentences from the corpus.

Le canton de Tende était composé des communes de Tende et La Brigue.
Cette décision fait grand bruit.
Un paludier est un travailleur qui récolte le sel des marais salants.
Les jardins sont ouverts au public.
Église mononef, elle ne présente aucun caractère particulier.

Sources

Source	Sentences
wiki-2	719,731 (43.8%)
wiki-1	717,145 (43.7%)
sentence-collector	103,289 (6.3%)
issue2259_deleted_export_readd_fixed	62,385 (3.8%)
Other	39,075 (2.4%)

Text domains

Code	Domain	Clips	Speakers
general	General	70 (0.0%)	53 (0.3%)
agriculture_food	Agriculture and Food	-	-
automotive_transport	Automotive and Transport	1 (0.0%)	1 (0.0%)
finance	Finance	1 (0.0%)	1 (0.0%)
service_retail	Service and Retail	-	-
healthcare	Healthcare	5 (0.0%)	4 (0.0%)
history_law_government	History, Law and Government	19 (0.0%)	17 (0.1%)
media_entertainment	Media and Entertainment	17 (0.0%)	13 (0.1%)
nature_environment	Nature and Environment	8 (0.0%)	8 (0.0%)
news_current_affairs	News and Current Affairs	2 (0.0%)	2 (0.0%)
technology_robotics	Technology and Robotics	18 (0.0%)	12 (0.1%)
language_fundamentals	Language Fundamentals	7 (0.0%)	5 (0.0%)

Fields

Clips

Each row of a TSV file represents a single audio clip and contains the following information:

client_id - hashed UUID of a given user
path - relative path of the audio file
text - supposed transcription of the audio
up_votes - number of people who said audio matches the text
down_votes - number of people who said audio does not match text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
segment - if sentence belongs to a custom dataset segment, it will be listed here
prompt_upvotes - number of upvotes the sentence prompt received
prompt_reports - number of reports the sentence prompt received
is_edited - whether the clip's transcription has been edited
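
A minimal sketch of reading clip rows with the Python standard library follows. The two data rows are invented for illustration: the IDs, paths, vote counts and the `0` values for `is_edited` are assumptions, and real files may encode some fields differently. The column names come from the list above, and the function name `load_clips` is hypothetical.

```python
import csv
import io

FIELDS = [
    "client_id", "path", "text", "up_votes", "down_votes", "age", "gender",
    "accents", "variant", "segment", "prompt_upvotes", "prompt_reports", "is_edited",
]

# Made-up rows for illustration only.
sample_rows = [
    ["abc123", "clip_0001.mp3", "Cette décision fait grand bruit.", "2", "0",
     "twenties", "female_feminine", "", "fr-metro", "", "1", "0", "0"],
    ["def456", "clip_0002.mp3", "Les jardins sont ouverts au public.", "0", "2",
     "", "", "", "", "", "0", "0", "0"],
]
tsv_text = "\n".join("\t".join(row) for row in [FIELDS, *sample_rows]) + "\n"

def load_clips(handle) -> list[dict]:
    """Read clip rows from a tab-separated file and convert vote counts to int."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    clips = []
    for row in reader:
        row["up_votes"] = int(row["up_votes"])
        row["down_votes"] = int(row["down_votes"])
        clips.append(row)
    return clips

clips = load_clips(io.StringIO(tsv_text))

# Clips with more up-votes than down-votes, and clips with a declared age.
more_up_than_down = [c["path"] for c in clips if c["up_votes"] > c["down_votes"]]
with_age = [c["path"] for c in clips if c["age"]]
print(more_up_than_down, with_age)
```

For a file on disk, the same function could be called with `open(path, encoding="utf-8", newline="")`. Empty demographic fields are expected, since those values are reported only for speakers who opted in.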

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
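
As an illustration of how these columns might be used, the sketch below counts sentences and recorded clips per source and lists sentences with no recordings yet. The rows are invented: the sentence IDs, the `1`/`0` encoding of `is_used` and the clip counts are assumptions. The sentences and source names are taken from elsewhere in this datasheet.

```python
import csv
import io
from collections import Counter

FIELDS = ["sentence_id", "sentence", "variant", "sentence_domain",
          "source", "is_used", "clips_count"]

# Made-up rows for illustration only.
sample_rows = [
    ["s1", "Cette décision fait grand bruit.", "", "", "wiki-1", "1", "3"],
    ["s2", "Les jardins sont ouverts au public.", "", "", "wiki-2", "1", "0"],
    ["s3", "Un paludier est un travailleur qui récolte le sel des marais salants.",
     "", "", "wiki-1", "0", "2"],
]
tsv_text = "\n".join("\t".join(row) for row in [FIELDS, *sample_rows]) + "\n"

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
sentences = list(reader)

# Number of sentences per source.
per_source = Counter(row["source"] for row in sentences)

# Number of recorded clips per source.
clips_per_source = Counter()
for row in sentences:
    clips_per_source[row["source"]] += int(row["clips_count"])

# Sentences that have not been recorded yet.
unrecorded = [row["sentence_id"] for row in sentences if int(row["clips_count"]) == 0]

print(per_source, clips_per_source, unrecorded)
```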

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
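
The `status` column is described as taking one of two values, so a reader can group rows by it and reject anything else. The sketch below does this and adds a net vote count per sentence. The rows, IDs and vote values are invented, and the function name `group_by_status` is hypothetical.

```python
import csv
import io
from collections import defaultdict

FIELDS = ["sentence_id", "sentence", "variant", "sentence_domain",
          "source", "up_votes", "down_votes", "status"]
VALID_STATUSES = {"pending", "rejected"}

# Made-up rows for illustration only.
sample_rows = [
    ["u1", "Cette décision fait grand bruit.", "", "", "sentence-collector", "1", "0", "pending"],
    ["u2", "Les jardins sont ouverts au public.", "", "", "sentence-collector", "0", "2", "rejected"],
    ["u3", "Église mononef, elle ne présente aucun caractère particulier.", "", "", "wiki-1", "0", "0", "pending"],
]
tsv_text = "\n".join("\t".join(row) for row in [FIELDS, *sample_rows]) + "\n"

def group_by_status(handle) -> dict[str, list[dict]]:
    """Group unvalidated sentences by status, adding a net vote count to each row."""
    groups = defaultdict(list)
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["status"] not in VALID_STATUSES:
            raise ValueError(f"unexpected status: {row['status']!r}")
        row["net_votes"] = int(row["up_votes"]) - int(row["down_votes"])
        groups[row["status"]].append(row)
    return dict(groups)

groups = group_by_status(io.StringIO(tsv_text))
print({status: len(rows) for status, rows in groups.items()})
```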

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero v1.0 Universal (CC0-1.0) licence. By downloading this data, you agree not to attempt to determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These values are reported only if the speaker opted in to providing that information.