Mozilla Common Voice Text Language Identification dataset

License: CC0-1.0

Steward: Common Voice

Task: NLP

Release Date: 12/16/2025

Format: TSV

Size: 950.41 MB


Description

A dataset for text-based language identification, containing 19 million sentences from over 300 languages, taken from the Mozilla Common Voice scripted (v23) and spontaneous (v1) speech projects.

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Processes

Intended Use

Training and evaluation of text-based language identification systems.
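As a rough illustration of this use, the sketch below trains a tiny character-trigram classifier (multinomial Naive Bayes with add-one smoothing and a uniform prior) on (sentence, lang) pairs. The function names `trigrams`, `train`, and `predict` are hypothetical and are not part of the dataset or of any Common Voice tooling; the three training sentences are taken from the Sample section. A usable system would train on the "train" split and report scores on "dev" or "test".

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Return the character trigrams of a lowercased, space-padded sentence."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(examples):
    """Count trigrams per language from an iterable of (sentence, lang) pairs."""
    counts = defaultdict(Counter)
    for sentence, lang in examples:
        counts[lang].update(trigrams(sentence))
    vocab = set()
    for lang_counts in counts.values():
        vocab.update(lang_counts)
    return counts, len(vocab)


def predict(model, sentence):
    """Return the language whose smoothed trigram model scores the sentence highest."""
    counts, vocab_size = model
    grams = trigrams(sentence)
    best_lang, best_score = None, -math.inf
    for lang, lang_counts in counts.items():
        total = sum(lang_counts.values())
        # Add-one smoothing keeps unseen trigrams from producing log(0).
        score = sum(
            math.log((lang_counts[g] + 1) / (total + vocab_size)) for g in grams
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Three rows from the Sample section, reduced to (sentence, lang).
examples = [
    ("Il più grande di loro fu Giovanni evangelista.", "it"),
    ("À boire! répéta pour la troisième fois Quasimodo pantelant.", "fr"),
    ("The underground system develops in the limestone and marl-limestone "
     "of the superior Cretaceous.", "en"),
]
model = train(examples)
```

With only three training sentences this is a toy; it shows the shape of the task (text in, language code out) rather than a recommended model.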

Metadata

The text in this dataset comes from the Mozilla Common Voice projects for over 300 languages, including both scripted speech (v23) and spontaneous speech (v1). It is intended for text-based language identification.

Structure

Two files are included: (1) mcv_text_lid.tsv, the main dataset file, which contains only validated "sentences" (single sentences for scripted speech, full transcriptions for spontaneous speech) together with their language codes, and (2) /language_histogram.csv, a histogram of the number of sentences per language. The columns in mcv_text_lid.tsv are as follows:

| column | description |
| --- | --- |
| id | unique numerical identifier for each sentence |
| sentence | sentence text for scripted speech, transcription for spontaneous speech |
| lang | language code corresponding to the text |
| sentence_domain | the sentence domain (optional; scripted speech only) |
| source | source of the sentence (optional; scripted speech only), as reported by the person who submitted the sentence to a Common Voice project |
| style | "scripted" or "spontaneous" |
| split | one of "train", "dev", or "test"; these splits do not correspond to the MCV splits |
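A minimal sketch of reading these columns with Python's standard `csv` module follows. It assumes the file has a header row with the column names listed above and that fields are tab-separated without quoting; neither detail is confirmed by this page. The helper name `iter_rows` is hypothetical, and the in-memory `SAMPLE` string stands in for the real file.

```python
import csv
import io
from collections import Counter


def iter_rows(handle, split=None):
    """Yield rows as dicts from a tab-separated file, optionally for one split."""
    # QUOTE_NONE treats quote characters in sentences as ordinary text (assumption).
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if split is None or row["split"] == split:
            yield row


# Small stand-in for mcv_text_lid.tsv, built from two rows of the Sample section.
SAMPLE = (
    "id\tsentence\tlang\tsentence_domain\tsource\tstyle\tsplit\n"
    "13914460\tIl più grande di loro fu Giovanni evangelista.\tit\t\twiki\tscripted\ttrain\n"
    "10580213\tÀ boire! répéta pour la troisième fois Quasimodo pantelant.\tfr\t\t"
    "sentence-collector\tscripted\ttest\n"
)

train_rows = list(iter_rows(io.StringIO(SAMPLE), split="train"))

# Per-language sentence counts, analogous to what the histogram file reports.
langs = Counter(row["lang"] for row in train_rows)
```

For the real file, the handle would be something like `open("mcv_text_lid.tsv", encoding="utf-8", newline="")`. Iterating row by row avoids loading the whole 950 MB file into memory.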

Preprocessing

The data was deduplicated on sentence+lang.
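One way such a deduplication could look is sketched below. The page does not say which duplicate was kept or whether the text was normalized first, so this version assumes exact string matching and keeps the first occurrence; the function name `dedupe` is hypothetical.

```python
def dedupe(rows):
    """Drop rows whose (sentence, lang) pair has already been seen."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["sentence"], row["lang"])
        if key not in seen:
            seen.add(key)
            unique.append(row)  # keep the first occurrence (assumption)
    return unique


rows = [
    {"id": "1", "sentence": "Taxi", "lang": "en"},
    {"id": "2", "sentence": "Taxi", "lang": "en"},  # exact duplicate, dropped
    {"id": "3", "sentence": "Taxi", "lang": "fr"},  # same text, other language, kept
]
unique_rows = dedupe(rows)
```

Because the key includes the language code, an identical string can still appear under several languages, which matters when evaluating on short or shared-vocabulary sentences.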

Sample

| id | sentence | lang | sentence_domain | source | style | split |
| --- | --- | --- | --- | --- | --- | --- |
| 13914460 | Il più grande di loro fu Giovanni evangelista. | it | | wiki | scripted | train |
| 7896754 | The underground system develops in the limestone and marl-limestone of the superior Cretaceous. | en | | covost2-xx_en | scripted | train |
| 10580213 | À boire! répéta pour la troisième fois Quasimodo pantelant. | fr | | sentence-collector | scripted | test |
| 14529549 | 岩手県宮古市 | ja | | ????? | scripted | train |
| 1076937 | Y en ixas va alentar de segundas. | an | general | self | scripted | train |
| 18424113 | ข้อบังคับการประชุม | th | | sentence-collector | scripted | train |