Mozilla Common Voice Text Language Identification dataset
License:
CC0-1.0
Steward:
Common Voice
Task: NLP
Release Date: 12/16/2025
Format: TSV
Size: 950.41 MB
Description
A dataset for text-based language identification, containing 19 million sentences from over 300 languages. The sentences are taken from the Mozilla Common Voice scripted speech (v23) and spontaneous speech (v1) projects.
Specifics
Processes
Intended Use
Training and evaluation of text-based language identification systems.
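As an illustration of that use, the following is a minimal sketch of a character-trigram baseline with add-one smoothing. It is not a method described by the dataset authors. The function names (`char_ngrams`, `train`, `predict`, `accuracy`) are hypothetical, and the inputs are assumed to be (sentence, lang) pairs such as those found in the `sentence` and `lang` columns.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(examples, n=3):
    """Count character n-grams per language from (sentence, lang) pairs."""
    counts = defaultdict(Counter)
    for sentence, lang in examples:
        counts[lang].update(char_ngrams(sentence, n))
    vocab = set()
    for lang_counts in counts.values():
        vocab.update(lang_counts)
    return {
        "n": n,
        "counts": dict(counts),
        "totals": {lang: sum(c.values()) for lang, c in counts.items()},
        "vocab_size": len(vocab),
    }


def predict(model, sentence):
    """Return the language with the highest smoothed log-likelihood."""
    grams = char_ngrams(sentence, model["n"])
    best_lang, best_score = None, -math.inf
    for lang, lang_counts in model["counts"].items():
        # Add-one smoothing so unseen n-grams do not zero out a language.
        denom = model["totals"][lang] + model["vocab_size"]
        score = sum(math.log((lang_counts[g] + 1) / denom) for g in grams)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


def accuracy(model, examples):
    """Fraction of (sentence, lang) pairs whose language is predicted correctly."""
    examples = list(examples)
    if not examples:
        return 0.0
    correct = sum(predict(model, s) == lang for s, lang in examples)
    return correct / len(examples)
```

In practice one would train on the rows marked "train" and report `accuracy` on "dev" or "test". A simple count-based model like this is only a reference point, and with over 300 languages a stronger classifier is likely needed.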
Metadata
The text in this dataset comes from the Mozilla Common Voice projects for over 300 languages, covering both scripted speech (v23) and spontaneous speech (v1). It is intended for text-based language identification.
Structure
Two files are included: (1) mcv_text_lid.tsv, the main dataset file, which contains only validated "sentences" (single sentences for scripted speech, full transcriptions for spontaneous speech) together with their language codes, and (2) language_histogram.csv, a histogram of the number of sentences per language. The columns in mcv_text_lid.tsv are as follows:
| column | description |
|---|---|
| id | unique numerical identifier for each sentence |
| sentence | sentence text for scripted speech, transcription for spontaneous speech |
| lang | language code corresponding to the text |
| sentence_domain | the sentence domain (optional, and only exists for scripted speech) |
| source | source of the sentence (optional, and only exists for scripted speech). These are reported by the person submitting sentences to a Common Voice project |
| style | "scripted" or "spontaneous" |
| split | one of "train", "dev", or "test". These splits do not correspond to the Mozilla Common Voice (MCV) splits |
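The column layout above can be read with Python's standard `csv` module. The sketch below assumes the file is tab-delimited with a header row and no quote escaping, which the description does not confirm. The helper names (`read_rows`, `group_by_split`, `language_counts`, `load_splits`) are hypothetical.

```python
import csv
from collections import Counter, defaultdict


def read_rows(handle):
    """Yield one dict per row from an open file in the mcv_text_lid.tsv layout."""
    # Assumption: fields are not quote-escaped, so quote characters inside
    # sentences are kept literally instead of being treated as delimiters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


def group_by_split(rows):
    """Return {split: [(sentence, lang), ...]} for "train", "dev" and "test"."""
    splits = defaultdict(list)
    for row in rows:
        splits[row["split"]].append((row["sentence"], row["lang"]))
    return dict(splits)


def language_counts(rows):
    """Count sentences per language code."""
    return Counter(row["lang"] for row in rows)


def load_splits(path="mcv_text_lid.tsv"):
    """Open the main dataset file and group its rows by split."""
    with open(path, encoding="utf-8", newline="") as handle:
        return group_by_split(read_rows(handle))
```

`language_counts` gives per-language totals that could be compared against language_histogram.csv, although the exact layout of that file is not described here.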
Preprocessing
The data was deduplicated on sentence+lang.
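Deduplication on sentence+lang can be sketched as follows. This assumes exact string matching with no normalization of case or whitespace, which the description does not specify. The function name `deduplicate` is hypothetical.

```python
def deduplicate(rows):
    """Keep the first row for each (sentence, lang) pair, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        # Assumption: keys are compared exactly, without any normalization.
        key = (row["sentence"], row["lang"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Because the key includes `lang`, an identical sentence that appears under two different language codes is kept once per language.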
Sample
| id | sentence | lang | sentence_domain | source | style | split |
|---|---|---|---|---|---|---|
| 13914460 | Il più grande di loro fu Giovanni evangelista. | it | | wiki | scripted | train |
| 7896754 | The underground system develops in the limestone and marl-limestone of the superior Cretaceous. | en | | covost2-xx_en | scripted | train |
| 10580213 | À boire! répéta pour la troisième fois Quasimodo pantelant. | fr | | sentence-collector | scripted | test |
| 14529549 | 岩手県宮古市 | ja | | ????? | scripted | train |
| 1076937 | Y en ixas va alentar de segundas. | an | general | self | scripted | train |
| 18424113 | ข้อบังคับการประชุม | th | | sentence-collector | scripted | train |