TidyVoiceX2_ASV

License:

CC0-1.0

Steward:

TidyVoice2026 Challenge

Task: OTH

Release Date: 1/26/2026

Format: WAV

Size: 23.11 GB

Description

This dataset is designed for speaker verification and is built from the Mozilla Common Voice corpus. It covers approximately 40 additional languages beyond those included in TidyVoiceX_ASV. Every speaker in the dataset has recordings in more than one language. We use this multilingual overlap to construct trial pairs for studying cross-lingual variation in speaker verification. This dataset served as the evaluation set for the TidyVoice 2026 Challenge.
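As an illustration of how trial pairs can be derived from speakers who appear in several languages, here is a minimal sketch. It is not the procedure used to build this dataset: the utterance records, the field layout, and the `make_trials` helper are all hypothetical, and the sketch simply pairs every two utterances and tags each pair by speaker match and language match.

```python
from collections import Counter
from itertools import combinations

# Hypothetical utterance records: (utterance_id, speaker_id, language).
# These values are invented for illustration and are not taken from the dataset.
utterances = [
    ("u1", "spk1", "de"),
    ("u2", "spk1", "fr"),
    ("u3", "spk2", "de"),
    ("u4", "spk2", "it"),
]


def make_trials(utts):
    """Pair every two utterances and tag each pair by speaker and language match."""
    trials = []
    for (id_a, spk_a, lang_a), (id_b, spk_b, lang_b) in combinations(utts, 2):
        label = "target" if spk_a == spk_b else "nontarget"
        condition = "same-language" if lang_a == lang_b else "cross-language"
        trials.append((id_a, id_b, label, condition))
    return trials


trials = make_trials(utterances)

# Count trials per (label, condition) to see how the conditions are balanced.
counts = Counter((label, condition) for _, _, label, condition in trials)
for key, n in sorted(counts.items()):
    print(key, n)
```

Splitting trials by language condition in this way makes it possible to report verification performance separately for same-language and cross-language pairs.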

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

According to Mozilla’s usage rules, it is forbidden to use this dataset for speaker identification or for recovering a speaker’s identity.

Forbidden Usage

According to Mozilla’s usage rules, it is forbidden to use this dataset for speaker identification or for recovering a speaker’s identity. The data MUST only be used for speaker verification tasks.

Processes

Intended Use

All rules and restrictions are the same as those of the original Mozilla Common Voice datasets.

Metadata

TidyVoiceX2_ASV Dataset

Overview

TidyVoiceX2 is a large-scale, multilingual speech corpus curated for cross-lingual speaker verification research. It serves as the evaluation set for the TidyVoiceX_ASV dataset. Derived from Mozilla Common Voice (MCV), it is designed to isolate the effect of language switching across multiple languages, enabling focused investigation of language-independent speaker embeddings.

This dataset is part of the TidyVoice2026 Challenge, an official challenge at Interspeech 2026 focused on advancing cross-lingual speaker verification systems.

🌐 Challenge Website: https://tidyvoice2026.github.io/

Download Links

Trial Pairs for Evaluation Set

📥 Download: Evaluation Set Trial Pairs

The file lists the trial pairs for the evaluation set, without labels, formatted according to the challenge specifications.
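Because the trial pairs are unlabeled, a system typically produces one similarity score per pair. The sketch below shows one common way to do that, using cosine similarity between speaker embeddings. The line layout assumed here (two whitespace-separated file names per line), the file names, and the toy embeddings are assumptions for illustration only; the actual file format is defined by the challenge specifications on the challenge website.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_trials(trial_lines, embeddings):
    """Score each trial line, ASSUMED to hold two whitespace-separated file names."""
    scores = []
    for line in trial_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        enroll, test = line.split()
        scores.append((enroll, test, cosine(embeddings[enroll], embeddings[test])))
    return scores


# Toy embeddings keyed by hypothetical file names; a real system would
# extract these vectors from the WAV files with a speaker embedding model.
embeddings = {
    "a.wav": [1.0, 0.0],
    "b.wav": [0.0, 1.0],
    "c.wav": [2.0, 0.0],
}
trial_lines = ["a.wav b.wav", "a.wav c.wav"]

scores = score_trials(trial_lines, embeddings)
for enroll, test, score in scores:
    print(enroll, test, f"{score:.3f}")
```

Higher scores indicate that the two recordings are more likely to come from the same speaker; how scores must be submitted is governed by the challenge rules.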

Use Cases

This dataset is specifically designed for:

Cross-lingual speaker verification research
Language-independent speaker embedding development
Multilingual speaker recognition systems
Fairness and bias evaluation in speaker verification

Citation

If you use the Tidy-X dataset in your research, please cite:

@inproceedings{farhadipour2026tidyvoice,
  title={TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice},
  author={Farhadipour, Aref and Marquenie, Jan and Madikeri, Srikanth and Chodroff, Eleanor},
  booktitle={Interspeech 2026},
  year={2026}
}

License

This dataset is derived from Mozilla Common Voice. Please refer to the Mozilla Common Voice license for usage terms and conditions.

Contact

For questions, issues, or contributions, please visit:

Challenge Website: https://tidyvoice2026.github.io/
Email: aref.farhadipour@uzh.ch, areffarhadi@gmail.com