TidyVoiceX_ASV

License: CC0-1.0

Steward: TidyVoice2026 Challenge

Task: OTH

Release Date: 11/27/2025

Format: WAV

Size: 36.72 GB


Description

This dataset is designed for speaker verification and is built from the Mozilla Common Voice corpus across 40 languages. It includes approximately 5,000 speakers, each with recordings in more than one language. We use this multilingual overlap to construct trial pairs that expose cross-lingual variation in the speaker verification task.

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Forbidden Usage

According to Mozilla’s usage rules, it is forbidden to use this dataset for speaker identification or for recovering a speaker’s identity. The data MUST only be used for speaker verification tasks.

Processes

Intended Use

All rules and restrictions are the same as those of the original Mozilla Common Voice datasets.

Metadata

TidyVoiceX_ASV Dataset

Overview

TidyVoiceX is a large-scale, multilingual speech corpus specifically curated for cross-lingual speaker verification research. Derived from Mozilla Common Voice (MCV), this dataset is designed to isolate the effect of language switching across multiple languages, enabling focused investigation into language-independent speaker embeddings.

This dataset is part of the TidyVoice2026 Challenge, an official challenge at Interspeech 2026 focused on advancing cross-lingual speaker verification systems.

🌐 Challenge Website: https://tidyvoice2026.github.io/

Dataset Statistics

| Metric | Training Set | Development Set | Total |
|---|---|---|---|
| Speakers | 3,666 | 808 | 4,474 |
| Languages | 40 | 40 | 40 |
| Utterances | 262,000 | 60,000 | 321,711 |
| Duration (hours) | 370 | 87 | 457 |
| Domain | Read Speech | Read Speech | Read Speech |
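
The figures above can be cross-checked with a little arithmetic. The sketch below copies the numbers from the statistics and reports, for each metric, how far the sum of the two splits is from the stated total. The speaker and duration rows add up exactly. The utterance row does not, which suggests (as an assumption, not something the dataset card states) that the per-split utterance counts are rounded. The function name `split_gaps` is illustrative only.

```python
# Figures copied from the Dataset Statistics section: (training, development, total).
STATS = {
    "speakers": (3_666, 808, 4_474),
    "utterances": (262_000, 60_000, 321_711),
    "duration_hours": (370, 87, 457),
}


def split_gaps(stats):
    """Return training + development - total for each metric (0 means an exact match)."""
    return {name: train + dev - total for name, (train, dev, total) in stats.items()}


if __name__ == "__main__":
    for metric, gap in split_gaps(STATS).items():
        status = "exact" if gap == 0 else f"off by {gap}"
        print(f"{metric}: {status}")
```

A nonzero gap here only flags a row worth double-checking against the released file lists; it does not by itself indicate an error in the data.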

Language Coverage

Training Languages (40 total)

The Tidy-X dataset includes the following 40 languages exclusively for training:

  1. ab (Abkhazian)

  2. ar (Arabic)

  3. ba (Bashkir)

  4. be (Belarusian)

  5. bg (Bulgarian)

  6. bn (Bengali)

  7. ca (Catalan)

  8. cv (Chuvash)

  9. cy (Welsh)

  10. de (German)

  11. dv (Dhivehi)

  12. el (Greek)

  13. en (English)

  14. fa (Persian)

  15. fr (French)

  16. ha (Hausa)

  17. hi (Hindi)

  18. hsb (Upper Sorbian)

  19. hy-AM (Armenian)

  20. ja (Japanese)

  21. ka (Georgian)

  22. lg (Luganda)

  23. lt (Lithuanian)

  24. mk (Macedonian)

  25. ml (Malayalam)

  26. mr (Marathi)

  27. nl (Dutch)

  28. or (Odia)

  29. pl (Polish)

  30. pt (Portuguese)

  31. ru (Russian)

  32. ta (Tamil)

  33. th (Thai)

  34. tk (Turkmen)

  35. tr (Turkish)

  36. ug (Uyghur)

  37. uz (Uzbek)

  38. yo (Yoruba)

  39. yue (Cantonese)

  40. zh-CN (Chinese)

Download Links

Trial Pairs for Development Set

📥 Download: Development Set Trial Pairs

The trial pairs file contains the trial pairs for the development set, formatted according to the challenge specifications.
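
The exact layout of the trial pairs file is defined by the challenge specifications and is not reproduced here. As a minimal sketch, the code below assumes a common speaker verification convention: one trial per line, with three whitespace-separated fields `label enroll test`, where label `1` marks a same-speaker pair and `0` a different-speaker pair. That layout, the `Trial` type, and the function names are assumptions for illustration; adjust the field order to match the actual file.

```python
from collections import Counter
from typing import NamedTuple


class Trial(NamedTuple):
    label: int   # assumed convention: 1 = same speaker, 0 = different speakers
    enroll: str  # enrollment utterance identifier or path
    test: str    # test utterance identifier or path


def parse_trials(lines):
    """Parse lines of the ASSUMED form 'label enroll test' into Trial records."""
    trials = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) != 3 or parts[0] not in ("0", "1"):
            raise ValueError(f"line {number}: expected 'label enroll test', got {line!r}")
        trials.append(Trial(int(parts[0]), parts[1], parts[2]))
    return trials


def label_counts(trials):
    """Count same-speaker (target) and different-speaker (non-target) trials."""
    return Counter("target" if t.label == 1 else "non-target" for t in trials)
```

To use it on a downloaded file, pass an open file handle, for example `parse_trials(open(path, encoding="utf-8"))`, where `path` is wherever the trial pairs file was saved.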

Key Features

  • Multilingual Scope: 40 training languages covering diverse language families

  • Cross-lingual Focus: Designed to evaluate speaker verification under language mismatch conditions

  • Pseudonymized IDs: All speaker identities are pseudonymized to protect privacy

  • Controlled Domain: Read speech domain minimizes stylistic and phonetic variability

  • Open Access: Publicly available for research purposes

  • Standardized Splits: Clear train/development separation for reproducible research

  • Audio Format: WAV format, 16 kHz sampling frequency
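
Since the audio is described as WAV at 16 kHz, a downloaded copy can be checked before training. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV files. It assumes the files carry a `.wav` extension and sit somewhere under a single root directory; the directory layout and the function names are assumptions, not part of the dataset documentation.

```python
import wave
from pathlib import Path

EXPECTED_RATE_HZ = 16_000  # sampling frequency stated in the dataset card


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / rate
    return rate, channels, duration


def find_unexpected_rates(root, expected_rate=EXPECTED_RATE_HZ):
    """List (path, rate) for WAV files under `root` whose rate differs from `expected_rate`."""
    mismatches = []
    for path in sorted(Path(root).rglob("*.wav")):
        rate, _, _ = wav_info(path)
        if rate != expected_rate:
            mismatches.append((path, rate))
    return mismatches
```

An empty list from `find_unexpected_rates` means every file found matches the expected rate; any entries point to files that would need resampling or a closer look.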

Use Cases

This dataset is specifically designed for:

  • Cross-lingual speaker verification research

  • Language-independent speaker embedding development

  • Multilingual speaker recognition systems

  • Fairness and bias evaluation in speaker verification

Citation

If you use the Tidy-X dataset in your research, please cite:

@inproceedings{farhadipour2026tidyvoice,
  title={TidyVoice Challenge: Cross-Lingual Speaker Verification},
  author={Farhadipour, Aref and Marquenie, Jan and Madikeri, Srikanth and Vukovic, Teodora and Dellwo, Volker and Reid, Kathy and Tyers, Francis M. and Siegert, Ingo and Chodroff, Eleanor},
  booktitle={Interspeech 2026},
  year={2026}
}

License

This dataset is derived from Mozilla Common Voice. Please refer to the Mozilla Common Voice license for usage terms and conditions.

Contact

For questions, issues, or contributions, please visit:

Acknowledgments

This dataset was created as part of the TidyVoice2026 Challenge at Interspeech 2026, developed by researchers from the University of Zurich, Otto-von-Guericke-University Magdeburg, Mozilla Foundation, Indiana University and Australian National University.