Chuvash TTS
License: CC-BY-SA-4.0
Steward: Taruen
Task: TTS
Release Date: 4/2/2026
Format: PARQUET
Size: 854.02 MB
Description
Chuvash TTS is a speech dataset sourced from the Turkic_TTS GitHub repository. It comprises 4 hours and 8 minutes of recorded news article text from chuvash.org and 1 hour and 1 minute of recorded digits, all read by a single female speaker at a rapid tempo. The dataset is provided as a Parquet file containing segmented audio, Chuvash transcriptions, and original filenames, and is intended specifically for text-to-speech (TTS) research and development in the Chuvash language.
Specifics
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Considerations
Restrictions/Special Constraints
The original dataset authors must be cited if this dataset is used in your work.
Forbidden Usage
None
Processes
Intended Use
Text-to-speech (TTS) research and development in the Chuvash language.
Metadata
Chuvash TTS
Dataset Summary
Chuvash TTS is a speech dataset sourced from the Turkic_TTS GitHub
repository. It has been packaged and uploaded to the Mozilla Data
Collective (MDC) with the explicit permission of the original author.
The dataset comprises recordings of text extracted from news articles
on chuvash.org and of a list of digits, all read by a single female
speaker at a rapid tempo. The dataset is intended for text-to-speech
(TTS) research and development in the Chuvash language. The license
and citation information presented in this dataset card was provided
by the original dataset authors and is included with their permission.
Dataset Structure
Parts: The dataset contains two main subsets, indicated in the
dataset_name column:
chuvash_org_news: Contains news article text from chuvash.org.
digits: Contains recordings of digits.
Data Fields:
audio: The segmented audio file.
text: The corresponding Chuvash transcription.
file_name: The original filename. For the chuvash_org_news subset, it is also the ID of the news page, in the form https://chuvash.org/news/{id}.html.
dataset_name: Indicates the subset (chuvash_org_news or digits).
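The fields above can be explored without decoding any audio. The following is a minimal sketch, not an official loader: the Parquet path is supplied by the caller, the helper names are hypothetical, and it assumes that file_name holds the bare page ID with no extension or segment suffix, which the card does not confirm.

```python
from collections import Counter

# Column names as listed under "Data Fields"; the audio column is skipped on purpose.
TEXT_COLUMNS = ["text", "file_name", "dataset_name"]


def load_rows(parquet_path):
    """Read the non-audio columns into a list of dicts (needs pandas and pyarrow)."""
    import pandas as pd  # imported lazily so the helpers below run without pandas

    frame = pd.read_parquet(parquet_path, columns=TEXT_COLUMNS)
    return frame.to_dict("records")


def count_by_subset(rows):
    """Count rows per subset using the dataset_name column."""
    return Counter(row["dataset_name"] for row in rows)


def news_page_url(row):
    """Return the chuvash.org page URL for a news row, or None for other subsets.

    Assumption: file_name is exactly the page id. If the real values carry an
    extension or a segment suffix, strip it before formatting.
    """
    if row["dataset_name"] != "chuvash_org_news":
        return None
    return f"https://chuvash.org/news/{row['file_name']}.html"
```

Typical use would be `rows = load_rows(path_to_parquet)` followed by `count_by_subset(rows)`. Skipping the audio column keeps memory use low when only the transcriptions are needed.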
Data Processing
Text Processing:
No normalization or text preprocessing was applied to the text.
Audio Processing:
All audio was segmented by splitting the complete recordings at pauses.
3 seconds were trimmed from the beginning and end of each file.
Technical Details:
| Dataset Type | speech corpus for TTS |
| Language | cv/chv, Chuvash |
| Speech Style | scripted monologue |
| Content | news and list of digits |
| Audio Parameters | 44.1 kHz, 32 bits, mono |
| File Format | WAV (PCM), TXT (UTF-8) |
| Recording Environment | quiet indoor environment |
Total Duration:
News: 04:08:49 (4 hours, 8 minutes, 49 seconds)
Digits: 01:01:39 (1 hour, 1 minute, 39 seconds)
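The card does not state a combined duration. As a small worked example, the two figures above can be summed as follows; the helper names are illustrative only.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert a number of seconds back to an HH:MM:SS string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


news = to_seconds("04:08:49")    # 14929 seconds
digits = to_seconds("01:01:39")  # 3699 seconds
total = news + digits

print(to_hms(total))                    # 05:10:28
print(round(100 * news / total, 1))     # 80.1 (percent of audio that is news)
```

By this arithmetic the corpus holds a little over five hours of audio, roughly four fifths of it news.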
Usage Considerations
The dataset was not extensively preprocessed. Users are encouraged to perform additional preprocessing (e.g., normalization, cleaning, or re-segmentation) as needed for their specific applications.
There is variation in how abbreviations and shortenings are pronounced (e.g., чӑв., Ф.П. Павлов, etc.), as well as in the treatment of special signs such as @. Users should be aware of these inconsistencies, especially for tasks requiring strict normalization or uniformity in pronunciation.
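One way to act on these considerations is to flag transcriptions that need a manual look before training. The sketch below is a heuristic under stated assumptions, not a normalizer: the function name and flag labels are hypothetical, and it only detects the @ sign and single-letter initials such as those in "Ф.П. Павлов".

```python
import re

# A single letter followed by a period, such as the initials in "Ф.П. Павлов".
# [^\W\d_] matches any Unicode letter in Python's re module.
INITIAL_RE = re.compile(r"\b[^\W\d_]\.")


def flag_text(text):
    """Return a set of labels describing why a transcription may need review."""
    flags = set()
    if "@" in text:
        flags.add("at_sign")
    if INITIAL_RE.search(text):
        flags.add("initials")
    return flags


print(flag_text("Ф.П. Павлов"))  # {'initials'}
```

Multi-letter abbreviations such as чӑв. are not caught, because a short word followed by a period cannot be told apart from an ordinary sentence ending without a word list. A single-letter word at the end of a sentence will also be flagged, so the output is a review queue, not a verdict.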
Citation
If you use this dataset in your work, please cite:
@misc{tyers2018speechsynthesis,
title={{Speech synthesis on a shoe string}},
author={Tyers, F. M.},
year={2018},
howpublished={Presentation at Computational Methods for
Endangered Language Documentation and Description},
address={Paris, France},
date={2018-02-01},
}
License
This dataset is distributed under the CC BY-SA 4.0 license.