Chuvash TTS

License:

CC-BY-SA-4.0

Steward:

Taruen

Task: TTS

Release Date: 4/2/2026

Format: PARQUET

Size: 854.02 MB

Description

Chuvash TTS is a speech dataset sourced from the Turkic_TTS GitHub repository. It comprises 4 hours and 8 minutes of recordings of news article text from chuvash.org and 1 hour and 1 minute of recorded digits, all read by a single female speaker at a rapid tempo. The dataset is provided as a Parquet file containing segmented audio, Chuvash transcriptions, and original filenames, and is intended specifically for text-to-speech (TTS) research and development in the Chuvash language.

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Considerations

Restrictions/Special Constraints

The original dataset authors must be cited if this dataset is used in your work.

Forbidden Usage

None

Processes

Intended Use

Text-to-speech (TTS) research and development in the Chuvash language.

Metadata

Chuvash TTS

[Original repository]

Dataset Summary

Chuvash TTS is a speech dataset sourced from the Turkic_TTS GitHub repository. It has been packaged and uploaded to the Mozilla Data Collective (MDC) with the explicit permission of the original author. The dataset comprises recordings of text extracted from news articles on chuvash.org and of a list of digits, all read by a single female speaker at a rapid tempo. The dataset is intended for text-to-speech (TTS) research and development in the Chuvash language. The license and citation information presented in this dataset card was provided by the original dataset authors and is included with their permission.

Dataset Structure

Parts: The dataset contains two main subsets, indicated in the dataset_name column:

chuvash_org_news: Contains news article text from chuvash.org.
digits: Contains recordings of digits.

Data Fields:

audio: The segmented audio file.
text: The corresponding Chuvash transcription.
file_name: The original filename. For the chuvash_org_news subset, it also gives the id of the source news page, in the form https://chuvash.org/news/{id}.html.
dataset_name: Indicates the subset (chuvash_org_news or digits).
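
As an illustration of how the file_name and dataset_name fields relate, here is a minimal sketch that rebuilds the source page URL for a news row. It works on plain dictionaries so that it runs without the Parquet file; in practice the rows could come from a reader such as pandas.read_parquet. The helper name news_url and the sample values are hypothetical, and the code assumes file_name holds the page id, optionally followed by a file extension, which should be checked against the actual data.

```python
from pathlib import PurePosixPath

# URL pattern taken from the file_name field description.
NEWS_URL_TEMPLATE = "https://chuvash.org/news/{id}.html"


def news_url(row):
    """Return the source page URL for a chuvash_org_news row, else None."""
    if row["dataset_name"] != "chuvash_org_news":
        return None
    # Assumption: file_name is the page id, possibly with a file extension.
    page_id = PurePosixPath(row["file_name"]).stem
    return NEWS_URL_TEMPLATE.format(id=page_id)


# Made-up rows for illustration only; real ids and names will differ.
rows = [
    {"file_name": "12345.wav", "dataset_name": "chuvash_org_news"},
    {"file_name": "digits_01.wav", "dataset_name": "digits"},
]
urls = [news_url(r) for r in rows]
print(urls)
```

Rows from the digits subset return None, since the field description only ties file_name to a news page for the chuvash_org_news subset.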

Data Processing

Text Processing:

No normalization or text preprocessing was applied to the text.

Audio Processing:

All audio was segmented by splitting the complete recordings at pauses.
3 seconds were trimmed from the beginning and end of each file.

Technical Details:


Dataset Type: speech corpus for TTS
Language: cv/chv, Chuvash
Speech Style: scripted monologue
Content: news and a list of digits
Audio Parameters: 44.1 kHz, 32 bits, mono
File Format: WAV (PCM), TXT (UTF-8)
Recording Environment: quiet indoor environment

Total Duration:

News: 04:08:49 (4 hours, 8 minutes, 49 seconds)
Digits: 01:01:39 (1 hour, 1 minute, 39 seconds)
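
The combined duration of the two subsets follows from the figures above. The short sketch below does the arithmetic; the helper names to_seconds and to_hms are illustrative only and are not part of the dataset.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total):
    """Convert a number of seconds back to an HH:MM:SS string."""
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Durations as listed in this dataset card.
durations = {"News": "04:08:49", "Digits": "01:01:39"}
total_seconds = sum(to_seconds(value) for value in durations.values())
total = to_hms(total_seconds)
print(total)  # 05:10:28
```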

Usage Considerations

The dataset was not extensively preprocessed. Users are encouraged to perform additional preprocessing (e.g., normalization, cleaning, or re-segmentation) as needed for their specific applications.
Abbreviations and shortenings (e.g., чӑв. or Ф.П. Павлов) are not pronounced consistently, and neither are special characters such as @. Users should be aware of these inconsistencies, especially for tasks requiring strict normalization or uniformity in pronunciation.
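
One possible way to locate transcriptions affected by these inconsistencies is to flag them for manual review. The sketch below is a rough heuristic and not a procedure used by the dataset authors: it flags any text containing @ or an initial (a single uppercase letter followed by a period). The function name needs_review and the sample email address are made up for illustration. Lowercase abbreviations such as чӑв. are not caught by this pattern and would need a separate word list.

```python
import re

# A single letter followed by a period, not preceded by another word character.
INITIAL_RE = re.compile(r"(?<!\w)[^\W\d_]\.")


def needs_review(text):
    """Return True if the text contains '@' or an uppercase initial like 'Ф.'."""
    if "@" in text:
        return True
    # Keep only matches whose letter is uppercase, to skip ordinary word endings.
    return any(m.group()[0].isupper() for m in INITIAL_RE.finditer(text))


# The first sample comes from this card; the second is a made-up address.
samples = ["Ф.П. Павлов", "info@chuvash.org", "Павлов"]
flagged = [s for s in samples if needs_review(s)]
print(flagged)
```

Flagged rows can then be checked against the audio to decide how each abbreviation or symbol should be normalized.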

Citation

If you use this dataset in your work, please cite:

@misc{tyers2018speechsynthesis,
    title={{Speech synthesis on a shoe string}},
    author={Tyers, F. M.},
    year={2018},
    howpublished={Presentation at Computational Methods for 
                  Endangered Language Documentation and Description},
    address={Paris, France},
    date={2018-02-01},
}

License

This dataset is distributed under the CC BY-SA 4.0 license.