Saraiki 10 Hours TTS Dataset

License: CC-BY-NC-SA-4.0

Steward: MirasAI

Task: ASR

Release Date: 4/1/2026

Format: WEBM, TSV

Size: 584.44 MB

Description

The Saraiki TTS Dataset – 10 Hours is a curated speech dataset developed to support research and development in text-to-speech (TTS) and related speech technologies for the Saraiki language. The dataset contains approximately 10 hours of audio recordings with corresponding text transcripts, prepared for high-quality speech synthesis and language technology applications. The recordings are designed to reflect clear and natural Saraiki speech suitable for TTS model training, evaluation, and benchmarking. The dataset can support the development of voice generation systems, pronunciation modeling, speech representation learning, and low-resource language speech technologies. It is particularly valuable for advancing speech tools for Saraiki, a language that remains underrepresented in current AI and speech ecosystems. The dataset may also be useful for broader research in computational linguistics, language preservation, accessibility technologies, and inclusive AI development.
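Since the dataset ships audio as WEBM with transcripts in TSV, pairing each clip with its text is usually the first step. The following is a minimal sketch only: the layout of the TSV is not documented here, so the column names `path` and `sentence` and the assumption that audio paths are relative to the TSV file are hypothetical and should be adjusted to the actual header.

```python
import csv
from pathlib import Path


def load_manifest(tsv_path, audio_col="path", text_col="sentence"):
    """Return (audio_path, transcript) pairs from a tab-separated manifest.

    audio_col and text_col are hypothetical column names, not taken from
    the dataset documentation; change them to match the real TSV header.
    """
    tsv_path = Path(tsv_path)
    pairs = []
    with tsv_path.open(newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Assumption: audio paths are stored relative to the TSV location.
            audio_path = tsv_path.parent / row[audio_col]
            pairs.append((audio_path, row[text_col].strip()))
    return pairs
```

Decoding the WEBM audio itself is left out of the sketch, because it depends on which audio library is available in the training environment.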

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

This dataset may be used for research, educational, and other non-commercial purposes with proper attribution. Any redistributed or adapted versions must be shared under the same license terms.

Forbidden Usage

Any attempt to identify speakers is prohibited. Voice cloning or building systems intended to imitate the original speakers is forbidden. Commercial use of the dataset is not allowed without separate permission.
