Saraiki 10 Hours TTS Dataset
License: CC-BY-NC-SA-4.0
Steward: MirasAI
Task: ASR
Release Date: 4/1/2026
Format: WEBM, TSV
Size: 584.44 MB
Description
The Saraiki TTS Dataset – 10 Hours is a curated speech dataset developed to support research and development in text-to-speech (TTS) and related speech technologies for the Saraiki language. It contains approximately 10 hours of audio recordings with corresponding text transcripts, prepared for speech synthesis and language technology applications. The recordings are designed to reflect clear, natural Saraiki speech suitable for TTS model training, evaluation, and benchmarking. The dataset can support the development of voice generation systems, pronunciation modeling, speech representation learning, and low-resource language speech technologies. It is particularly valuable for advancing speech tools for Saraiki, a language that remains underrepresented in current AI and speech ecosystems. It may also be useful for broader research in computational linguistics, language preservation, accessibility technologies, and inclusive AI development.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
This dataset may be used for research, educational, and other non-commercial purposes with proper attribution. Any redistributed or adapted versions must be shared under the same license terms.
Forbidden Usage
Any attempt to identify speakers is prohibited. Voice cloning or building systems intended to imitate the original speakers is forbidden. Commercial use of the dataset is not allowed without separate permission.