Urdu Multi-Speaker TTS Dataset

License:

CC-BY-NC-4.0

Steward:

Community

Task: TTS

Release Date: 3/18/2026

Format: WEBM, CSV

Size: 514.54 MB

Description

This dataset is an Urdu text-to-speech corpus designed for speech technology development and related computational research. It contains approximately 10 hours of speech from 3 speakers (2 male, 1 female). The data is distributed across 36 zip files; each zip file includes a folder of audio files and a CSV file that maps each audio file to its transcript. The material is drawn from newspapers, literature, and articles, providing a mix of formal, narrative, and informational language suitable for Urdu TTS, corpus creation, and speaker-based speech modeling.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is intended for research, educational, and speech technology development purposes only, and it must not be used in ways that violate privacy, misrepresent the speakers or content, or cause harm.

Forbidden Usage

You agree not to identify speakers, clone or imitate their voices, or use this dataset to train chatbots, large language models, or any deceptive, harmful, or privacy-violating systems.

Processes

Ethical Review

All speakers and data providers were informed about the purpose of the dataset, its intended research and computational uses, and relevant privacy considerations before the data was collected and shared.

Intended Use

This dataset is intended for use in developing and evaluating Urdu text-to-speech systems and related speech and language technology applications.

Metadata

Language

Urdu is an Indo-Aryan language spoken widely in Pakistan and India, with additional speaker communities across the global diaspora. It is used in education, media, literature, journalism, and everyday communication, and it is an important language for speech technology, corpus development, and natural language processing.

Script

Urdu is written in a Perso-Arabic script, typically in the Nastaliq style. The core alphabet includes ا، ب، پ، ت، ٹ، ث، ج، چ، ح، خ، د، ڈ، ذ، ر، ڑ، ز، ژ، س، ش، ص، ض، ط، ظ، ع، غ، ف، ق، ک، گ، ل، م، ن، ں، و، ہ، ھ، ء، ی، ے. The text in the dataset follows standard Urdu orthography used in written materials prepared for speech applications.
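As a rough illustration of how the alphabet above could be used during transcript cleaning, the Python sketch below applies Unicode NFC normalization and reports letters that fall outside the listed set. The function names are hypothetical. The three extra letters (آ, ؤ, ئ) are an assumption added here: they are not in the core list, but they occur in ordinary Urdu text, including the sample rows of this dataset.

```python
import unicodedata

# Core alphabet as listed in this section.
CORE_LETTERS = set("ابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمنںوہھءیے")
# Assumption (not from the list above): alef with madda, waw with hamza,
# yeh with hamza, written as escapes to keep the code points explicit.
EXTRA_LETTERS = set("\u0622\u0624\u0626")
ALLOWED_LETTERS = CORE_LETTERS | EXTRA_LETTERS


def normalize_urdu(text: str) -> str:
    """Return the NFC form, so base letter + combining mark pairs are composed."""
    return unicodedata.normalize("NFC", text)


def unexpected_letters(text: str) -> set:
    """Return alphabetic characters that are not in ALLOWED_LETTERS.

    Digits, punctuation, spaces, and combining diacritics are ignored
    because str.isalpha() is False for them.
    """
    normalized = normalize_urdu(text)
    return {ch for ch in normalized if ch.isalpha() and ch not in ALLOWED_LETTERS}
```

A flagged character is not necessarily an error. It may be a Latin letter, or an Arabic code point that looks like an Urdu letter, and whether to map or keep it is a project decision.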

Size

The dataset is distributed as 36 zip files that together contain approximately 10 hours of speech from 3 speakers.

Source

This dataset is an Urdu text-to-speech corpus consisting of audio recordings and transcript mappings prepared for speech technology development. It includes 3 speakers in total (2 male, 1 female). The material is drawn from newspapers, literature, and articles.

Data Structure

The dataset is distributed as 36 zip files.
Each zip file contains a CSV mapping file and a folder of audio files.
Each CSV file links audio file names to their transcripts; the sample in this card uses the columns audio_filename and sentence.
The recordings come from 3 speakers in total (2 male, 1 female).
The total duration of the dataset is approximately 10 hours.
The structure supports text-to-speech development, corpus organization, and speaker-based analysis.
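The layout described above can be checked per archive with a short Python sketch like the following. It assumes the mapping file ends in .csv and the audio files end in .webm; the function names and the "exactly one CSV" rule are illustrative choices, not part of the dataset documentation.

```python
import zipfile
from pathlib import PurePosixPath


def classify_members(member_names):
    """Split archive member names into CSV mapping files and WEBM audio files."""
    csv_files, audio_files = [], []
    for name in member_names:
        if name.endswith("/"):  # directory entry, not a file
            continue
        suffix = PurePosixPath(name).suffix.lower()
        if suffix == ".csv":
            csv_files.append(name)
        elif suffix == ".webm":
            audio_files.append(name)
    return csv_files, audio_files


def inspect_archive(zip_path):
    """Open one zip file and report whether it has a mapping file and audio."""
    with zipfile.ZipFile(zip_path) as archive:
        csv_files, audio_files = classify_members(archive.namelist())
    return {
        "csv_files": csv_files,
        "audio_count": len(audio_files),
        # Assumed rule: one mapping file and at least one audio file.
        "looks_complete": len(csv_files) == 1 and len(audio_files) > 0,
    }
```

Running inspect_archive over all 36 zip files before extraction gives a quick view of any archive that deviates from the expected layout.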

Domain

The text and recordings in this dataset are drawn from newspapers, literature, and articles. This provides a mix of formal, narrative, and informational language suitable for Urdu TTS and related speech technology tasks.

Recommended Processing

Extract all 36 zip files into a consistent directory structure.
Verify that each zip file contains both a CSV mapping file and its corresponding audio folder.
Merge or index the CSV mapping files into a unified metadata table if needed.
Match each audio file with its corresponding transcript entry from the CSV file.
Standardize audio into a consistent format such as WAV with uniform sampling rate, bit depth, and channel settings.
Normalize Urdu Unicode text to ensure consistent character representation.
Clean and normalize transcripts for punctuation, spacing, and orthographic consistency.
Preserve speaker metadata for multi-speaker TTS modeling and analysis.
Create metadata tables linking zip file, audio file name, speaker ID, gender, duration, and transcript text.
Validate audio-text pairings and check for missing, duplicated, or mismatched files.
Prepare the processed data for downstream tasks such as TTS training, ASR experiments, corpus indexing, and speaker-based speech modeling.
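Several of the steps above (indexing the mapping files, validating audio-text pairings, and standardizing audio) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the column names come from the sample in this card, while the function names, the utf-8-sig encoding, and the 22050 Hz mono 16-bit target are choices made here for illustration. The last helper only builds an ffmpeg command line; it does not run it.

```python
import csv
from collections import Counter


def load_mapping(csv_path, source_zip):
    """Read one mapping file into a list of dicts tagged with their source zip."""
    rows = []
    # utf-8-sig is an assumption; it also reads plain UTF-8 without a BOM.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            rows.append({
                "zip": source_zip,
                "audio_filename": (row.get("audio_filename") or "").strip(),
                "sentence": (row.get("sentence") or "").strip(),
            })
    return rows


def check_pairings(rows, audio_names):
    """Report missing, unlisted, and duplicated audio files plus empty transcripts."""
    listed = Counter(row["audio_filename"] for row in rows)
    on_disk = set(audio_names)
    return {
        "missing_audio": sorted(name for name in listed if name not in on_disk),
        "unlisted_audio": sorted(on_disk - set(listed)),
        "duplicate_rows": sorted(name for name, count in listed.items() if count > 1),
        "empty_transcripts": sorted(
            row["audio_filename"] for row in rows if not row["sentence"]
        ),
    }


def ffmpeg_wav_command(src, dst, sample_rate=22050):
    """Build (not run) an ffmpeg command for mono 16-bit PCM WAV output."""
    return ["ffmpeg", "-y", "-i", str(src), "-ac", "1",
            "-ar", str(sample_rate), "-c:a", "pcm_s16le", str(dst)]
```

Keeping the source zip name on every row makes it possible to trace a problem file back to its archive after the mapping files have been merged into one table.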

Sample

audio_filename,sentence
cce810809be164b683f09d6014da174f.webm,عربی خطاطی ہماری اسلامی ثقافتی میراث کا نمایاں حصہ ہے۔
0de36658119e79067601dd941d10b44f.webm,ہم سب، بالخصوص ہمارے خطاط حضرات اس عظیم ورثے کے امین ہیں۔
fd2491b6bf448078d0b2f515c44d2c2d.webm,یہ ہمارا فرض ہے کہ ہم اس وِرثہ کو نہ صرف ملکی سطح پر عام کریں بلکہ اقوامِ عالم کو بھی اس کی ضوفشانیوں سے روشناس کرائیں۔
eded62dd450e22e0c20499a46a05ae93.webm,اسی جذبے کے تحت قونصلیٹ جنرل آف پاکستان، جدہ نے جدہ شہر میں 01-04 اکتوبر تک خطاطی کی بے حد کامیاب نمائش بعنوان "علم بالقلم" کا انعقاد کیا۔
efd3ec7c978fb29334ad2c9c7f3343f6.webm,اس نمائش میں پاکستان کے چیدہ چیدہ خطاطین نے اپنے منتخب فن پاروں سمیت شرکت کی۔
9308d532f146a4050c258a7350587110.webm,محترم ابنِ کلیم نے اپنے سفر و قیامِ حجاز اور نمائش کے احوال کو الفاظ کی شکل دے کر قارئین کو حجازِ مقدس اور "علم بالقلم" نمائش کی سیر کرانے کی ایک کامیاب کوشش کی۔
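
Rows in this shape can be read with Python's standard csv module, as in the hedged sketch below. It copies the header and the first two sample rows into a string for illustration; parse_rows is a hypothetical helper name. The Arabic comma (U+060C) inside a sentence is a different character from the ASCII comma used as the delimiter, so it does not split the field.

```python
import csv
import io

# Header and first two rows copied from the sample above.
SAMPLE = (
    "audio_filename,sentence\n"
    "cce810809be164b683f09d6014da174f.webm,عربی خطاطی ہماری اسلامی ثقافتی میراث کا نمایاں حصہ ہے۔\n"
    "0de36658119e79067601dd941d10b44f.webm,ہم سب، بالخصوص ہمارے خطاط حضرات اس عظیم ورثے کے امین ہیں۔\n"
)


def parse_rows(text):
    """Parse mapping text into (audio_filename, sentence) tuples."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return [(row["audio_filename"], row["sentence"]) for row in reader]


rows = parse_rows(SAMPLE)
for audio_filename, sentence in rows:
    print(audio_filename, len(sentence))
```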