Central Kurdish TTS dataset 1.0
License: CC-BY-4.0
Steward: The University of Melbourne
Task: TTS
Release Date: 12/15/2025
Format: wav
Size: 293.45 MB
Description
This dataset contains high-quality single-speaker audio recordings in Central Kurdish (ckb), intended for building Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems. The dataset comprises 2 hours and 18 minutes of aligned audio and text data.
Specifics
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Considerations
Restrictions/Special Constraints
The audio data in this dataset represents the personal voice of the speaker, Aso Mahmudi. While this dataset is provided for research and development, it is strictly forbidden to use this dataset to clone, mimic, or impersonate the speaker for deceptive, malicious, or non-consensual purposes.
Forbidden Usage
By using this dataset, you agree to the following restrictions. You may not use this dataset to:
- Build or train systems that generate hate speech, discriminatory language, or content that targets groups in harmful ways.
- Conduct surveillance, intrusive monitoring, or any privacy-violating applications.
- Manipulate political discourse, influence elections, or perform political propaganda.
- Generate violent, inciting, or hateful content, or content that promotes violence and aggression.
Metadata
Central Kurdish TTS Dataset
Dataset Description
This dataset contains high-quality single-speaker audio recordings in Central Kurdish, intended for building Text-to-Speech (TTS) systems. The dataset comprises 2 hours and 18 minutes of aligned audio and text data.
Language: Central Kurdish
ISO Code: ckb
Total Duration: 2 hours, 18 minutes
Total Files: 1,653 WAV files
Script: Standard Arabic script of Kurdish
Included Letters: ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ
Included Punctuation Marks: . ، ؟ ! : ؛
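The letter and punctuation inventory above can serve as a sanity check on any text you plan to synthesize or add. The following is a minimal sketch, not part of the dataset tooling: the function name `unexpected_chars` is hypothetical, whitespace is assumed to be allowed, and the tatweel (U+0640) shown after ه in the letter list is assumed to be a display aid rather than part of the inventory.

```python
# Character inventory copied from the "Included Letters" and
# "Included Punctuation Marks" fields of this dataset card.
LETTERS_LINE = "ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ"
PUNCTUATION_LINE = ". ، ؟ ! : ؛"

# Assumption: the tatweel after "ه" only helps the letter render on its own,
# so it is excluded from the allowed set along with the separating spaces.
TATWEEL = "\u0640"
ALLOWED = (set(LETTERS_LINE) | set(PUNCTUATION_LINE)) - {" ", TATWEEL}


def unexpected_chars(text):
    """Return the sorted characters in `text` that are outside the inventory.

    Whitespace is ignored (assumed to be allowed between words).
    """
    return sorted({ch for ch in text if not ch.isspace() and ch not in ALLOWED})


if __name__ == "__main__":
    # Latin letters and digits are not in the documented inventory,
    # so they are reported.
    print(unexpected_chars("abc 123"))
```

An empty list means every non-whitespace character is covered by the documented inventory. Digits and Latin letters are reported because the card does not list them.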
Speaker Information
The dataset features a single male speaker with a native accent from Mariwan.
Speaker Name: Aso Mahmudi
Gender: Male
Origin/Accent: Mariwan, Kurdistan
Recording Environment: Home Studio
Data Sources
The transcriptions used for the recordings are derived from a mix of classical and modern sources to ensure lexical, phonetic, and stylistic variety:
Literature: Full text of the book "Mesele-y Wijdan" by Ahmad Mukhtar Jaff (1896–1935). [49 minutes]
Web: Various texts extracted from Kurdish websites.
Quality Control: All texts have been manually reviewed to ensure they exactly match the audio recordings.
Technical Specifications
Microphone: FIFINE Studio Condenser USB Microphone
Audio Format: WAV
Sampling Rate: 22050 Hz
Bit Depth: 16-bit
Channels: Mono
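Before training, it can be worth confirming that each file matches the specifications listed above. The following is a minimal sketch using Python's standard `wave` module; the helper names `wav_info` and `matches_spec` are hypothetical, and the demo builds a silent one-second clip in memory instead of reading a real file from the dataset.

```python
import io
import wave

# Values taken from the Technical Specifications list (16-bit = 2 bytes).
EXPECTED = {"sample_rate": 22050, "sample_width": 2, "channels": 1}


def wav_info(source):
    """Read basic header information from a WAV path or file-like object."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "sample_width": wav.getsampwidth(),  # bytes per sample
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }


def matches_spec(info):
    """True if sample rate, bit depth, and channel count match EXPECTED."""
    return all(info[key] == value for key, value in EXPECTED.items())


# Demo: one second of 16-bit mono silence at 22050 Hz, written to memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(22050)
    out.writeframes(b"\x00\x00" * 22050)
buf.seek(0)

info = wav_info(buf)
print(info, matches_spec(info))
```

For real data, pass the path of a file from the audio folder to `wav_info`. Summing `duration_s` over all files gives a total that can be compared with the stated 2 hours and 18 minutes.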
Dataset Structure
The dataset consists of a folder of audio files and a metadata CSV file.
Metadata Format
The metadata.csv uses a pipe (|) delimiter.
Columns:
file_name: The name of the audio file (with or without the extension, depending on your setup).
text: The transcription in the standard Kurdish Arabic script.
Example:
file_001.wav|ئەمە نموونەیەکە بۆ تاقیکردنەوە
file_002.wav|دەقی کتێبی مەسەلەی ویژدان
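Rows in this format can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name `parse_metadata` is hypothetical, the file is assumed to be UTF-8 encoded, and a `file_name|text` header row is skipped if one happens to be present (the example above shows none).

```python
import io


def parse_metadata(lines):
    """Parse pipe-delimited lines into (file_name, text) pairs.

    Only the first "|" is treated as the delimiter, so a pipe inside
    the transcription would be preserved as part of the text.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        file_name, sep, text = line.partition("|")
        if not sep:
            raise ValueError(f"missing '|' delimiter: {line!r}")
        if file_name.strip() == "file_name":
            continue  # assumption: skip a header row if the file has one
        pairs.append((file_name.strip(), text.strip()))
    return pairs


# Demo using the two example rows from this card.
sample = io.StringIO(
    "file_001.wav|ئەمە نموونەیەکە بۆ تاقیکردنەوە\n"
    "file_002.wav|دەقی کتێبی مەسەلەی ویژدان\n"
)
pairs = parse_metadata(sample)
for file_name, text in pairs:
    print(file_name, text)
```

To read the real file, open `metadata.csv` with `encoding="utf-8"` and pass the file object to `parse_metadata`; the exact path depends on where you extracted the dataset.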