Central Kurdish TTS dataset 1.0

License:

CC-BY-4.0

Steward:

The University of Melbourne

Task: TTS

Release Date: 12/15/2025

Format: wav

Size: 293.45 MB


Description

This dataset contains high-quality single-speaker audio recordings in Central Kurdish (ckb), intended for building Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems. The dataset comprises 2 hours and 18 minutes of aligned audio and text data.

Specifics

Licensing

Creative Commons Attribution 4.0 International (CC-BY-4.0)

https://spdx.org/licenses/CC-BY-4.0.html

Considerations

Restrictions/Special Constraints

The audio data in this dataset represents the personal voice of the speaker, Aso Mahmudi. While this dataset is provided for research and development, it is strictly forbidden to use this dataset to clone, mimic, or impersonate the speaker for deceptive, malicious, or non-consensual purposes.

Forbidden Usage

By using this dataset, you agree to the following restrictions. You may not use this dataset to:

  • Build or train systems that generate hate speech, discriminatory language, or content that targets groups in harmful ways.

  • Conduct surveillance, intrusive monitoring, or any privacy-violating applications.

  • Manipulate political discourse, influence elections, or perform political propaganda.

  • Generate violent, inciting, or hateful content, or content that promotes violence and aggression.

Metadata

Central Kurdish TTS Dataset

Dataset Description

This dataset contains high-quality single-speaker audio recordings in Central Kurdish, intended for building Text-to-Speech (TTS) systems. The dataset comprises 2 hours and 18 minutes of aligned audio and text data.

  • Language: Central Kurdish

  • ISO Code: ckb

  • Total Duration: 2 hours, 18 minutes

  • Total Files: 1,653 WAV files

  • Script: Standard Kurdish Arabic script

  • Included Letters: ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ

  • Included Punctuation Marks: . ، ؟ ! : ؛
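The letter and punctuation inventories above can be turned into a quick transcript check. The following is a minimal sketch, not part of the dataset tooling: the names `LETTERS`, `PUNCTUATION`, `ALLOWED`, and `unexpected_chars` are hypothetical, and it assumes that a plain space is permitted and that the elongation mark attached to one letter in the list should also be accepted.

```python
# Inventories copied from the "Included Letters" and
# "Included Punctuation Marks" entries of this card.
LETTERS = "ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ"
PUNCTUATION = ". ، ؟ ! : ؛"

# Assumption: a plain space is allowed between words, and every code point
# that appears in the two inventories (separator spaces removed) is allowed.
ALLOWED = set(LETTERS.replace(" ", "")) | set(PUNCTUATION.replace(" ", "")) | {" "}


def unexpected_chars(text):
    """Return the set of characters in `text` that fall outside ALLOWED."""
    return set(text) - ALLOWED
```

A non-empty result is a prompt to inspect the line, not proof of an error: the card does not say whether digits or invisible joining characters occur in the transcripts.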

Speaker Information

The dataset features a single male speaker with a native accent from Mariwan.

  • Speaker Name: Aso Mahmudi

  • Gender: Male

  • Origin/Accent: Mariwan, Kurdistan

  • Recording Environment: Home Studio

Data Sources

The transcriptions used for the recordings are derived from a mix of classical and modern sources to ensure lexical, phonetic, and stylistic variety:

  1. Literature: Full text of the book "Mesele-y Wijdan" by Ahmad Mukhtar Jaff (1896–1935). [49 minutes]

  2. Web: Various texts extracted from Kurdish websites.

Quality Control: All texts have been manually reviewed to ensure they exactly match the audio recordings.

Technical Specifications

  • Microphone: FIFINE Studio Condenser USB Microphone

  • Audio Format: WAV

  • Sampling Rate: 22050 Hz

  • Bit Depth: 16-bit

  • Channels: Mono
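The audio specifications above can be verified per file with Python's standard `wave` module. This is a minimal sketch under stated assumptions: `EXPECTED` and `check_wav` are hypothetical names, and the files are assumed to be plain PCM WAV, which is the only encoding the `wave` module reads.

```python
import wave

# Expected values taken from the Technical Specifications list:
# mono, 16-bit (2 bytes per sample), 22050 Hz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "frame_rate_hz": 22050}


def check_wav(source):
    """Return {field: (expected, actual)} for every field that does not match.

    `source` may be a file path or a binary file-like object.
    An empty dict means the file matches the listed specifications.
    """
    with wave.open(source, "rb") as wav:
        actual = {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": wav.getframerate(),
        }
    return {
        field: (EXPECTED[field], value)
        for field, value in actual.items()
        if value != EXPECTED[field]
    }
```

Calling `check_wav` on each file before training catches resampled or stereo copies early, since many TTS pipelines assume one fixed sampling rate.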

Dataset Structure

The dataset consists of a folder of audio files and a metadata CSV file.

Metadata Format

The metadata.csv uses a pipe (|) delimiter.

Columns:

  1. file_name: The name of the audio file, with or without the extension depending on your setup (the example below includes it).

  2. text: The transcription in the standard Kurdish Arabic script.

Example:

file_001.wav|ئەمە نموونەیەکە بۆ تاقیکردنەوە
file_002.wav|دەقی کتێبی مەسەلەی ویژدان
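Lines in this format can be read with a few lines of Python. The sketch below is one possible approach, not the dataset's official loader: `parse_metadata` and `load_metadata` are hypothetical names, and it assumes the file is UTF-8 encoded and has no header row, as in the example above. It splits on the first pipe only, so a pipe inside the transcription would be kept as text.

```python
def parse_metadata(lines):
    """Turn pipe-delimited lines into a list of (file_name, text) tuples."""
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines, e.g. a trailing newline at end of file
        file_name, sep, text = line.partition("|")
        if not sep:
            raise ValueError(f"missing '|' delimiter: {line!r}")
        rows.append((file_name, text))
    return rows


def load_metadata(path):
    """Read a metadata file from disk (assumed UTF-8, no header row)."""
    with open(path, encoding="utf-8") as handle:
        return parse_metadata(handle)
```

For example, `load_metadata("metadata.csv")` would return one tuple per recording, ready to be paired with the audio folder.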