Central Kurdish TTS dataset 1.0 | Mozilla Data Collective

Specifics

Licensing

Creative Commons Attribution 4.0 International (CC-BY-4.0)

https://spdx.org/licenses/CC-BY-4.0.html

Considerations

Restrictions/Special Constraints

The audio data in this dataset represents the personal voice of the speaker, Aso Mahmudi. While this dataset is provided for research and development, it is strictly forbidden to use this dataset to clone, mimic, or impersonate the speaker for deceptive, malicious, or non-consensual purposes.

Forbidden Usage

By using this dataset, you agree to the following restrictions. You may not use this dataset to:

- Build or train systems that generate hate speech, discriminatory language, or content that targets groups in harmful ways.
- Conduct surveillance, intrusive monitoring, or any privacy-violating applications.
- Manipulate political discourse, influence elections, or perform political propaganda.
- Generate violent, inciting, or hateful content, or content that promotes violence and aggression.

Metadata

Central Kurdish TTS Dataset

Dataset Description

This dataset contains high-quality single-speaker audio recordings in Central Kurdish, intended for building Text-to-Speech (TTS) systems. The dataset comprises 2 hours and 18 minutes of aligned audio and text data.

Language: Central Kurdish
ISO Code: ckb
Total Duration: 2 hours, 18 minutes
Total Files: 1,653 WAV files
Script: Standard Kurdish Arabic script
Included Letters: ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ
Included Punctuation Marks: . ، ؟ ! : ؛
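
The letter and punctuation inventory above can be used to flag characters that fall outside the documented set, for example when adding new text or normalizing input for a TTS front end. The following is a minimal sketch, not part of the dataset: the function name `unexpected_characters` is hypothetical, and it assumes that the tatweel (U+0640) shown after "ه" is only a display aid and that a plain space is the only whitespace in the transcriptions.

```python
# Letters and punctuation copied from the dataset card.
LETTERS = "ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ"
PUNCTUATION = ". ، ؟ ! : ؛"

# Assumption: the tatweel after "ه" only forces the initial letter form
# for display and is not itself part of the alphabet.
TATWEEL = "\u0640"

# Every documented character plus the space, minus the tatweel.
ALLOWED = (set(LETTERS) | set(PUNCTUATION) | {" "}) - {TATWEEL}


def unexpected_characters(text: str) -> set:
    """Return the characters in `text` that are not in the documented set."""
    return set(text) - ALLOWED


# A string built from the documented letters yields no unexpected characters.
in_alphabet = unexpected_characters(LETTERS.replace(TATWEEL, ""))

# Latin letters are outside the documented set and are reported.
latin = unexpected_characters("wav")
```

Real text may contain characters that the card does not list (digits or a zero-width non-joiner, for instance), so a non-empty result is a prompt for inspection rather than proof of an error.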

Speaker Information

The dataset features a single male speaker with a native accent from Mariwan.

Speaker Name: Aso Mahmudi
Gender: Male
Origin/Accent: Mariwan, Kurdistan
Recording Environment: Home Studio

Data Sources

The transcriptions used for the recordings are derived from a mix of classical and modern sources to ensure lexical, phonetic, and stylistic variety:

Literature: Full text of the book "Mesele-y Wijdan" by Ahmad Mukhtar Jaff (1896–1935). [49 minutes]
Web: Various texts extracted from Kurdish websites.

Quality Control: All texts have been manually reviewed to ensure they exactly match the audio recordings.

Technical Specifications

Microphone: FIFINE Studio Condenser USB Microphone
Audio Format: WAV
Sampling Rate: 22050 Hz
Bit Depth: 16-bit
Channels: Mono
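
To confirm that a file matches the specifications listed above, Python's standard `wave` module can read the header. This is a sketch under stated assumptions: the helper names `wav_properties` and `matches_spec` are hypothetical, and the example checks a synthetic one-second silent clip built in memory rather than a real file from the dataset.

```python
import io
import wave

# Values taken from the technical specifications (16-bit = 2 bytes per sample).
EXPECTED = {"sample_rate": 22050, "sample_width_bytes": 2, "channels": 1}


def wav_properties(source) -> dict:
    """Read basic header properties from a WAV path or file-like object."""
    with wave.open(source, "rb") as wav:
        frame_rate = wav.getframerate()
        return {
            "sample_rate": frame_rate,
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
            "duration_seconds": wav.getnframes() / frame_rate,
        }


def matches_spec(props: dict) -> bool:
    """Check the properties against the documented format."""
    return all(props[key] == value for key, value in EXPECTED.items())


# Build a synthetic one-second silent clip in memory for demonstration.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(22050)
    out.writeframes(b"\x00\x00" * 22050)
buffer.seek(0)

props = wav_properties(buffer)
ok = matches_spec(props)
```

For a real file, pass its path to `wav_properties` instead of the in-memory buffer.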

Dataset Structure

The dataset consists of a folder of audio files and a metadata CSV file.

Metadata Format

The metadata.csv uses a pipe (|) delimiter.

Columns:

file_name: The name of the audio file, with or without the extension depending on your setup (the example below includes the .wav extension).
text: The transcription in the standard Kurdish Arabic script.

Example:

file_001.wav|ئەمە نموونەیەکە بۆ تاقیکردنەوە
file_002.wav|دەقی کتێبی مەسەلەی ویژدان
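
The pipe-delimited format above can be read with Python's standard `csv` module. This is a minimal sketch: the function name `read_metadata` is hypothetical, and it assumes the file is UTF-8 encoded and has no header row, as in the example. If the actual file starts with a `file_name|text` header, that row will appear as the first pair and should be skipped.

```python
import csv
import io
from typing import Iterable, List, Tuple


def read_metadata(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse pipe-delimited rows into (file_name, text) pairs."""
    # QUOTE_NONE keeps any quotation marks in the transcription as literal text.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Rejoin in case a transcription itself contains a pipe character.
        file_name, text = row[0], "|".join(row[1:])
        pairs.append((file_name.strip(), text.strip()))
    return pairs


# The two example rows from the dataset card, held in memory.
sample = io.StringIO(
    "file_001.wav|ئەمە نموونەیەکە بۆ تاقیکردنەوە\n"
    "file_002.wav|دەقی کتێبی مەسەلەی ویژدان\n"
)
pairs = read_metadata(sample)
```

To read the real file, replace the in-memory sample with `open("metadata.csv", encoding="utf-8", newline="")`; the path is an assumption about where the file sits relative to your script.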