Hawrami Kurdish TTS dataset 1.0

License:

CC-BY-4.0

Steward:

The University of Melbourne

Task: TTS

Release Date: 1/30/2026

Format: WAV

Size: 706.11 MB

Description

This dataset contains high-quality single-speaker audio recordings in Hawrami Kurdish (Hewrami, ISO 639-3: hac), also known as the Gorani language, intended for building Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems. The dataset comprises 5 hours and 15 minutes of aligned audio and text data. Hawrami is classified as Definitely Endangered by UNESCO.

Specifics

Licensing

Creative Commons Attribution 4.0 International (CC-BY-4.0)

https://spdx.org/licenses/CC-BY-4.0.html

Considerations

Restrictions/Special Constraints

The audio data in this dataset represents the personal voice of the speaker, Ako Marani. While this dataset is provided for research and development, it is strictly forbidden to use this dataset to clone, mimic, or impersonate the speaker for deceptive, malicious, or non-consensual purposes.

Forbidden Usage

By using this dataset, you agree to the following restrictions. You may not use this dataset to:

- Build or train systems that generate hate speech, discriminatory language, or content that targets groups in harmful ways.
- Conduct surveillance, intrusive monitoring, or any privacy-violating applications.
- Manipulate political discourse, influence elections, or perform political propaganda.
- Generate violent, inciting, or hateful content, or content that promotes violence and aggression.

Metadata

Hawrami Kurdish TTS Dataset

Dataset Description

This dataset contains high-quality single-speaker audio recordings in Hawrami Kurdish, intended for building Text-to-Speech (TTS) systems. The dataset comprises 5 hours and 15 minutes of aligned audio and text data.

Language: Hawrami Kurdish
ISO Code: hac
Total Duration: 5 hours, 15 minutes
Total Files: 2,152 WAV files

Hawrami is usually written in Arabic script. Historically, it followed the Persian writing system. More recently, after the standardisation of the Central Kurdish Arabic-based alphabet, Hawrami largely adopted this system. However, there is no consensus on how to represent Hawrami-specific phonemes. For this reason, we provide transcriptions in three variants. The first uses the most special letters, five in total (ڎ، ۋ، ې، ۊ، ڼ), and is commonly used by writers, including the speaker in this dataset. The second adds four letters to the standard Kurdish alphabet (ڎ، ۋ، ؽ، ۉ). The third closely follows the standard Kurdish alphabet, with only one additional character (ڎ).

Script: Arabic script of Hawrami
Included Letters (variant 1): ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ ڎ ۋ ې ۊ ڼ
Included Letters (variant 2): ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ ڎ ۋ ؽ ۉ
Included Letters (variant 3): ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ ڎ
Included Punctuation Marks: . ، ؟ ! :
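
The letter inventories above can be used to flag transcriptions that contain characters outside a given variant. The following is a minimal sketch, not part of the dataset tooling. The function name unexpected_chars is hypothetical, and the code assumes that the letter listed as "هـ" appears in the text without the tatweel and that words are separated by plain spaces.

```python
# Letter inventories copied from the lists above. "هـ" is listed there with a
# tatweel; the bare letter is used here (an assumption about the text files).
BASE = "ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ ه ە ی ێ".split()

# Hawrami-specific letters added by each transcription variant.
EXTRA = {
    1: "ڎ ۋ ې ۊ ڼ".split(),
    2: "ڎ ۋ ؽ ۉ".split(),
    3: ["ڎ"],
}

PUNCTUATION = [".", "،", "؟", "!", ":"]


def unexpected_chars(text, variant):
    """Return the sorted list of characters in text that are not allowed in the variant."""
    allowed = set(BASE) | set(EXTRA[variant]) | set(PUNCTUATION) | {" "}
    return sorted({ch for ch in text if ch not in allowed})


# Latin letters are outside every variant, so all three are reported.
print(unexpected_chars("abc", 3))  # ['a', 'b', 'c']
```

An empty result means every character belongs to the chosen variant. A non-empty result may point to a different variant, stray punctuation, or invisible characters such as a zero-width non-joiner.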

Speaker Information

The dataset features a single male speaker with a native accent from Tawella (Tewêle).

Speaker Name: Ako Marani
Gender: Male
Origin/Accent: Tewêle, Kurdistan
Recording Environment: Home Studio

Data Sources

The transcriptions used for the recordings are derived from the full texts of two sources written by Ako Marani:

هەنگامې کرڎەییې پەی ئاۋەڎانکەرڎەی ناوچەو هۆرامانی و ئەۋەگېڵنای زۋانی هۆرامی پەی بوارو وەنەی و ڕاۋەبەری. [1 hour and 31 minutes]
تان و پۊ وېڕاۋەبەری کۊمەڵایەتی. [3 hours and 44 minutes]

Quality Control: All texts have been manually reviewed to ensure they exactly match the audio recordings.

Technical Specifications

Audio Format: WAV
Sampling Rate: 22050 Hz
Bit Depth: 16-bit
Channels: Mono
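
To confirm that a downloaded file matches these specifications, the WAV header can be inspected with Python's standard wave module. This is a sketch under the assumption that the files are plain PCM WAV; the helper names check_wav and duration_seconds are hypothetical.

```python
import wave

# Expected header values from the specifications above (16-bit = 2 bytes per sample).
EXPECTED = {"framerate": 22050, "sampwidth": 2, "nchannels": 1}


def check_wav(path):
    """Return {field: (found, expected)} for every header field that differs."""
    with wave.open(path, "rb") as wav:
        found = {
            "framerate": wav.getframerate(),
            "sampwidth": wav.getsampwidth(),
            "nchannels": wav.getnchannels(),
        }
    return {key: (found[key], want) for key, want in EXPECTED.items() if found[key] != want}


def duration_seconds(path):
    """Return the clip length in seconds, computed from frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

An empty dictionary from check_wav means the file matches the stated format. Summing duration_seconds over all files gives a figure that can be compared with the stated total of 5 hours and 15 minutes.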

Dataset Structure

The dataset consists of a folder of audio files and a metadata CSV file.

Metadata Format

The metadata.csv uses a pipe (|) delimiter.

Columns:

file_name: The name of the audio file. The example rows below include the .wav extension; keep or strip it depending on your setup.
text: The transcription in the standard Kurdish Arabic script.

Example:

A0001.wav|هەنگامې کرڎەییې پەی ئاۋەڎانکەرڎەی ناوچەو هۆرامانی
B0003.wav|نۋیستەی: ئاکۆ مارانی
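
Rows in this format can be read with Python's standard csv module. The sketch below makes several assumptions: the file is UTF-8 encoded, fields are not quoted, and a header row, if present, starts with file_name. The function name read_metadata and the metadata.csv path in the comment are hypothetical.

```python
import csv


def read_metadata(handle):
    """Yield (file_name, text) pairs from a pipe-delimited metadata file."""
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        if row[0] == "file_name":
            continue  # assumption: skip a header row if the file has one
        # Re-join any extra fields in case the transcription itself contains a pipe.
        yield row[0], "|".join(row[1:]).strip()


# Example usage (path is hypothetical):
# with open("metadata.csv", encoding="utf-8", newline="") as f:
#     for file_name, text in read_metadata(f):
#         print(file_name, text)
```

Disabling quoting keeps any quotation marks in a transcription as literal characters instead of treating them as CSV field delimiters.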