Hawrami Kurdish TTS dataset 1.0
License: CC-BY-4.0
Steward: The University of Melbourne
Task: TTS
Release Date: 1/30/2026
Format: WAV
Size: 706.11 MB
Description
This dataset contains high-quality single-speaker audio recordings in Hawrami Kurdish (Hewrami, ISO 639-3: hac), also known as the Gorani language, intended for building Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems. The dataset comprises 5 hours and 15 minutes of aligned audio and text data. Hawrami is classified as Definitely Endangered by UNESCO.
Specifics
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Considerations
Restrictions/Special Constraints
The audio data in this dataset represents the personal voice of the speaker, Ako Marani. While this dataset is provided for research and development, it is strictly forbidden to use this dataset to clone, mimic, or impersonate the speaker for deceptive, malicious, or non-consensual purposes.
Forbidden Usage
By using this dataset, you agree to the following restrictions. You may not use this dataset to:
- Build or train systems that generate hate speech, discriminatory language, or content that targets groups in harmful ways.
- Conduct surveillance, intrusive monitoring, or any privacy-violating applications.
- Manipulate political discourse, influence elections, or perform political propaganda.
- Generate violent, inciting, or hateful content, or content that promotes violence and aggression.
Metadata
Hawrami Kurdish TTS Dataset
Dataset Description
This dataset contains high-quality single-speaker audio recordings in Hawrami Kurdish, intended for building Text-to-Speech (TTS) systems. The dataset comprises 5 hours and 15 minutes of aligned audio and text data.
Language: Hawrami Kurdish
ISO Code: hac
Total Duration: 5 hours, 15 minutes
Total Files: 2,152 WAV files
Hawrami is usually written in Arabic script. Historically, it followed the Persian writing system. More recently, after the standardisation of the Central Kurdish Arabic-based alphabet, Hawrami largely adopted this system. However, there is no consensus on how to represent Hawrami-specific phonemes. For this reason, we provide transcriptions in three variants. The first uses five special letters (ڎ، ۋ، ې، ۊ، ڼ) and is commonly used by writers, including the speaker in this dataset. The second uses four special letters (ڎ، ۋ، ؽ، ۉ). The third closely follows the standard Kurdish alphabet, with only one additional letter (ڎ).
Script: Arabic script of Hawrami
Included Letters (variant 1): ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ ڎ ۋ ې ۊ ڼ
Included Letters (variant 2): ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ ڎ ۋ ؽ ۉ
Included Letters (variant 3): ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ ڎ
Included Punctuation Marks: . ، ؟ ! :
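The letter inventories above can be used for a quick sanity check on transcriptions. The following is a minimal Python sketch, not part of the dataset's own tooling. The names `ALPHABETS`, `allowed_characters`, and `unexpected_characters` are hypothetical, and the sketch assumes that words are separated by plain spaces. Any digits or zero-width characters in the texts would be reported as unexpected.

```python
# Letter inventories copied from the three variants listed above.
ALPHABETS = {
    1: "ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ ڎ ۋ ې ۊ ڼ",
    2: "ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ ڎ ۋ ؽ ۉ",
    3: "ئ ا ب پ ت ج چ ح خ د ر ڕ ز ژ س ش ع غ ف ڤ ق ک گ ل ڵ م ن و ۆ هـ ە ی ێ ڎ",
}
PUNCTUATION = ". ، ؟ ! :"


def allowed_characters(variant):
    """Return the set of characters permitted for variant 1, 2, or 3."""
    # Spaces in the lists above are only separators, so drop them first.
    # The tatweel attached to the letter he in the lists stays in the set.
    chars = set(ALPHABETS[variant].replace(" ", ""))
    chars |= set(PUNCTUATION.replace(" ", ""))
    chars.add(" ")  # assumption: words are separated by plain spaces
    return chars


def unexpected_characters(text, allowed):
    """Return every character of `text` that is not in `allowed`."""
    return set(text) - allowed
```

For a transcription that follows variant 1, `unexpected_characters(text, allowed_characters(1))` should return an empty set.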
Speaker Information
The dataset features a single male speaker with a native accent from Tawella (Tewêle).
Speaker Name: Ako Marani
Gender: Male
Origin/Accent: Tewêle, Kurdistan
Recording Environment: Home Studio
Data Sources
The transcriptions used for the recordings are derived from the full texts of two sources written by Ako Marani:
هەنگامې کرڎەییې پەی ئاۋەڎانکەرڎەی ناوچەو هۆرامانی و ئەۋەگېڵنای زۋانی هۆرامی پەی بوارو وەنەی و ڕاۋەبەری. [1 hour and 31 minutes]
تان و پۊ وېڕاۋەبەری کۊمەڵایەتی. [3 hours and 44 minutes]
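As a consistency check, the two per-source durations should add up to the stated total of 5 hours and 15 minutes. The short Python sketch below does that arithmetic and also derives a rough average clip length from the stated 2,152 files. The variable names are illustrative, and because the durations are rounded to the minute, the average is only approximate.

```python
from datetime import timedelta

# Per-source durations as stated in the list above.
source_durations = [
    timedelta(hours=1, minutes=31),
    timedelta(hours=3, minutes=44),
]
TOTAL_FILES = 2152  # from "Total Files" in the dataset description

total = sum(source_durations, timedelta())
total_minutes = int(total.total_seconds() // 60)
hours, minutes = divmod(total_minutes, 60)

# Rough average, since the inputs are rounded to the minute.
average_clip_seconds = total.total_seconds() / TOTAL_FILES

print(f"{hours} hours, {minutes} minutes")  # 5 hours, 15 minutes
print(f"about {average_clip_seconds:.1f} seconds per file")
```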
Quality Control: All texts have been manually reviewed to ensure they exactly match the audio recordings.
Technical Specifications
Audio Format: WAV
Sampling Rate: 22050 Hz
Bit Depth: 16-bit
Channels: Mono
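The specifications above can be verified on a downloaded file with Python's standard `wave` module. This is a minimal sketch rather than an official validation script, and the names `EXPECTED`, `wav_spec`, and `matches_spec` are hypothetical. It assumes the files are uncompressed PCM WAV, which is the only kind the `wave` module reads.

```python
import wave

# Values taken from the Technical Specifications list above.
EXPECTED = {
    "sample_rate": 22050,     # Hz
    "sample_width_bytes": 2,  # 16-bit samples
    "channels": 1,            # mono
}


def wav_spec(source):
    """Read basic format fields from a WAV path (str) or binary file object."""
    with wave.open(source, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }


def matches_spec(source):
    """Return True if the file has the documented rate, bit depth, and channels."""
    return wav_spec(source) == EXPECTED
```

Calling `matches_spec("A0001.wav")` on a file from the audio folder should return `True` if the file follows the stated format.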
Dataset Structure
The dataset consists of a folder of audio files and a metadata CSV file.
Metadata Format
The metadata.csv uses a pipe (|) delimiter.
Columns:
file_name: The name of the audio file (with or without the extension, depending on your setup).
text: The transcription in the standard Kurdish Arabic script.
Example:
A0001.wav|هەنگامې کرڎەییې پەی ئاۋەڎانکەرڎەی ناوچەو هۆرامانی
B0003.wav|نۋیستەی: ئاکۆ مارانی
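Rows in this format can be read with a few lines of Python. The sketch below is one possible approach, not the dataset's official loader. The function names are hypothetical, and it assumes the file is UTF-8 encoded, has no header row, and uses the first pipe on each line as the only field separator.

```python
def parse_metadata_line(line):
    """Split one 'file_name|text' row into its two fields."""
    # Split on the first pipe only, so a pipe inside the text would be kept.
    file_name, separator, text = line.rstrip("\r\n").partition("|")
    if not separator:
        raise ValueError(f"missing '|' delimiter: {line!r}")
    return file_name, text


def load_metadata(path="metadata.csv"):
    """Read all rows, skipping blank lines. Assumes UTF-8 and no header row."""
    with open(path, encoding="utf-8") as handle:
        return [parse_metadata_line(line) for line in handle if line.strip()]
```

Each returned pair holds the audio file name and its transcription, which can then be joined with the path of the audio folder.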