Hausa-TTS-Dataset
License: NOODL-1.0
Steward: Institute of African Digital Humanities
Task: TTS
Release Date: 4/7/2026
Format: MP3, TSV
Size: 276.90 MB
Description
This dataset comprises audio recordings of Hausa speech aligned with textual transcriptions. It is organized into 19 folders, each containing audio files and a tab-separated mapping file that pairs every audio file with its transcription. The audio clips are short, typically ranging from 1 to 23 seconds, and are suitable for training and evaluating Text-to-Speech (TTS) systems. The textual content originates from a variety of written sources in Hausa, including encyclopaedic and informational texts, which were segmented into short utterances suitable for read speech and TTS modelling.
Specifics
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Considerations
Restrictions/Special Constraints
- For research and scientific use only
- You agree not to re-host or redistribute this dataset
Forbidden Usage
You agree not to use the data for:
- Generative AI
- Voice cloning or speaker imitation
- Reproduction, duplication, modification, or redistribution
- Commercial use without explicit permission
Processes
Intended Use
This dataset is intended for the training and evaluation of Text-to-Speech (TTS) systems for the Hausa language. It aims to support:
- Language technology development for one of Africa's most widely spoken languages
- Development of speech technologies for under-served African language communities
- Educational applications in multilingual contexts
- Research in low-resource and African language speech synthesis
Metadata
Language
Hausa (native name: Hausa / هَوْسَ; Halshen Hausa / هَرْشَن هَوْسَ) is a West Chadic language of the Afro-Asiatic language family. It is the most widely spoken language in the Chadic branch and one of the largest languages on the African continent. Hausa is spoken as a first language by approximately 50–80 million people, and as a second or trade language by an additional 20–50 million, with total speaker estimates ranging from 72 million to 150 million depending on the source. It is the majority language of much of northern Nigeria and the neighbouring Republic of Niger, and serves as an important lingua franca for trade and commerce across West and Central Africa. Hausa is also spoken in communities across Cameroon, Chad, Ghana, Togo, Benin, Sudan, and Côte d'Ivoire, as well as in diaspora communities in Europe and North America.
Hausa is recognized as an indigenous national language in the constitutions of both Nigeria and Niger. In Nigeria, it is the de facto official language of several northern states, used in education, the media, and public life. In Niger, it was declared an official language in 2025.
Variants
Hausa presents a wide degree of uniformity across the regions where it is spoken, and all dialects are mutually intelligible. However, linguists have identified a number of geographically distinct dialect areas, broadly grouped into Eastern, Western, Northern, and Southern varieties.
Eastern Hausa dialects include:
Kananci, spoken in Kano and surrounding areas — the basis of Standard Hausa
Dauranci, spoken in Daura
Bausanci (Gudduranci), spoken in Bauchi and Katagum
Hadejanci, spoken in Hadejiya
Western Hausa dialects include:
Sakkwatanci, spoken in Sokoto — also known as Classical Hausa, used in a rich tradition of classical Hausa literature
Katsinanci, spoken in Katsina — transitional between Eastern and Western dialects
Kabanci, spoken in Kebbi
Zanhwaranci, spoken in Zamfara
Gobiranci, spoken in Gobir
Adaranci, spoken in Ader (Niger)
Arewanci, spoken in Dogondoutchi (Niger)
Kurfayanci, spoken in Kourfeye (Niger)
Damagaranci, spoken in Damagaram/Zinder (Niger)
Tibiranci, spoken in Maradi (Niger)
Southern Hausa dialects include:
Zazzaganci, spoken in Zaria and the Zazzau region — a major southern dialect that has contributed to innovations in written and spoken Hausa
Other dialects:
Gaananci, spoken in Ghana, Togo, and Mali — a distinct western native Hausa dialect bloc, characterized by the use of /c/ for /ky/ and /j/ for /gy/
The variant represented in this dataset is Kananci, the dialect of Kano, the largest commercial city in the Hausa-speaking world. Kananci is the basis of Standard Hausa (Daidaitacciyar Hausa), which is used in nearly all printed materials including newspapers, school textbooks, and broadcast media. International broadcasters such as the BBC, Deutsche Welle, Radio France Internationale, and Voice of America use Standard Hausa (Kananci/Dauranci) for their Hausa services.
Key phonological features of Kananci that distinguish it from other dialects include the pronunciation of /u/ in words such as sauka ('descend') and zauna ('sit down'), where other dialects (particularly Western Hausa) retain the consonants /b/, /f/, and /m/ before other consonants (sabka, zamna). Kananci also consistently maintains gender distinctions in all nouns, marking masculine nouns with ne and feminine nouns with ce.
Writing System
The Hausa language has two writing systems: Boko (the standard romanized script) and Ajami (an Arabic-derived script).
1. Boko (Romanized script)
Boko was developed during the 19th century and standardized through the Pan-Nigerian Alphabet in the 1980s. It is the dominant writing system for Hausa today, used in education, media, and official communication. The dataset transcriptions are written in Boko.
Vowels
Hausa has five vowel qualities, each occurring in short and long variants, giving 10 monophthongs. Long vowels are marked with a macron in linguistic notation or written as double letters in some scientific conventions; the standard orthography does not distinguish vowel length. There are also diphthongs: /ai/, /au/, and /ui/.
Short vowels: a, e, i, o, u
Long vowels: aa, ee, ii, oo, uu (or ā, ē, ī, ō, ū in linguistic notation)
Vowel length is phonemically contrastive (e.g., baki 'mouth' vs. baaki 'black').
Consonants
Hausa has 32 consonants. The standard Boko orthography includes several special characters to represent sounds not found in English. Key features include:
Implosives: ɓ (hooked b), ɗ (hooked d) — produced with inward airflow
Ejective: ƙ (hooked k) — produced with outward ejection of air
Glottal stop: represented by an apostrophe (') in standard orthography
Semi-vowel with laryngealisation: ʼy (in Nigeria) or ƴ (in Niger)
Palatal digraph: ky, gy (standard Hausa/Kano) vs. ch, j (Ghanaian Hausa)
Labialized consonants: kw, gw (produced with lip rounding)
Rhotic distinction: two r sounds — a flap [ɽ] (written r) and a trill [r̃] — though this distinction is not marked in the standard orthography and not made by all speakers
Digraphs: sh, ts, gy, ky (Nigeria); and ch (Niger/Ghana)
The full consonant inventory in Boko (Nigerian standard) includes: b, ɓ, d, ɗ, f, g, gy, h, j, k, ky, kw, l, m, n, r, s, sh, t, ts, w, y, ʼy, z as well as their ejective and labialized variants.
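Because several Boko consonants are written with two characters, code that counts or maps letters needs to treat those pairs as single units. The following is a minimal sketch of a greedy splitter, assuming the multi-character units listed above (sh, ts, gy, ky, kw, gw, and the laryngealised ʼy); the function name `boko_graphemes` is hypothetical. Greedy matching can mis-split a pair that happens to straddle a syllable boundary, so treat the output as an approximation.

```python
# Multi-character graphemes taken from the digraphs and labialized
# consonants listed in this section. Both apostrophe forms of 'y are included.
MULTIGRAPHS = ("sh", "ts", "gy", "ky", "kw", "gw", "\u02bcy", "'y")


def boko_graphemes(word: str) -> list[str]:
    """Split a Boko-script word into graphemes, keeping digraphs together."""
    lowered = word.lower()
    units = []
    i = 0
    while i < len(lowered):
        pair = lowered[i:i + 2]
        if pair in MULTIGRAPHS:
            # Two characters that together spell one consonant.
            units.append(pair)
            i += 2
        else:
            units.append(lowered[i])
            i += 1
    return units


print(boko_graphemes("shahararre"))
print(boko_graphemes("kwana"))
```

Hooked letters such as ɓ, ɗ, and ƙ are single Unicode characters, so they pass through as one unit without special handling.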
Tone System
Hausa is a tonal language with three tones:
High tone (H): left unmarked in standard orthography; marked with acute accent in linguistic notation (á, é, ó, etc.)
Low tone (L): marked with a grave accent in linguistic notation (à, è, ò, etc.)
Falling tone (HL): a combination of high and low on a single syllable; marked with a circumflex in linguistic notation (â, ê, ô, etc.)
Tone is phonemically contrastive and grammatically significant. For example:
Bàaba (LH) = Father
Baabà (HL) = Mother
Baabaa (HH) = Indigo
In the standard Boko orthography, tone is not marked. This is also the case in the transcriptions contained in this dataset. In linguistic, pedagogical, and scientific works, tone is typically indicated using diacritics.
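When tone-marked text from linguistic or pedagogical sources has to be compared with unmarked transcriptions like the ones in this dataset, the diacritics can be removed programmatically. The sketch below is one way to do that with Python's standard `unicodedata` module; the function name `strip_tone_marks` is hypothetical, and the choice to also drop the macron (vowel length) is an assumption. Doubled vowels are left untouched.

```python
import unicodedata

# Combining marks used for tone and length in linguistic notation.
TONE_AND_LENGTH_MARKS = {
    "\u0300",  # grave accent: low tone
    "\u0301",  # acute accent: high tone
    "\u0302",  # circumflex: falling tone
    "\u0304",  # macron: long vowel
}


def strip_tone_marks(text: str) -> str:
    """Remove tone and length diacritics, keeping base letters."""
    # NFD splits precomposed letters such as "à" into "a" + combining grave.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_AND_LENGTH_MARKS)
    # Recompose anything that was not a tone or length mark.
    return unicodedata.normalize("NFC", kept)


print(strip_tone_marks("Bàaba"))  # Baaba
print(strip_tone_marks("Baabà"))  # Baaba
```

The two examples show why unmarked text is ambiguous: both tone patterns collapse to the same string. The hooked letters ɓ, ɗ, and ƙ have no Unicode decomposition, so they are unaffected by this step.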
Syllable Structure
Hausa has three syllable types: CV (light), CVV (where VV is a long vowel or diphthong), and CVC (heavy). Consonant clusters do not occur within a syllable, but may appear at syllable boundaries. Gemination (consonant doubling) is a distinctive phonological feature.
2. Ajami (Arabic-derived script)
Hausa has been written in Ajami since at least the early 17th century. The earliest known Hausa text, Riwayar Nabi Musa by Abdullahi Suka, dates from the 17th century. Ajami was historically used for Islamic poetry, religious documents, and scholarly writing, and was the dominant Hausa writing system until the mid-20th century. It retains cultural and historical significance and is still used for poetry and some religious publications. There is no standardized Ajami orthography, and spelling varies across writers and regions. The Arabic script used in the Kano region follows a style sometimes referred to as Sudani Kufi or Rubutun Kano. Tone is not marked in Ajami.
The dataset transcriptions are written exclusively in Boko.
Source
The textual material in this dataset originates from written Hausa sources covering informational and encyclopaedic content, obtained from the Indigenous Blogs archive (https://indigenousblogs.com/ha/). The texts were segmented into short utterances suitable for read speech and used as prompts for audio recording sessions. The speaker recorded the utterances in the Kano dialect (Kananci).
Domain
This dataset is derived from prompted read speech. The speaker read aloud pre-written Hausa texts drawn from informational and encyclopaedic sources. The content covers a range of general topics.
The dataset has been structured as segmented, read-style speech suitable for speech synthesis tasks.
Size
The dataset is composed of 19 folders containing audio clips and corresponding mapping files amounting to 296.70 MB.
Each folder contains between 17 and 190 audio files. Individual audio clips typically range from 1 to 23 seconds in duration.
Folder-level durations range from approximately 3 minutes and 40 seconds to over 37 minutes of audio.
The dataset represents a total of 1,962 audio files with a combined duration of approximately 5 hours 25 minutes and 39 seconds of segmented Hausa speech.
A detailed breakdown of durations and file counts per folder is provided below.
| Folder | Files | Duration |
|---|---|---|
| hausa_asr_dataset_01_98clips_1254s_20260329-1516 | 98 | 10m 53s |
| tts_hausa_dataset08_48clips_624s_20260331-1951 | 48 | 7m 04s |
| tts_hausa_dataset_01_190clips_2520s_20260405-1306 | 190 | 35m 39s |
| tts_hausa_dataset_02_51clips_786s_20260330-1510 | 51 | 6m 42s |
| tts_hausa_dataset_03_35clips_583s_20260330-1542 | 35 | 5m 19s |
| tts_hausa_dataset_04_63clips_769s_20260330-1617 | 63 | 8m 07s |
| tts_hausa_dataset_05_47clips_517s_20260331-0308 | 47 | 5m 51s |
| tts_hausa_dataset_05_47clips_517s_20260331-0308 2 | 47 | 5m 51s |
| tts_hausa_dataset_05_67clips_1029s_20260330-2100 | 67 | 9m 32s |
| tts_hausa_dataset_07_156clips_2073s_20260331-1850 | 156 | 21m 22s |
| tts_hausa_dataset_08_152clips_1864s_20260401-1849 | 152 | 19m 32s |
| tts_hausa_dataset_10_94clips_1152s_20260401-1954 | 94 | 13m 20s |
| tts_hausa_dataset_10_94clips_1152s_20260401-1954/attempts | 43 | 5m 39s |
| tts_hausa_dataset_11_184clips_3201s_20260402-0641 | 184 | 35m 59s |
| tts_hausa_dataset_12_139clips_2290s_20260402-1046 | 139 | 25m 03s |
| tts_hausa_dataset_13_174clips_2704s_20260402-1509 | 174 | 33m 57s |
| tts_hausa_dataset_14_175clips_3124s_20260403-0432 | 175 | 37m 21s |
| tts_hausa_dataset_15_17clips_284s_20260403-0447 | 17 | 3m 40s |
| tts_hausa_dataset_17_182clips_2820s_20260405-1612 | 182 | 34m 39s |
| GRAND TOTAL | 1,962 | 5h 25m 39s |
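As a quick consistency check, the per-folder durations in the table can be summed and compared with the grand total. The sketch below copies the 19 duration labels from the table; the helper names `to_seconds` and `format_duration` are hypothetical. The sum comes to 5h 25m 30s, nine seconds short of the listed total, which is plausibly because each folder's duration is shown rounded to whole seconds (an assumption, not something the dataset card states).

```python
import re

# Duration column of the table above, in row order.
FOLDER_DURATIONS = [
    "10m 53s", "7m 04s", "35m 39s", "6m 42s", "5m 19s", "8m 07s",
    "5m 51s", "5m 51s", "9m 32s", "21m 22s", "19m 32s", "13m 20s",
    "5m 39s", "35m 59s", "25m 03s", "33m 57s", "37m 21s", "3m 40s",
    "34m 39s",
]


def to_seconds(label: str) -> int:
    """Convert a label such as '35m 39s' or '5h 25m 39s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h )?(\d+)m (\d+)s", label)
    if match is None:
        raise ValueError(f"unrecognised duration: {label!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def format_duration(total_seconds: int) -> str:
    """Format seconds in the same style as the table's total row."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"


total = sum(to_seconds(label) for label in FOLDER_DURATIONS)
print(format_duration(total))  # 5h 25m 30s
```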
Structure
Each folder in the dataset contains:
A collection of audio files in MP3 format
A tab-separated mapping file linking each audio file to its transcription
Each line in the mapping file follows the format:
audio_filename.mp3 <TAB> key <TAB> transcription <TAB> attempts
The dataset is designed for TTS pipelines requiring paired audio-text data.
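A mapping file in this layout can be read with Python's standard `csv` module. The sketch below assumes four tab-separated fields per line in the order given above and no header row; the class name `Utterance`, the function name `parse_mapping`, and the `key` and `attempts` values in the example are hypothetical, since the dataset card does not describe those two fields. Only the filename and transcription are taken from the sample in this document.

```python
import csv
import io
from typing import NamedTuple


class Utterance(NamedTuple):
    audio_filename: str
    key: str
    transcription: str
    attempts: str  # kept as text; its exact meaning is not documented


def parse_mapping(text: str) -> list[Utterance]:
    """Parse tab-separated mapping text into Utterance records."""
    # QUOTE_NONE keeps quotation marks in transcriptions as literal text.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        rows.append(Utterance(*fields))
    return rows


# Hypothetical line: the key "utt_0001" and attempts "1" are placeholders.
example = "c238ef6507a9bfc027cb6dce010c3ebe.mp3\tutt_0001\tEspoo, gab da birnin Helsinki.\t1\n"
for utterance in parse_mapping(example):
    print(utterance.audio_filename, "->", utterance.transcription)
```

For a real folder, the same function can be applied to the file's contents read with `encoding="utf-8"`, so that hooked letters in the transcriptions are preserved.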
Sample
3dbd901747aa48d8ab24acde847e185f.mp3 | ya tabbatar a yau, su ake kira da suna
ed6e5bbef21507302aafe038c79863c2.mp3 | a harshen mutanen kasar Finland. Don haka kamfanin ya dauki wannan suna shahararre,
7c6ef6e100392dfca4b0833746267121.mp3 | ya baiwa kamfaninsa. A halin yanzu Hedikwatar kamfanin na wani gari ne mai suna
c238ef6507a9bfc027cb6dce010c3ebe.mp3 | Espoo, gab da birnin Helsinki.
e984c99d3e2bb75febe051e40799b790.mp3 | Wannan kamfani ya ci gaba da kera takalman danko da kayayyakin lantarki,