Hausa-TTS-Dataset

License: NOODL-1.0

Steward: Institute of African Digital Humanities

Task: TTS

Release Date: 4/7/2026

Format: MP3, TSV

Size: 276.90 MB


Description

This dataset comprises audio recordings of Hausa speech aligned with textual transcriptions. It is organised into 19 folders, each containing audio files and a tab-separated mapping file that pairs every audio file with its transcription. The clips are short, typically 1 to 23 seconds long, and are suitable for training and evaluating Text-to-Speech (TTS) systems. The text originates from a variety of written Hausa sources, including encyclopaedic and informational texts, which were segmented into short utterances suitable for read speech and TTS modelling.

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

  • For research and scientific use only

  • You agree not to re-host or redistribute this dataset

Forbidden Usage

You agree not to use the data for:

  • Generative AI

  • Voice cloning or speaker imitation

  • Reproduction, duplication, modification, or redistribution

  • Commercial use without explicit permission

Processes

Intended Use

This dataset is intended for the training and evaluation of Text-to-Speech (TTS) systems for the Hausa language. It aims to support:

  • Language technology development for one of Africa's most widely spoken languages

  • Development of speech technologies for under-served African language communities

  • Educational applications in multilingual contexts

  • Research in low-resource and African language speech synthesis

Metadata

Language

Hausa (native name: Hausa / هَوْسَ; Halshen Hausa / هَرْشَن هَوْسَ) is a West Chadic language of the Afro-Asiatic language family. It is the most widely spoken language in the Chadic branch and one of the largest languages on the African continent. Hausa is spoken as a first language by approximately 50–80 million people, and as a second or trade language by an additional 20–50 million, with total speaker estimates ranging from 72 million to 150 million depending on the source. It is the majority language of much of northern Nigeria and the neighbouring Republic of Niger, and serves as an important lingua franca for trade and commerce across West and Central Africa. Hausa is also spoken in communities across Cameroon, Chad, Ghana, Togo, Benin, Sudan, and Côte d'Ivoire, as well as in diaspora communities in Europe and North America.

Hausa is recognized as an indigenous national language in the constitutions of both Nigeria and Niger. In Nigeria, it is the de facto official language of several northern states, used in education, the media, and public life. In Niger, it was declared an official language in 2025.

Variants

Hausa presents a wide degree of uniformity across the regions where it is spoken, and all dialects are mutually intelligible. However, linguists have identified a number of geographically distinct dialect areas, broadly grouped into Eastern, Western, Northern, and Southern varieties.

Eastern Hausa dialects include:

  • Kananci, spoken in Kano and surrounding areas — the basis of Standard Hausa

  • Dauranci, spoken in Daura

  • Bausanci (Gudduranci), spoken in Bauchi and Katagum

  • Hadejanci, spoken in Hadejiya

Western Hausa dialects include:

  • Sakkwatanci, spoken in Sokoto — also known as Classical Hausa, used in a rich tradition of classical Hausa literature

  • Katsinanci, spoken in Katsina — transitional between Eastern and Western dialects

  • Kabanci, spoken in Kebbi

  • Zanhwaranci, spoken in Zamfara

  • Gobiranci, spoken in Gobir

  • Adaranci, spoken in Ader (Niger)

  • Arewanci, spoken in Dogondoutchi (Niger)

  • Kurfayanci, spoken in Kourfeye (Niger)

  • Damagaranci, spoken in Damagaram/Zinder (Niger)

  • Tibiranci, spoken in Maradi (Niger)

Southern Hausa dialects include:

  • Zazzaganci, spoken in Zaria and the Zazzau region — a major southern dialect that has contributed to innovations in written and spoken Hausa

Other dialects:

  • Gaananci, spoken in Ghana, Togo, and Mali — a distinct western native Hausa dialect bloc, characterized by the use of /c/ for /ky/ and /j/ for /gy/

The variant represented in this dataset is Kananci, the dialect of Kano, the largest commercial city in the Hausa-speaking world. Kananci is the basis of Standard Hausa (Daidaitacciyar Hausa), which is used in nearly all printed materials including newspapers, school textbooks, and broadcast media. International broadcasters such as the BBC, Deutsche Welle, Radio France Internationale, and Voice of America use Standard Hausa (Kananci/Dauranci) for their Hausa services.

A key phonological feature that distinguishes Kananci from other dialects is the pronunciation of /u/ in words such as sauka ('descend') and zauna ('sit down'), where other dialects (particularly Western Hausa) retain the consonants /b/, /f/, or /m/ before another consonant (sabka, zamna). Kananci also consistently maintains gender distinctions in all nouns, marking masculine nouns with ne and feminine nouns with ce.

Writing System

The Hausa language has two writing systems: Boko (the standard romanized script) and Ajami (an Arabic-derived script).

1. Boko (Romanized script)

Boko was developed during the 19th century and standardized through the Pan-Nigerian Alphabet in the 1980s. It is the dominant writing system for Hausa today, used in education, media, and official communication. The dataset transcriptions are written in Boko.

Vowels

Hausa has five vowel qualities, each occurring in short and long variants, giving 10 monophthongs. Long vowels are marked with a macron in some linguistic notation and with double letters in other scientific conventions; the standard orthography does not distinguish them from short vowels. There are also diphthongs: /ai/, /au/, and /ui/.

Short vowels: a, e, i, o, u

Long vowels: aa, ee, ii, oo, uu (or ā, ē, ī, ō, ū in linguistic notation)

Vowel length is phonemically contrastive (e.g., baki 'mouth' vs. baaki 'black').

Consonants

Hausa has 32 consonants. The standard Boko orthography includes several special characters to represent sounds not found in English. Key features include:

  • Implosives: ɓ (hooked b), ɗ (hooked d) — produced with inward airflow

  • Ejective: ƙ (hooked k) — produced with outward ejection of air

  • Glottal stop: represented by an apostrophe (') in standard orthography

  • Semi-vowel with laryngealisation: ʼy (in Nigeria) or ƴ (in Niger)

  • Palatal digraph: ky, gy (standard Hausa/Kano) vs. ch, j (Ghanaian Hausa)

  • Labialized consonants: kw, gw (produced with lip rounding)

  • Rhotic distinction: two r sounds — a flap [ɽ] (written r) and a trill [r̃] — though this distinction is not marked in the standard orthography and not made by all speakers

  • Digraphs: sh, ts, gy, ky (Nigeria); and ch (Niger/Ghana)

The full consonant inventory in Boko (Nigerian standard) includes: b, ɓ, d, ɗ, f, g, gy, h, j, k, ky, kw, l, m, n, r, s, sh, t, ts, w, y, ʼy, z as well as their ejective and labialized variants.
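Because several Boko consonants are written with two letters, splitting a transcription into single characters does not give one symbol per sound. The following is a minimal sketch of a greedy grapheme splitter built only from the multi-letter graphemes named above. The function name `boko_graphemes` is hypothetical, and accepting both the ASCII apostrophe and the modifier-letter apostrophe before y is an assumption about how input text may be encoded.

```python
# Multi-letter graphemes named in the consonant description above.
# Both apostrophe forms before "y" are accepted; this is an assumption
# about input encoding, not something the dataset card specifies.
MULTIGRAPHS = ("sh", "ts", "ky", "gy", "kw", "gw", "\u02bcy", "'y")


def boko_graphemes(word: str) -> list[str]:
    """Split a Boko word into graphemes, preferring two-letter units."""
    text = word.lower()
    graphemes = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in MULTIGRAPHS:
            graphemes.append(pair)
            i += 2
        else:
            graphemes.append(text[i])
            i += 1
    return graphemes


print(boko_graphemes("shahararre"))
print(boko_graphemes("kwana"))
```

Greedy matching is a simplification: it treats every listed letter pair as one unit, even where the two letters might belong to different syllables, so the output is best checked against a pronunciation lexicon before it is used for phoneme-level modelling.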

Tone System

Hausa is a tonal language with three tones:

  • High tone (H): left unmarked in standard orthography; marked with acute accent in linguistic notation (á, é, ó, etc.)

  • Low tone (L): marked with a grave accent in linguistic notation (à, è, ò, etc.)

  • Falling tone (HL): a combination of high and low on a single syllable; marked with a circumflex in linguistic notation (â, ê, ô, etc.)

Tone is phonemically contrastive and grammatically significant. For example:

  • Bàaba (LH) = Father

  • Baabà (HL) = Mother

  • Baabaa (HH) = Indigo

In the standard Boko orthography, tone is not marked. This is also the case in the transcriptions contained in this dataset. In linguistic, pedagogical, and scientific works, tone is typically indicated using diacritics.
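When tone-marked text from linguistic or pedagogical sources is combined with this dataset, the tone diacritics need to be removed to match the unmarked transcriptions. The sketch below shows one way to do this with Unicode normalisation from the Python standard library. The function name `strip_tone_marks` is hypothetical, and the choice to remove only the grave, acute, and circumflex accents is an assumption based on the three marks listed above.

```python
import unicodedata

# Combining grave (low), acute (high), and circumflex (falling) accents.
TONE_MARKS = {"\u0300", "\u0301", "\u0302"}


def strip_tone_marks(text: str) -> str:
    """Remove tone diacritics while keeping base letters such as ɓ, ɗ, ƙ."""
    # NFD separates each accented vowel into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # NFC recomposes any remaining combining sequences.
    return unicodedata.normalize("NFC", kept)


print(strip_tone_marks("Bàaba"))
print(strip_tone_marks("Baabà"))
```

The hooked letters ɓ, ɗ, and ƙ are single code points without accent decompositions, so they pass through unchanged. Doubled vowels that mark length are also left as they are; mapping them to the single vowels of the standard orthography would be a separate step.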

Syllable Structure

Hausa has three syllable types: CV (light), CVV (where VV is a long vowel or diphthong), and CVC (heavy). Consonant clusters do not occur within a syllable, but may appear at syllable boundaries. Gemination (consonant doubling) is a distinctive phonological feature.

2. Ajami (Arabic-derived script)

Hausa has been written in Ajami since at least the early 17th century. The earliest known Hausa text, Riwayar Nabi Musa by Abdullahi Suka, dates from the 17th century. Ajami was historically used for Islamic poetry, religious documents, and scholarly writing, and was the dominant Hausa writing system until the mid-20th century. It retains cultural and historical significance and is still used for poetry and some religious publications. There is no standardized Ajami orthography, and spelling varies across writers and regions. The Arabic script used in the Kano region follows a style sometimes referred to as Sudani Kufi or Rubutun Kano. Tone is not marked in Ajami.

The dataset transcriptions are written exclusively in Boko.

Source

The textual material in this dataset originates from written Hausa sources covering informational and encyclopaedic content, obtained from the Indigenous Blogs archive (https://indigenousblogs.com/ha/). The texts were segmented into short utterances suitable for read speech and used as prompts for audio recording sessions. The speaker recorded the utterances in the Kano dialect (Kananci).

Domain

This dataset is derived from prompted read speech. The speaker read aloud pre-written Hausa texts drawn from informational and encyclopaedic sources. The content covers a range of general topics.

The dataset has been structured as segmented, read-style speech suitable for speech synthesis tasks.

Size

The dataset is composed of 19 folders containing audio clips and corresponding mapping files amounting to 296.70 MB.

Each folder contains between 17 and 190 audio files. Individual audio clips typically range from 1 to 23 seconds in duration.

Folder-level durations range from approximately 3 minutes and 40 seconds to over 37 minutes of audio.

The dataset represents a total of 1,962 audio files with a combined duration of approximately 5 hours 25 minutes and 39 seconds of segmented Hausa speech.

A detailed breakdown of durations and file counts per folder is provided below.

Folder | Files | Duration
hausa_asr_dataset_01_98clips_1254s_20260329-1516 | 98 | 10m 53s
tts_hausa_dataset08_48clips_624s_20260331-1951 | 48 | 7m 04s
tts_hausa_dataset_01_190clips_2520s_20260405-1306 | 190 | 35m 39s
tts_hausa_dataset_02_51clips_786s_20260330-1510 | 51 | 6m 42s
tts_hausa_dataset_03_35clips_583s_20260330-1542 | 35 | 5m 19s
tts_hausa_dataset_04_63clips_769s_20260330-1617 | 63 | 8m 07s
tts_hausa_dataset_05_47clips_517s_20260331-0308 | 47 | 5m 51s
tts_hausa_dataset_05_47clips_517s_20260331-0308 2 | 47 | 5m 51s
tts_hausa_dataset_05_67clips_1029s_20260330-2100 | 67 | 9m 32s
tts_hausa_dataset_07_156clips_2073s_20260331-1850 | 156 | 21m 22s
tts_hausa_dataset_08_152clips_1864s_20260401-1849 | 152 | 19m 32s
tts_hausa_dataset_10_94clips_1152s_20260401-1954 | 94 | 13m 20s
tts_hausa_dataset_10_94clips_1152s_20260401-1954/attempts | 43 | 5m 39s
tts_hausa_dataset_11_184clips_3201s_20260402-0641 | 184 | 35m 59s
tts_hausa_dataset_12_139clips_2290s_20260402-1046 | 139 | 25m 03s
tts_hausa_dataset_13_174clips_2704s_20260402-1509 | 174 | 33m 57s
tts_hausa_dataset_14_175clips_3124s_20260403-0432 | 175 | 37m 21s
tts_hausa_dataset_15_17clips_284s_20260403-0447 | 17 | 3m 40s
tts_hausa_dataset_17_182clips_2820s_20260405-1612 | 182 | 34m 39s
GRAND TOTAL | 1,962 | 5h 25m 39s
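The grand total can be cross-checked against the per-folder rows. The sketch below copies the file counts and durations from the table above, in row order, and sums them; the names `ROWS` and `to_seconds` are illustrative only.

```python
import re

# (files, duration) pairs copied from the per-folder table, in row order.
ROWS = [
    (98, "10m 53s"), (48, "7m 04s"), (190, "35m 39s"), (51, "6m 42s"),
    (35, "5m 19s"), (63, "8m 07s"), (47, "5m 51s"), (47, "5m 51s"),
    (67, "9m 32s"), (156, "21m 22s"), (152, "19m 32s"), (94, "13m 20s"),
    (43, "5m 39s"), (184, "35m 59s"), (139, "25m 03s"), (174, "33m 57s"),
    (175, "37m 21s"), (17, "3m 40s"), (182, "34m 39s"),
]


def to_seconds(duration: str) -> int:
    """Convert a string such as '10m 53s' or '5h 25m 39s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h )?(\d+)m (\d+)s", duration)
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)


total_files = sum(files for files, _ in ROWS)
total_seconds = sum(to_seconds(duration) for _, duration in ROWS)
print(total_files, total_seconds)
```

The file counts sum to 1,962, matching the grand total. The listed durations sum to 19,530 seconds (5h 25m 30s), 9 seconds short of the stated 5h 25m 39s, a gap that is consistent with each row having been rounded to whole seconds.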

Structure

Each folder in the dataset contains:

  • A collection of audio files in MP3 format

  • A tab-separated mapping file linking each audio file to its transcription

Each line in the mapping file follows the format:

audio_filename.mp3<TAB>key<TAB>transcription<TAB>attempts

The dataset is designed for TTS pipelines requiring paired audio-text data.
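A mapping file in this layout can be read with the Python standard library. The following is a minimal sketch, assuming four tab-separated columns in the order shown above, no header row, and UTF-8 encoding; none of these details are confirmed beyond the format line. The function name `load_pairs` and the file name in the usage comment are hypothetical.

```python
import csv
from pathlib import Path


def load_pairs(mapping_path):
    """Return (audio_path, transcription) pairs from one mapping file."""
    mapping_path = Path(mapping_path)
    pairs = []
    with mapping_path.open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps apostrophes and quotation marks in the text literal.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip blank lines and anything that does not start with an MP3 name.
            if len(row) < 3 or not row[0].endswith(".mp3"):
                continue
            audio_name, transcription = row[0], row[2]
            # Audio files are assumed to sit next to the mapping file.
            pairs.append((mapping_path.parent / audio_name, transcription.strip()))
    return pairs


# Usage (hypothetical file name):
# pairs = load_pairs("tts_hausa_dataset_02_51clips_786s_20260330-1510/mapping.tsv")
```

The key and attempts columns are ignored here because their meaning is not described; a pipeline that needs them can return the full row instead.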

Sample

  1. 3dbd901747aa48d8ab24acde847e185f.mp3 | ya tabbatar a yau, su ake kira da suna

  2. ed6e5bbef21507302aafe038c79863c2.mp3 | a harshen mutanen kasar Finland. Don haka kamfanin ya dauki wannan suna shahararre,

  3. 7c6ef6e100392dfca4b0833746267121.mp3 | ya baiwa kamfaninsa. A halin yanzu Hedikwatar kamfanin na wani gari ne mai suna

  4. c238ef6507a9bfc027cb6dce010c3ebe.mp3 | Espoo, gab da birnin Helsinki.

  5. e984c99d3e2bb75febe051e40799b790.mp3 | Wannan kamfani ya ci gaba da kera takalman danko da kayayyakin lantarki,