Ehugbo TTS: biblical text to speech dataset in Ehugbo Language

License:

CC-BY-NC-SA-4.0

Steward:

NaijaVoices (Lanfrica Labs)

Task: TTS

Release Date: 11/27/2025

Format: WAV

Size: 437.69 MB


Description

This dataset contains audio recordings of Bible verses in Ehugbo, a dialect of Igbo (a Niger-Congo language spoken in Nigeria). It comprises 312 recordings totaling 1 hour and 30 seconds of speech.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

  • Under the default license, use is allowed only for non-commercial purposes (academic, educational, or personal).

  • To use the dataset or its derivatives commercially, you must obtain a commercial waiver from NaijaVoices. Reach out at info@naijavoices.com.

  • Any published work or product using the dataset must give proper attribution to the dataset creators, including the Nigeria Bible Translation Trust (NBTT) and the NaijaVoices community.

  • You must comply with all applicable data-protection and privacy laws when handling the dataset and metadata (e.g. the regulations relevant under the donor's jurisdiction) and be transparent about your use.

  • Use must be ethical: you cannot use the dataset in a way that perpetuates stereotypes or biases about any group or community.

  • Do not use the dataset in ways that misrepresent, appropriate, or misuse cultural identities or expressions; for example, avoid framing cultural content misleadingly for profit or manipulation.

Forbidden Usage

  • You must not attempt to identify or reveal the real identities of the voice donors (speakers) in the dataset.

  • Voice cloning, i.e. creating high-fidelity replicas of individual speakers, is explicitly prohibited.

  • You may not use the dataset to build or train systems that generate hate speech, discriminatory language, or content that targets groups in harmful ways.

  • You may not use the dataset for surveillance, intrusive monitoring, or any privacy-violating applications.

  • Using the dataset to manipulate political discourse, influence elections, or perform political propaganda is forbidden.

  • It is forbidden to repurpose the dataset (or derivative datasets) to create another dataset that is “substantially similar in content, structure or purpose” for commercial redistribution or sale; that is, you cannot re-host or resell the dataset or a derived dataset commercially.

  • Using the dataset to generate violent, inciting, or hateful content, or content promoting violence or aggression, is prohibited.

Processes

Intended Use

This dataset can be used for speech processing tasks, including text-to-speech (TTS) and automatic speech recognition (ASR).

Metadata

Dataset Card for Ehugbo TTS: biblical text to speech dataset in Ehugbo (Igbo dialect)

Overview

This dataset contains audio recordings of Bible verses in Ehugbo, a dialect of Igbo (a Niger-Congo language spoken in Nigeria). The dataset was curated by Ukachi Agnes Eze-Mbey as part of the 2025 NaijaVoices Micro-Grants Heritage project.

This dataset contains verse-by-verse Bible translations, where each recording corresponds to a specific Bible verse. The recordings are segmented at the verse level, making this dataset particularly valuable for fine-grained linguistic analysis, verse-level alignment studies, and Bible translation research. The dataset covers six books from the New Testament: Acts of Apostles, Ephesians, Galatians, John, Revelations, and Romans.

Note: This dataset includes explicit permission from the Nigeria Bible Translation Trust to use excerpts of the Ehugbo New Testament Bible, as documented in the copyright approval letter.

Dataset Statistics

Total Recordings

  • 312 audio recordings with corresponding transcript files

  • Total duration: 01:00:30 (1 hour, 0 minutes, 30 seconds)
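As a quick sanity check on the totals above, the following minimal sketch converts the stated duration to seconds and derives the mean clip length. The helper name hms_to_seconds is hypothetical and not part of the dataset; the mean is derived here from the two published figures rather than reported by the dataset authors.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


TOTAL_RECORDINGS = 312                      # from the dataset card
total_seconds = hms_to_seconds("01:00:30")  # from the dataset card

# Derived value: average length of one verse-level recording.
mean_clip_seconds = total_seconds / TOTAL_RECORDINGS

print(total_seconds, round(mean_clip_seconds, 2))  # prints: 3630 11.63
```

A mean of roughly 11.6 seconds per clip is in line with the per-row durations shown in the metadata sample.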

Speakers

  • 6 unique speakers contributing to the dataset

Gender Distribution

  • Female speakers: 221 recordings (70.8%)

  • Male speakers: 91 recordings (29.2%)

Age Distribution

  • 18 years: 62 recordings (19.9%)

  • 27 years: 72 recordings (23.1%)

  • 43 years: 51 recordings (16.3%)

  • 54 years: 36 recordings (11.5%)

  • 63 years: 59 recordings (18.9%)

  • 64 years: 32 recordings (10.3%)

Geographic and Linguistic Information

  • Country: All recordings are from Nigeria (312 recordings)

  • Language: All 312 recordings are in Ehugbo (a dialect of Igbo)

  • Note on Native Language: One speaker (Speaker 6) has Ikwere listed as their native language in the metadata, but all recordings in this dataset are in Ehugbo. The "Native Language" field in the metadata indicates the speaker's first language, not the language of the recording.

Bible Books Distribution

The dataset contains verse-level recordings from six New Testament books:

  • Acts of Apostles: 72 recordings (23.1%)

  • Ephesians: 36 recordings (11.5%)

  • Galatians: 59 recordings (18.9%)

  • John: 51 recordings (16.3%)

  • Revelations: 62 recordings (19.9%)

  • Romans: 32 recordings (10.3%)
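The per-book counts above can be cross-checked against the stated total of 312 recordings. This is a minimal sketch using only the figures listed in this card; the variable names are illustrative.

```python
# Per-book recording counts as listed in the dataset card.
book_counts = {
    "Acts of Apostles": 72,
    "Ephesians": 36,
    "Galatians": 59,
    "John": 51,
    "Revelations": 62,
    "Romans": 32,
}

total = sum(book_counts.values())

# Share of each book as a percentage, rounded to one decimal place.
shares = {book: round(100 * count / total, 1) for book, count in book_counts.items()}

print(total)
for book, share in shares.items():
    print(f"{book}: {share}%")
```

The counts sum to 312 and the rounded shares match the percentages listed above.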

File Structure

Each recording consists of:

  • Audio file (.wav format) organized by Bible book in subdirectories:

    • audios/Acts of Apostles/

    • audios/Ephesians/

    • audios/Galatians/

    • audios/John/

    • audios/Revelations/

    • audios/Romans/

  • Corresponding transcript entry in the metadata CSV file

  • Metadata entry in metadata.csv with speaker information, Bible chapter/verse references, and audio duration
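The layout above suggests that an audio file can be located from two metadata values: the book folder and the audio file name. The sketch below assumes that the CSV stores the file name without the .wav extension (as in the example "Galatians 1_1") and that the dataset root contains the audios/ directory; the function name audio_path and the root folder name "ehugbo_tts" are hypothetical.

```python
from pathlib import Path


def audio_path(dataset_root: str, book_folder: str, file_name: str) -> Path:
    """Build the expected path of a recording inside the dataset.

    Assumption: the metadata stores the file name without an extension,
    so ".wav" is appended unless it is already present.
    """
    name = file_name if file_name.lower().endswith(".wav") else f"{file_name}.wav"
    return Path(dataset_root) / "audios" / book_folder / name


# "ehugbo_tts" is a placeholder for wherever the archive was extracted.
path = audio_path("ehugbo_tts", "Galatians", "Galatians 1_1")
print(path.as_posix())  # prints: ehugbo_tts/audios/Galatians/Galatians 1_1.wav
```

Checking path.exists() for every metadata row is a cheap way to confirm that this naming assumption holds for a local copy.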

Metadata Fields

The metadata CSV file includes the following fields:

  • Pseudo ID: Unique identifier for each speaker (1 through 6)

  • Age: Age of the speaker in years

  • Gender: Gender of the speaker (Male/Female)

  • Native Language: Native/first language of the speaker (Ehugbo or Ikwere). Note: All recordings in this dataset are in Ehugbo, regardless of the speaker's native language.

  • How Fluent are you with speaking Ehugbo?: Self-reported fluency level (1-5 scale)

  • Bible Chapter Read: Bible chapter reference (e.g., "Galatians Chapter 1,2, 3 vs 1-14")

  • Audio File Name (.wav): Name of the audio file (e.g., "Galatians 1_1")

  • Transcript: Text transcript of the Bible verse in the target language

  • Book (audio folder): Bible book name used for directory organization

  • Duration: Audio duration in seconds (rounded to 3 decimal places)
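The fields above can be read with Python's standard csv module. The sketch below is illustrative only: it uses three sample rows from this card's metadata snapshot held in memory instead of the real metadata.csv, replaces the truncated transcripts with "..." rather than guessing at them, and assumes the column headers match the field names listed above exactly. The helper names parse_verse_ref and seconds_per_speaker are hypothetical, and the "<Book> <chapter>_<verse>" file-name pattern is inferred from the example "Galatians 1_1".

```python
import csv
import io
from collections import defaultdict

# Sample rows from the card's snapshot; transcripts elided as "...".
SAMPLE_CSV = '''Pseudo ID,Age,Gender,Native Language,How Fluent are you with speaking Ehugbo?,Bible Chapter Read,Audio File Name (.wav),Transcript,Book (audio folder),Duration
1,63,Male,Ehugbo,5,"Galatians Chapter 1,2, 3 vs 1-14",Galatians 1_1,...,Galatians,10.974
1,63,Male,Ehugbo,5,"Galatians Chapter 1,2, 3 vs 1-14",Galatians 1_2,...,Galatians,7.236
2,64,Male,Ehugbo,4,Romans Chapter 1,Romans 1_1,...,Romans,11.527
'''


def parse_verse_ref(file_name: str) -> tuple[str, int, int]:
    """Split a name like 'Galatians 1_1' into (book, chapter, verse).

    Assumption: the last space separates the book from 'chapter_verse'.
    """
    book, _, chapter_verse = file_name.rpartition(" ")
    chapter, verse = chapter_verse.split("_")
    return book, int(chapter), int(verse)


def seconds_per_speaker(csv_text: str) -> dict[str, float]:
    """Sum the Duration column for each Pseudo ID."""
    totals: dict[str, float] = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["Pseudo ID"]] += float(row["Duration"])
    return {speaker: round(total, 3) for speaker, total in totals.items()}


print(parse_verse_ref("Galatians 1_1"))
print(seconds_per_speaker(SAMPLE_CSV))
```

For the real file, the same functions would apply after reading metadata.csv with open(..., encoding="utf-8", newline=""); UTF-8 is assumed here because the transcripts contain characters such as "ọ" and "ụ".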

Metadata CSV Snapshot

Below is a sample of the metadata CSV file showing the structure and content:

Pseudo ID | Age | Gender | Native Language | How Fluent are you with speaking Ehugbo? | Bible Chapter Read (truncated) | Audio File Name (.wav) | Transcript (truncated) | Book (audio folder) | Duration
1 | 63 | Male | Ehugbo | 5 | Galatians Chapter 1,2, 3 vs 1-14 | Galatians 1_1 | Ọ bụ mụoni bụ Pọl, onyeozi nke Onyenwoayị, eehigi na ẹka madụ ma ọ bụgụ ụmụ madụ... | Galatians | 10.974
1 | 63 | Male | Ehugbo | 5 | Galatians Chapter 1,2, 3 vs 1-14 | Galatians 1_2 | mụa ụmụne m na ohu na ime Karaịs mụa wo nọ, na-edejeri, Nde chọchị na ohu nọ na Galeshia... | Galatians | 7.236
2 | 64 | Male | Ehugbo | 4 | Romans Chapter 1 | Romans 1_1 | Ọ bụ m Pọl, nwaọrụ Karaịs Jisọs, onye Chineke kuru ịbụ onyeozi Karaịs a họtarị izise oziọma... | Romans | 11.527
354Female