Future-proofing Gbagyi: A community centered approach

License: CC-BY-NC-SA-4.0

Steward: NaijaVoices (Lanfrica Labs)

Task: NLP

Release Date: 12/5/2025

Format: WAV

Size: 18.88 GB


Description

This dataset contains 360 audio recordings of the Gbagyi language, totaling approximately 7 hours 52 minutes of speech, with paired transcripts.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

  • Under the default license, use is permitted only for non-commercial purposes (academic, educational, or personal).

  • To use the dataset or its derivatives commercially, you must obtain a commercial waiver from NaijaVoices. Reach out at info@naijavoices.com.

  • Any published work or product using the dataset must give proper attribution to the dataset creators, including the NaijaVoices community (for example, by citing their paper).

  • You must comply with all applicable data-protection and privacy laws when handling the dataset and metadata (for example, the regulations that apply in the donor's jurisdiction) and be transparent about your use.

  • Use must be ethical: you may not use the dataset in a way that perpetuates stereotypes or biases about any group or community.

  • Do not use the dataset in ways that misrepresent, appropriate, or misuse cultural identities or expressions (for example, by misframing cultural content for profit or manipulation).

Forbidden Usage

  • You must not attempt to identify or reveal the real identities of the voice donors (speakers) in the dataset.

  • Voice cloning, that is, creating high-fidelity replicas of individual speakers, is explicitly prohibited.

  • You may not use the dataset to build or train systems that generate hate speech, discriminatory language, or content that targets groups in harmful ways.

  • You may not use the dataset for surveillance, intrusive monitoring, or any privacy-violating applications.

  • Using the dataset to manipulate political discourse, influence elections, or produce political propaganda is forbidden.

  • It is forbidden to repurpose the dataset (or derivative datasets) to create another dataset that is “substantially similar in content, structure or purpose” for commercial redistribution or sale. In other words, you may not re-host or resell the dataset or a derived dataset commercially.

  • Using the dataset to generate violent, inciting, or hateful content, or content promoting violence or aggression, is prohibited.

Metadata

Dataset Card for Future-proofing Gbagyi: A community centered approach

Overview

This dataset contains audio recordings of the Gbagyi language, a Niger-Congo language spoken primarily in Nigeria. The dataset was collected by Gamaniel Adeyemi as part of the 2025 NaijaVoices Micro-Grants Heritage project.

Dataset Statistics

Total Recordings

  • 360 audio recordings with corresponding transcript files

  • Audio files are stored in the audios folder and can be located using the audio file names listed in metadata.csv.

Audio Duration

  • Total duration: ~7 hours 52 minutes (28,327 seconds)

  • Average duration per recording: ~1 minute 19 seconds (78.7 seconds)

  • Duration range:

    • Minimum: ~1 minute 1 second (60.7 seconds)

    • Maximum: ~3 minutes 10 seconds (190.4 seconds)
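The duration figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the files are uncompressed PCM WAV (which is what Python's standard `wave` module reads), and the helper names `wav_duration_seconds` and `format_duration` are hypothetical, not part of the dataset.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def format_duration(seconds):
    """Format a duration as 'Hh Mm Ss', rounded to the nearest second."""
    total = round(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes}m {secs}s"


# Figures quoted in this card.
TOTAL_SECONDS = 28_327
N_RECORDINGS = 360

print(format_duration(TOTAL_SECONDS))          # 7h 52m 7s
print(round(TOTAL_SECONDS / N_RECORDINGS, 1))  # 78.7
```

Summing `wav_duration_seconds` over every file in the audios folder would be one way to reproduce the total independently of the card.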

Speakers

  • 12 unique speakers contributing to the dataset

  • Speaker distribution:

    • GB1: 30 recordings (8.3%)

    • GB2: 30 recordings (8.3%)

    • GB3: 30 recordings (8.3%)

    • GB4: 30 recordings (8.3%)

    • GB5: 30 recordings (8.3%)

    • GB6: 30 recordings (8.3%)

    • GB7: 30 recordings (8.3%)

    • GB8: 30 recordings (8.3%)

    • GB9: 30 recordings (8.3%)

    • GB10: 30 recordings (8.3%)

    • GB11: 30 recordings (8.3%)

    • GB12: 30 recordings (8.3%)

Gender Distribution

  • Female speakers: 180 recordings (50.0%)

  • Male speakers: 180 recordings (50.0%)

Age Range Distribution

  • 18-29 years: 210 recordings (58.3%)

  • 30-over years: 120 recordings (33.3%)

  • 6-17 years: 30 recordings (8.3%)

Geographic and Linguistic Information

  • Country: All recordings are from Nigeria (360 recordings)

  • Language: Gbagyi (360 recordings)

  • Location: Chafuyi, Abuja (360 recordings)

File Structure

Each recording consists of:

  • Audio file (.wav format) organized in the audios folder

  • Corresponding transcript entry in the metadata CSV file

  • Metadata entry in metadata.csv with speaker information and file paths
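As a minimal sketch of how this layout might be consumed, the snippet below reads metadata.csv and attaches a resolved audio path to each row. It assumes that metadata.csv sits in the dataset root next to the audios folder and that the column header is spelled exactly "Audio Name"; the function names and the `dataset_root` argument are hypothetical.

```python
import csv
from pathlib import Path


def load_records(dataset_root, metadata_name="metadata.csv", audio_dir="audios"):
    """Read the metadata CSV and add an 'audio_path' entry to every row.

    Assumes the CSV lives directly under dataset_root (an assumption, not
    something the card states) and has an 'Audio Name' column.
    """
    root = Path(dataset_root)
    records = []
    # utf-8-sig reads plain UTF-8 and also tolerates a leading byte-order mark.
    with open(root / metadata_name, newline="", encoding="utf-8-sig") as fh:
        for row in csv.DictReader(fh):
            record = dict(row)
            record["audio_path"] = root / audio_dir / record["Audio Name"]
            records.append(record)
    return records


def missing_audio(records):
    """Return the audio names whose files are not present on disk."""
    return [r["Audio Name"] for r in records if not r["audio_path"].is_file()]
```

Calling `missing_audio(load_records("path/to/dataset"))` on a complete download should return an empty list; a non-empty result points to files that were not extracted or were renamed.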

Metadata Fields

The metadata CSV file includes the following fields:

  • Speaker ID: Unique identifier for each speaker (GB1 through GB12)

  • Audio Name: Name of the audio file. Combine this with the audios folder, where the files are stored, to load the audio

  • Transcript: Text transcript of the audio recording

  • Gender: Gender of the speaker (Male/Female)

  • Age range: Age range of the speaker (6-17, 18-29, 30-over)

  • Country Of recording: Country where recording was made (Nigeria)

  • Language: Language spoken (Gbagyi)

  • Location of local community: Specific location (Chafuyi, Abuja)
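The fields above are enough to recompute the distributions reported under Dataset Statistics. The following sketch tallies any field over a list of row dictionaries; the `distribution` helper is hypothetical, and the demo rows are synthetic stand-ins built from the age-range counts quoted in this card, not rows read from the real file.

```python
from collections import Counter


def distribution(rows, field):
    """Map each value of `field` to (count, percentage), most frequent first."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    return {
        value: (n, round(100 * n / total, 1))
        for value, n in counts.most_common()
    }


# Synthetic rows mirroring the age-range counts stated in the card.
rows = (
    [{"Age range": "18-29"}] * 210
    + [{"Age range": "30-over"}] * 120
    + [{"Age range": "6-17"}] * 30
)

print(distribution(rows, "Age range"))
# {'18-29': (210, 58.3), '30-over': (120, 33.3), '6-17': (30, 8.3)}
```

The same call with "Speaker ID" or "Gender" on rows loaded from metadata.csv would give the speaker and gender breakdowns.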

Metadata CSV Snapshot

Below is a sample of the metadata CSV file showing the structure and content:

| Speaker ID | Audio Name | Transcript (truncated) | Gender | Age range | Country Of recording | Language | Location of local community |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GB1 | GB1_F_001.wav | Agbagyi je snu da yi yi ba agbagyi je snu da ba na kwolo ba na nye lolo ngwan ba bubuyi n zhyizhyi, adoho lolo nye fi e n [ n n] nsafaje... | Female | 30-over | Nigeria | Gbagyi | Chafuyi, Abuja |
| GB1 | GB1_F_030.wav | Nyagi a zhi n na bah boh zhi nye n na bah boh ye kwa gye ka n gya Amula a lu awishaka okwonu zhi nyagi a bah gnagye nge okwo nu shi... | Female | 30-over | Nigeria | Gbagyi | Chafuyi, Abuja |
| GB2 | GB2_F_030.wav | Anya n na gye ba n ha lo Gbagyi gnikwo okwo nu zhi ashaknu ashaknu na mi na mwi yin n ha lo Gbagyi gnikwo ho zhi ha gye kwo ho to zhi... | Female | 6-17 | Nigeria | Gbagyi | Chafuyi, Abuja |
| GB3 | GB3_F_030.wav | Gamin a yikwo nugey Shekwoagami, anyikwoza oye nu n'na Gbagyi tu n sa'hoyi n ho bmo gni kwo tugey hoyi bmwa obyi kwo doho n Shekwo'a ga ho... | Male | 18-29 | Nigeria | Gbagyi | Chafuyi, Abuja |

Note: Transcripts are shown truncated for display purposes. Full transcripts are available in the metadata CSV file.

Transcript Annotation

Square Brackets for Repetitions

The transcripts in this dataset use square brackets [] to capture repetitions that occur in the audio recordings. This annotation convention is used to preserve the natural speech patterns present in the spoken language.

Important Note: Approximately 28.6% of recordings (103 out of 360) contain square bracket annotations in their transcripts. These brackets indicate:

  • Repetitions: Words or phrases that are repeated in the audio (e.g., [n n], [a a], [ntna boh])

  • Disfluencies: Natural speech patterns including hesitations, corrections, or repeated sounds

  • Verbatim transcription: The brackets ensure that the transcript accurately reflects what was actually spoken, including all repetitions and self-corrections

Example from the dataset:

...adoho lolo nye fi e n [ n n] nsafaje toh anyikoza...
...tuko na bah zhiyin[bah zhiyin] yinya pah ko nge...
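For downstream processing, the bracketed spans may need to be extracted, removed, or unwrapped. The sketch below shows one possible approach with a regular expression; it assumes brackets are never nested, and the helper names are hypothetical. Whether bracketed material should be dropped or kept depends on the task (for example, verbatim versus cleaned transcription), and the card does not prescribe a normalization. The sample strings are the two excerpts above with the leading and trailing ellipses trimmed.

```python
import re

# Matches one non-nested [...] span and captures its contents.
BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def bracketed_spans(transcript):
    """Return the text inside each pair of square brackets."""
    return [span.strip() for span in BRACKETED.findall(transcript)]


def strip_brackets(transcript):
    """Drop bracketed spans entirely and collapse extra whitespace."""
    return " ".join(BRACKETED.sub(" ", transcript).split())


def unwrap_brackets(transcript):
    """Keep the bracketed text but remove the bracket characters."""
    return " ".join(BRACKETED.sub(r" \1 ", transcript).split())


def share_with_brackets(transcripts):
    """Fraction of transcripts that contain at least one bracketed span."""
    flagged = sum(1 for t in transcripts if BRACKETED.search(t))
    return flagged / len(transcripts)


sample = "tuko na bah zhiyin[bah zhiyin] yinya pah ko nge"
print(bracketed_spans(sample))  # ['bah zhiyin']
print(strip_brackets(sample))   # tuko na bah zhiyin yinya pah ko nge
print(unwrap_brackets(sample))  # tuko na bah zhiyin bah zhiyin yinya pah ko nge
```

Applying `share_with_brackets` to the Transcript column of metadata.csv would be one way to check the roughly 28.6% figure quoted above.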