Future-proofing Gbagyi: A community centered approach
License:
CC-BY-NC-SA-4.0
Steward:
NaijaVoices (Lanfrica Labs)
Task: NLP
Release Date: 12/5/2025
Format: WAV
Size: 18.88 GB
Description
This dataset comprises 360 audio recordings of the Gbagyi language, totaling approximately 7 hours 52 minutes of speech, with paired transcripts.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
Use under the default (non-commercial) license is allowed only for academic, educational, or personal purposes.
To use the dataset (or derivatives) commercially, you must obtain a proper commercial waiver from NaijaVoices. Reach out at info@naijavoices.com.
Any published work or product using the dataset must give proper attribution to the dataset creators, including the NaijaVoices community (e.g., by citing their paper).
You must comply with all applicable data-protection and privacy laws in handling the dataset and metadata (e.g., the regulations relevant under the donor's jurisdiction) and be transparent about your use.
Use must be ethical: you cannot use the dataset in a way that perpetuates stereotypes or biases about any group or community.
Do not use the dataset in ways that misrepresent, appropriate, or misuse cultural identities or expressions (e.g., avoid mis-framing cultural content for profit or manipulation).
Forbidden Usage
You must not attempt to identify or reveal the real identities of the voice donors (speakers) in the dataset.
Voice cloning, i.e. creating high-fidelity replicas of individual speakers, is explicitly prohibited.
You may not use the dataset to build or train systems that generate hate speech, discriminatory language, or content that targets groups in harmful ways.
You may not use the dataset for surveillance, intrusive monitoring, or any privacy-violating applications.
Using the dataset to manipulate political discourse, influence elections, or perform political propaganda is forbidden.
It is forbidden to repurpose the dataset (or derivative datasets) to create another dataset that is “substantially similar in content, structure or purpose” for commercial redistribution or sale; i.e. you cannot re-host or resell the dataset or a derived dataset commercially.
Using the dataset to generate violent, inciting, or hateful content, or content promoting violence or aggression, is prohibited.
Metadata
Dataset Card for Future-proofing Gbagyi: A community centered approach
Overview
This dataset contains audio recordings of the Gbagyi language, a Niger-Congo language spoken primarily in Nigeria. The dataset was collected by Gamaniel Adeyemi as part of the 2025 NaijaVoices Micro-Grants Heritage project.
Dataset Statistics
Total Recordings
360 audio recordings with corresponding transcript files
Audio files are stored in the `audios` folder and can be retrieved by the audio file names listed in `metadata.csv`.
Audio Duration
Total duration: ~7 hours 52 minutes (28,327 seconds)
Average duration per recording: ~1 minute 19 seconds (78.7 seconds)
Duration range:
Minimum: ~1 minute 1 second (60.7 seconds)
Maximum: ~3 minutes 10 seconds (190.4 seconds)
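The duration figures above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the dataset tooling; `format_duration` is a hypothetical helper introduced here for illustration, and the constants are simply the totals quoted in this card.

```python
def format_duration(seconds: float) -> str:
    """Render a duration in seconds as hours, minutes and seconds.

    Hypothetical helper for illustration; rounds to the nearest second.
    """
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours} h {minutes} min {secs} s"
    return f"{minutes} min {secs} s"


# Figures quoted in this dataset card.
TOTAL_SECONDS = 28_327
NUM_RECORDINGS = 360

total_text = format_duration(TOTAL_SECONDS)
average_seconds = round(TOTAL_SECONDS / NUM_RECORDINGS, 1)

print(total_text)        # total duration
print(average_seconds)   # average seconds per recording
```

Under these assumptions the total and the average agree with the rounded values listed above (about 7 hours 52 minutes and 78.7 seconds).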
Speakers
12 unique speakers contributing to the dataset
Speaker distribution:
GB1: 30 recordings (8.3%)
GB2: 30 recordings (8.3%)
GB3: 30 recordings (8.3%)
GB4: 30 recordings (8.3%)
GB5: 30 recordings (8.3%)
GB6: 30 recordings (8.3%)
GB7: 30 recordings (8.3%)
GB8: 30 recordings (8.3%)
GB9: 30 recordings (8.3%)
GB10: 30 recordings (8.3%)
GB11: 30 recordings (8.3%)
GB12: 30 recordings (8.3%)
Gender Distribution
Female speakers: 180 recordings (50.0%)
Male speakers: 180 recordings (50.0%)
Age Range Distribution
18-29 years: 210 recordings (58.3%)
30-over years: 120 recordings (33.3%)
6-17 years: 30 recordings (8.3%)
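Distributions like the ones above can be recomputed from the metadata rows. The sketch below is illustrative only: `distribution` is a hypothetical helper, the field name `Age range` is assumed to match the CSV header exactly, and the sample rows are synthetic stand-ins built from the counts reported in this card rather than read from the real file.

```python
from collections import Counter


def distribution(rows, field):
    """Map each value of `field` to a (count, percentage) pair."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    return {
        value: (n, round(100.0 * n / total, 1))
        for value, n in counts.items()
    }


# Synthetic rows mirroring the age-range counts reported above.
rows = (
    [{"Age range": "18-29"}] * 210
    + [{"Age range": "30-over"}] * 120
    + [{"Age range": "6-17"}] * 30
)

age_distribution = distribution(rows, "Age range")
for value, (count, percent) in age_distribution.items():
    print(f"{value}: {count} recordings ({percent}%)")
```

The same function could be applied to the gender or speaker columns, assuming those headers are spelled as in the field list.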
Geographic and Linguistic Information
Country: All recordings are from Nigeria (360 recordings)
Language: Gbagyi (360 recordings)
Location: Chafuyi, Abuja (360 recordings)
File Structure
Each recording consists of:
Audio file (`.wav` format) organized in the `audios` folder
Corresponding transcript entry in the metadata CSV file
Metadata entry in `metadata.csv` with speaker information and file paths
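As a sanity check on individual files, the duration of a recording can be read from its WAV header. This is a minimal sketch using Python's standard `wave` module, which handles uncompressed PCM WAV files; whether every file in this dataset is PCM-encoded is an assumption, and `wav_duration_seconds` is a hypothetical helper name.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds.

    Assumes the file is uncompressed PCM; other encodings would need
    a third-party audio library instead.
    """
    with wave.open(str(path), "rb") as wav_file:
        frame_count = wav_file.getnframes()
        frame_rate = wav_file.getframerate()
    return frame_count / float(frame_rate)
```

For example, `wav_duration_seconds("audios/GB1_F_001.wav")` would return the length of the first sample listed in the metadata snapshot, assuming the folder layout described above.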
Metadata Fields
The metadata CSV file includes the following fields:
Speaker ID: Unique identifier for each speaker (GB1 through GB12)
Audio Name: Name of the audio file. Combine it with the `audios` folder path to load the audio
Transcript: Text transcript of the audio recording
Gender: Gender of the speaker (Male/Female)
Age range: Age range of the speaker (6-17, 18-29, 30-over)
Country Of recording: Country where the recording was made (Nigeria)
Language: Language spoken (Gbagyi)
Location of local community: Specific location (Chafuyi, Abuja)
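A minimal sketch of loading the metadata and resolving each audio path is shown here. It assumes the CSV header uses the field names exactly as listed above (in particular `Audio Name`), that the file is UTF-8 encoded, and that the audio files sit in a folder named `audios`; `load_metadata` and the added `audio_path` key are hypothetical names introduced for illustration.

```python
import csv
from pathlib import Path


def load_metadata(csv_path, audio_dir="audios"):
    """Read the metadata CSV and attach a resolved audio path to each row.

    Assumes a header row containing an "Audio Name" column.
    """
    rows = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Join the audio folder with the file name from the CSV.
            row["audio_path"] = str(Path(audio_dir) / row["Audio Name"])
            rows.append(row)
    return rows
```

A call such as `load_metadata("metadata.csv")` would then return one dictionary per recording, with the transcript and speaker fields alongside the path to the corresponding WAV file.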
Metadata CSV Snapshot
Below is a sample of the metadata CSV file showing the structure and content:
| Speaker ID | Audio Name | Transcript (truncated) | Gender | Age range | Country Of recording | Language | Location of local community |
|---|---|---|---|---|---|---|---|
| GB1 | GB1_F_001.wav | Agbagyi je snu da yi yi ba agbagyi je snu da ba na kwolo ba na nye lolo ngwan ba bubuyi n zhyizhyi, adoho lolo nye fi e n [ n n] nsafaje... | Female | 30-over | Nigeria | Gbagyi | Chafuyi, Abuja |
| GB1 | GB1_F_030.wav | Nyagi a zhi n na bah boh zhi nye n na bah boh ye kwa gye ka n gya Amula a lu awishaka okwonu zhi nyagi a bah gnagye nge okwo nu shi... | Female | 30-over | Nigeria | Gbagyi | Chafuyi, Abuja |
| GB2 | GB2_F_030.wav | Anya n na gye ba n ha lo Gbagyi gnikwo okwo nu zhi ashaknu ashaknu na mi na mwi yin n ha lo Gbagyi gnikwo ho zhi ha gye kwo ho to zhi... | Female | 6-17 | Nigeria | Gbagyi | Chafuyi, Abuja |
| GB3 | GB3_F_030.wav | Gamin a yikwo nugey Shekwoagami, anyikwoza oye nu n'na Gbagyi tu n sa'hoyi n ho bmo gni kwo tugey hoyi bmwa obyi kwo doho n Shekwo'a ga ho... | Male | 18-29 | Nigeria | Gbagyi | Chafuyi, Abuja |
Note: Transcripts are shown truncated for display purposes. Full transcripts are available in the metadata CSV file.
Transcript Annotation
Square Brackets for Repetitions
The transcripts in this dataset use square brackets [] to capture repetitions that occur in the audio recordings. This annotation convention is used to preserve the natural speech patterns present in the spoken language.
Important Note: Approximately 28.6% of recordings (103 out of 360) contain square bracket annotations in their transcripts. These brackets indicate:
Repetitions: Words or phrases that are repeated in the audio (e.g., `[n n]`, `[a a]`, `[ntna boh]`)
Disfluencies: Natural speech patterns including hesitations, corrections, or repeated sounds
Verbatim transcription: The brackets ensure that the transcript accurately reflects what was actually spoken, including all repetitions and self-corrections
Example from the dataset:
...adoho lolo nye fi e n [ n n] nsafaje toh anyikoza...
...tuko na bah zhiyin[bah zhiyin] yinya pah ko nge...
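Depending on the task, users may want to list the bracketed repetitions, drop them, or keep the repeated words while removing only the bracket characters. The sketch below shows one possible way to do each with a regular expression. It assumes brackets are never nested, and the function names are hypothetical. Since the card states that the bracketed material was actually spoken, unwrapping is probably the safer choice for verbatim speech recognition, while dropping may suit tasks that want fluent text.

```python
import re

# Matches one non-nested [...] span and captures its contents.
BRACKET_RE = re.compile(r"\[([^\[\]]*)\]")


def extract_repetitions(transcript):
    """Return the contents of every bracketed span, whitespace-trimmed."""
    return [match.strip() for match in BRACKET_RE.findall(transcript)]


def strip_repetitions(transcript):
    """Remove bracketed spans entirely and normalize whitespace."""
    without = BRACKET_RE.sub(" ", transcript)
    return " ".join(without.split())


def unwrap_repetitions(transcript):
    """Keep the repeated words but drop the bracket characters."""
    unwrapped = BRACKET_RE.sub(lambda m: " " + m.group(1) + " ", transcript)
    return " ".join(unwrapped.split())


example = "tuko na bah zhiyin[bah zhiyin] yinya pah ko nge"
print(extract_repetitions(example))
print(strip_repetitions(example))
print(unwrap_repetitions(example))
```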