Future-proofing Gbagyi: A community centered approach

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

- Use under the default (non-commercial) license is allowed only for academic, educational, or personal purposes.
- To use the dataset or its derivatives commercially, you must obtain a commercial waiver from NaijaVoices. Reach out at info@naijavoices.com.
- Any published work or product that uses the dataset must give proper attribution to the dataset creators, including the NaijaVoices community, for example by citing their paper.
- You must comply with all applicable data-protection and privacy laws when handling the dataset and metadata (e.g. the regulations that apply in the donor’s jurisdiction) and be transparent about your use.
- Use must be ethical: you cannot use the dataset in a way that perpetuates stereotypes or biases about any group or community.
- Do not use the dataset in ways that misrepresent, appropriate, or misuse cultural identities or expressions, e.g. by mis-framing cultural content for profit or manipulation.

Forbidden Usage

- You must not attempt to identify or reveal the real identities of the voice donors (speakers) in the dataset.
- Voice cloning, i.e. creating high-fidelity replicas of individual speakers, is explicitly prohibited.
- You may not use the dataset to build or train systems that generate hate speech, discriminatory language, or content that targets groups in harmful ways.
- You may not use the dataset for surveillance, intrusive monitoring, or any privacy-violating applications.
- Using the dataset to manipulate political discourse, influence elections, or perform political propaganda is forbidden.
- It is forbidden to repurpose the dataset (or derivative datasets) to create another dataset that is “substantially similar in content, structure or purpose” for commercial redistribution or sale; i.e. you cannot re-host or resell the dataset or a derived dataset commercially.
- Using the dataset to generate violent, inciting, or hateful content, or content promoting violence or aggression, is prohibited.

Metadata

Dataset Card for Future-proofing Gbagyi: A community centered approach

Overview

This dataset contains audio recordings of the Gbagyi language, a Niger-Congo language spoken primarily in Nigeria. The dataset was collected by Gamaniel Adeyemi as part of the 2025 NaijaVoices Micro-Grants Heritage project.

Dataset Statistics

Total Recordings

360 audio recordings with corresponding transcript files
Audio files are stored in the audios folder and can be located using the audio file names listed in metadata.csv.

Audio Duration

Total duration: ~7 hours 52 minutes (28,327 seconds)
Average duration per recording: ~1 minute 19 seconds (78.7 seconds)
Duration range:
- Minimum: ~1 minute 1 second (60.7 seconds)
- Maximum: ~3 minutes 10 seconds (190.4 seconds)
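
The rounded figures above follow from the second counts by integer division. A minimal sketch of that conversion is shown here; the helper name `format_duration` is illustrative and not part of the dataset, and it rounds to whole seconds before splitting into hours, minutes, and seconds.

```python
def format_duration(seconds: float) -> str:
    """Render a duration in seconds as 'Hh Mm Ss', rounded to whole seconds."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}h {minutes}m {secs}s"
    return f"{minutes}m {secs}s"


# Second counts taken from the statistics listed above.
for label, value in [("total", 28327), ("average", 78.7),
                     ("minimum", 60.7), ("maximum", 190.4)]:
    print(f"{label}: {format_duration(value)}")
```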

Speakers

12 unique speakers contributing to the dataset
Speaker distribution:
- GB1: 30 recordings (8.3%)
- GB2: 30 recordings (8.3%)
- GB3: 30 recordings (8.3%)
- GB4: 30 recordings (8.3%)
- GB5: 30 recordings (8.3%)
- GB6: 30 recordings (8.3%)
- GB7: 30 recordings (8.3%)
- GB8: 30 recordings (8.3%)
- GB9: 30 recordings (8.3%)
- GB10: 30 recordings (8.3%)
- GB11: 30 recordings (8.3%)
- GB12: 30 recordings (8.3%)

Gender Distribution

Female speakers: 180 recordings (50.0%)
Male speakers: 180 recordings (50.0%)

Age Range Distribution

18-29 years: 210 recordings (58.3%)
30-over years: 120 recordings (33.3%)
6-17 years: 30 recordings (8.3%)

Geographic and Linguistic Information

Country: All recordings are from Nigeria (360 recordings)
Language: Gbagyi (360 recordings)
Location: Chafuyi, Abuja (360 recordings)

File Structure

Each recording consists of:

Audio file (.wav format) organized in the audios folder
Corresponding transcript entry in the metadata CSV file
Metadata entry in metadata.csv with speaker information and file paths

Metadata Fields

The metadata CSV file includes the following fields:

Speaker ID: Unique identifier for each speaker (GB1 through GB12)
Audio Name: File name of the audio recording; combine it with the audios folder to locate and load the file
Transcript: Text transcript of the audio recording
Gender: Gender of the speaker (Male/Female)
Age range: Age range of the speaker (6-17, 18-29, 30-over)
Country Of recording: Country where recording was made (Nigeria)
Language: Language spoken (Gbagyi)
Location of local community: Specific location (Chafuyi, Abuja)
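
The fields above can be read with Python's standard csv module and joined to the audios folder. The sketch below rests on several assumptions that should be checked against the actual download: that metadata.csv is comma-delimited UTF-8 with a header row matching the field names listed here, and that the audios folder sits directly under the dataset root. The function names `load_metadata` and `audio_path` are hypothetical helpers, not part of the dataset.

```python
import csv
from pathlib import Path


def load_metadata(csv_path):
    """Read the metadata CSV into a list of dicts keyed by column name."""
    # Assumption: comma-delimited UTF-8 with a header row;
    # pass delimiter="\t" to DictReader if the file is tab-separated.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def audio_path(row, dataset_root):
    """Build the path to a row's .wav file inside the audios folder."""
    # Assumption: the audios folder sits directly under the dataset root.
    return Path(dataset_root) / "audios" / row["Audio Name"]


if __name__ == "__main__":
    root = Path(".")  # hypothetical location of the downloaded dataset
    metadata_file = root / "metadata.csv"
    if metadata_file.exists():
        for row in load_metadata(metadata_file)[:3]:
            print(row["Speaker ID"], audio_path(row, root),
                  row["Gender"], row["Age range"])
```

Returning plain dicts keeps the sketch dependency-free; the same rows could be loaded into a dataframe library if filtering by speaker, gender, or age range is needed.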

Metadata CSV Snapshot

Below is a sample of the metadata CSV file showing the structure and content:

Speaker ID	Audio Name	Transcript (truncated)	Gender	Age range	Country Of recording	Language	Location of local community
GB1	GB1_F_001.wav	Agbagyi je snu da yi yi ba agbagyi je snu da ba na kwolo ba na nye lolo ngwan ba bubuyi n zhyizhyi, adoho lolo nye fi e n [ n n] nsafaje...	Female	30-over	Nigeria	Gbagyi	Chafuyi, Abuja
GB1	GB1_F_030.wav	Nyagi a zhi n na bah boh zhi nye n na bah boh ye kwa gye ka n gya Amula a lu awishaka okwonu zhi nyagi a bah gnagye nge okwo nu shi...	Female	30-over	Nigeria	Gbagyi	Chafuyi, Abuja
GB2	GB2_F_030.wav	Anya n na gye ba n ha lo Gbagyi gnikwo okwo nu zhi ashaknu ashaknu na mi na mwi yin n ha lo Gbagyi gnikwo ho zhi ha gye kwo ho to zhi...	Female	6-17	Nigeria	Gbagyi	Chafuyi, Abuja
GB3	GB3_F_030.wav	Gamin a yikwo nugey Shekwoagami, anyikwoza oye nu n'na Gbagyi tu n sa'hoyi n ho bmo gni kwo tugey hoyi bmwa obyi kwo doho n Shekwo'a ga ho...	Male	18-29	Nigeria	Gbagyi	Chafuyi, Abuja

Note: Transcripts are shown truncated for display purposes. Full transcripts are available in the metadata CSV file.

Transcript Annotation

Square Brackets for Repetitions

The transcripts in this dataset use square brackets [] to capture repetitions that occur in the audio recordings. This annotation convention is used to preserve the natural speech patterns present in the spoken language.

Important Note: Approximately 28.6% of recordings (103 out of 360) contain square bracket annotations in their transcripts. These brackets indicate:

Repetitions: Words or phrases that are repeated in the audio (e.g., [n n], [a a], [ntna boh])
Disfluencies: Natural speech patterns including hesitations, corrections, or repeated sounds
Verbatim transcription: The brackets ensure that the transcript accurately reflects what was actually spoken, including all repetitions and self-corrections

Example from the dataset:

...adoho lolo nye fi e n [ n n] nsafaje toh anyikoza...
...tuko na bah zhiyin[bah zhiyin] yinya pah ko nge...
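
Depending on the task, the bracketed repetitions may need to be listed, removed, or kept as plain words. The sketch below shows one way to do each with a regular expression. It assumes annotations are never nested and always use a matching pair of square brackets; the function names are illustrative and not part of the dataset.

```python
import re

# Matches one [...] annotation and captures the text inside it.
BRACKETED = re.compile(r"\[([^\]]*)\]")


def extract_repetitions(transcript):
    """Return the text inside each [...] annotation, whitespace-trimmed."""
    return [match.strip() for match in BRACKETED.findall(transcript)]


def strip_repetitions(transcript):
    """Drop the bracketed annotations and normalise spacing."""
    return " ".join(BRACKETED.sub(" ", transcript).split())


def unwrap_repetitions(transcript):
    """Keep the repeated words but remove the bracket characters."""
    return " ".join(BRACKETED.sub(r" \1 ", transcript).split())


sample = "tuko na bah zhiyin[bah zhiyin] yinya pah ko nge"
print(extract_repetitions(sample))  # ['bah zhiyin']
print(strip_repetitions(sample))    # tuko na bah zhiyin yinya pah ko nge
print(unwrap_repetitions(sample))   # tuko na bah zhiyin bah zhiyin yinya pah ko nge
```

Because the brackets mark material that was actually spoken, the unwrapped form stays closest to the audio, while the stripped form gives a cleaner reading text. Which one suits a given use is a judgment call for the user.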