Speech Data Collection for The Nupe Language
License: CC-BY-NC-SA-4.0
Steward: NaijaVoices (Lanfrica Labs)
Task: NLP
Release Date: 11/27/2025
Format: WAV, TXT
Size: 1.58 GB
Description
This dataset contains audio recordings of the Nupe language. It features 1,583 audio recordings comprising 2 hours, 40 minutes, and 32 seconds of speech data, with paired transcripts. The recordings feature 8 unique speakers representing three distinct Nupe accent varieties: Bida accent, Kutigi accent, and Lapai accent.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
Use under the default (non-commercial) license is allowed only for academic, educational, or personal purposes.
To use the dataset (or derivatives) commercially, you must obtain a proper commercial waiver from NaijaVoices. Reach out at info@naijavoices.com.
Any published work or product using the dataset must give proper attribution to the dataset creators, including the NaijaVoices community, e.g., by citing their paper.
You must comply with all applicable data-protection and privacy laws in handling the dataset and metadata (e.g., the regulations relevant under the donor’s jurisdiction) and be transparent about your use.
Use must be ethical: you cannot use the dataset in a way that perpetuates stereotypes or biases about any group or community.
Do not use the dataset in ways that misrepresent, appropriate, or misuse cultural identities or expressions, e.g., by mis-framing cultural content for profit or manipulation.
Forbidden Usage
You must not attempt to identify or reveal the real identities of the voice donors (speakers) in the dataset.
Voice cloning, i.e. creating high-fidelity replicas of individual speakers, is explicitly prohibited.
You may not use the dataset to build or train systems that generate hate speech, discriminatory language, or content that targets groups in harmful ways.
You may not use the dataset for surveillance, intrusive monitoring, or any privacy-violating applications.
Using the dataset to manipulate political discourse, influence elections, or perform political propaganda is forbidden.
It is forbidden to repurpose the dataset (or derivative datasets) to create another dataset that is “substantially similar in content, structure or purpose” for commercial redistribution or sale, i.e. you cannot re-host or resell the dataset or a derived dataset commercially.
Using the dataset to generate violent, inciting, or hateful content, or content promoting violence or aggression, is prohibited.
Metadata
Dataset Card for Speech Data Collection for The Nupe Language
Overview
This dataset contains audio recordings of the Nupe language, a Niger-Congo language spoken primarily in Nigeria. The dataset was collected by Umar Baba Umar as part of the 2025 NaijaVoices Micro-Grants Heritage project.
Dataset Statistics
Total Recordings
1,583 audio recordings with corresponding transcript files
Total duration: 02:40:32 (2 hours, 40 minutes, 32 seconds)
Average duration per recording: 7.87 seconds
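The card does not say how the duration figures were computed. As a rough sketch, they could be recomputed from the audio files with Python's standard `wave` module. The function names here are illustrative, and the sketch assumes the files are plain PCM WAV with a lowercase `.wav` extension; neither assumption is confirmed by the card.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def summarize_durations(audio_dir):
    """Return (file count, total seconds, mean seconds) for .wav files under audio_dir."""
    # Assumption: every recording is a .wav file somewhere below audio_dir.
    durations = [wav_duration_seconds(p) for p in sorted(Path(audio_dir).rglob("*.wav"))]
    total = sum(durations)
    mean = total / len(durations) if durations else 0.0
    return len(durations), total, mean


def format_hms(seconds):
    """Format a number of seconds as HH:MM:SS."""
    hours, rest = divmod(int(round(seconds)), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"
```

Calling `summarize_durations` on the extracted dataset folder and passing the total to `format_hms` gives a figure in the same HH:MM:SS form used above.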
Speakers
8 unique speakers contributing to the dataset
Speaker distribution:
Speaker_id_1: 274 recordings (17.3%)
Speaker_id_4: 347 recordings (21.9%)
Speaker_id_3: 311 recordings (19.6%)
Speaker_id_6: 213 recordings (13.5%)
Speaker_id_7: 146 recordings (9.2%)
Speaker_id_5: 120 recordings (7.6%)
Speaker_id_2: 87 recordings (5.5%)
Speaker_id_2 2: 85 recordings (5.4%)
Accent Distribution
The dataset represents three main Nupe accent varieties:
Bida accent: 777 recordings (49.1%)
Kutigi accent: 487 recordings (30.8%)
Lapai accent: 318 recordings (20.1%)
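The accent counts above sum to 1,582, one fewer than the 1,583 recordings reported under Total Recordings. The card does not state which denominator was used for the percentages. The short sketch below (the helper name `percentage_shares` is illustrative) recomputes the shares from the labelled counts; dividing by 1,583 instead gives the same values after rounding to one decimal place.

```python
def percentage_shares(counts, total=None, digits=1):
    """Return each count as a percentage of total (default: the sum of the counts)."""
    denominator = sum(counts.values()) if total is None else total
    return {label: round(100 * n / denominator, digits) for label, n in counts.items()}


# Counts copied from the accent distribution listed above.
accent_counts = {"Bida": 777, "Kutigi": 487, "Lapai": 318}

labelled_total = sum(accent_counts.values())
shares = percentage_shares(accent_counts)
print(labelled_total, shares)
# prints: 1582 {'Bida': 49.1, 'Kutigi': 30.8, 'Lapai': 20.1}
```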
Age Range Distribution
25-30 years: 731 recordings (46.2%)
20-24 years: 346 recordings (21.9%)
35-40 years: 213 recordings (13.5%)
45-50 years: 172 recordings (10.9%)
30-35 years: 120 recordings (7.6%)
Gender Distribution
Female speakers: 923 recordings (58.3%)
Male speakers: 659 recordings (41.6%)
Geographic and Linguistic Information
Nationality: All speakers are from Nigeria (1,582 valid recordings)
Language: Nupe (1,582 valid recordings)
File Structure
Each recording consists of:
Audio file (.wav format)
Corresponding transcript file (.txt format)
Metadata entry in Metadata.csv with speaker information, file paths, and audio duration
Metadata Fields
The metadata CSV includes the following fields:
Speaker_ID: Unique identifier for each speaker
Transcript_File_Path: Relative path to the transcript file
Audio_File_Path: Relative path to the audio file
Accent: Nupe accent variety (Bida, Kutigi, or Lapai)
Age_Range: Age range of the speaker
Gender: Gender of the speaker (Male/Female)
Nationality: Country of origin (Nigeria)
Languages: Language spoken (Nupe)
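The fields above could be used to pair each audio file with its transcript. This minimal sketch makes several assumptions the card does not confirm: that Metadata.csv is comma-delimited and UTF-8 encoded, that its header row uses the field names exactly as listed, and that the relative paths resolve against the dataset root folder. The function names `load_pairs` and `count_by` are hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def load_pairs(dataset_root, metadata_name="Metadata.csv"):
    """Yield one dict per metadata row with the audio path and transcript text."""
    root = Path(dataset_root)
    with open(root / metadata_name, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumption: both paths are relative to the dataset root.
            transcript_path = root / row["Transcript_File_Path"]
            yield {
                "speaker": row["Speaker_ID"],
                "accent": row["Accent"],
                "audio_path": root / row["Audio_File_Path"],
                "transcript": transcript_path.read_text(encoding="utf-8").strip(),
            }


def count_by(pairs, key):
    """Count loaded pairs by one of their keys, e.g. 'speaker' or 'accent'."""
    return Counter(pair[key] for pair in pairs)
```

For example, `count_by(load_pairs("path/to/dataset"), "accent")` tallies recordings per accent, which can be compared against the distribution reported in this card.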