Speech Data Collection for The Nupe Language

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

- Use under the default (non-commercial) license is allowed only for academic, educational, or personal purposes.
- If you want to use the dataset (or derivatives) commercially, you must obtain a commercial waiver from NaijaVoices. Reach out at info@naijavoices.com.
- Any published work or product using the dataset must give proper attribution to the dataset creators, including the NaijaVoices community, for example by citing their paper.
- You must comply with all applicable data-protection and privacy laws when handling the dataset and metadata (e.g. the regulations relevant under the donor’s jurisdiction) and be transparent about your use.
- Use must be ethical: you cannot use the dataset in a way that perpetuates stereotypes or biases about any group or community.
- Do not use the dataset in ways that misrepresent, appropriate, or misuse cultural identities or expressions, e.g. by mis-framing cultural content for profit or manipulation.

Forbidden Usage

- You must not attempt to identify or reveal the real identities of the voice donors (speakers) in the dataset.
- Voice cloning, i.e. creating high-fidelity replicas of individual speakers, is explicitly prohibited.
- You may not use the dataset to build or train systems that generate hate speech, discriminatory language, or content that targets groups in harmful ways.
- You may not use the dataset for surveillance, intrusive monitoring, or any privacy-violating applications.
- Using the dataset to manipulate political discourse, influence elections, or perform political propaganda is forbidden.
- It is forbidden to repurpose the dataset (or derivative datasets) to create another dataset that is “substantially similar in content, structure or purpose” for commercial redistribution or sale; i.e. you cannot re-host or resell the dataset or a derived dataset commercially.
- Using the dataset to generate violent, inciting, or hateful content, or content promoting violence or aggression, is prohibited.

Metadata

Dataset Card for Speech Data Collection for The Nupe Language

Overview

This dataset contains audio recordings of the Nupe language, a Niger-Congo language spoken primarily in Nigeria. The dataset was collected by Umar Baba Umar as part of the 2025 NaijaVoices Micro-Grants Heritage project.

Dataset Statistics

Total Recordings

1,583 audio recordings with corresponding transcript files
Total duration: 02:40:32 (2 hours, 40 minutes, 32 seconds)
Average duration per recording: 7.87 seconds
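
The duration figures above can be recomputed from a local copy of the audio. The following is a minimal sketch, assuming the .wav files are uncompressed PCM (which Python's standard `wave` module can read) and that they all sit somewhere under a single directory. The function names and the `audio_dir` argument are illustrative and not part of the dataset.

```python
import wave
from pathlib import Path


def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hms(total_seconds: float) -> str:
    """Format a duration in seconds as 'HH:MM:SS' (rounded to the nearest second)."""
    total = int(round(total_seconds))
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def wav_duration_seconds(path) -> float:
    """Duration of one PCM .wav file: number of frames divided by the frame rate."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def summarize_durations(audio_dir) -> dict:
    """Count, total duration, and mean duration of every .wav under audio_dir.

    audio_dir is a hypothetical location; point it at wherever the audio lives.
    """
    durations = [wav_duration_seconds(p) for p in sorted(Path(audio_dir).rglob("*.wav"))]
    total = sum(durations)
    mean = total / len(durations) if durations else 0.0
    return {
        "count": len(durations),
        "total_hms": seconds_to_hms(total),
        "mean_seconds": mean,
    }
```

For reference, `hms_to_seconds("02:40:32")` evaluates to 9632, the reported total duration expressed in seconds.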

Speakers

8 unique speakers contributing to the dataset
Speaker distribution:
- Speaker_id_1: 274 recordings (17.3%)
- Speaker_id_4: 347 recordings (21.9%)
- Speaker_id_3: 311 recordings (19.6%)
- Speaker_id_6: 213 recordings (13.5%)
- Speaker_id_7: 146 recordings (9.2%)
- Speaker_id_5: 120 recordings (7.6%)
- Speaker_id_2: 87 recordings (5.5%)
- Speaker_id_2 2: 85 recordings (5.4%)

Accent Distribution

The dataset represents three main Nupe accent varieties:

Bida accent: 777 recordings (49.1%)
Kutigi accent: 487 recordings (30.8%)
Lapai accent: 318 recordings (20.1%)

Age Range Distribution

25-30 years: 731 recordings (46.2%)
20-24 years: 346 recordings (21.9%)
35-40 years: 213 recordings (13.5%)
45-50 years: 172 recordings (10.9%)
30-35 years: 120 recordings (7.6%)

Gender Distribution

Female speakers: 923 recordings (58.3%)
Male speakers: 659 recordings (41.6%)
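
The gender counts add up to 1,582, one fewer than the 1,583 total recordings, and the two percentages add up to 99.9%. The short check below shows that the published percentages are consistent with dividing each count by 1,583 rather than by 1,582. Reading this as one recording without a gender label is an assumption, not something the card states.

```python
GENDER_COUNTS = {"Female": 923, "Male": 659}  # counts as published above
TOTAL_RECORDINGS = 1583  # total from the Dataset Statistics section


def percentages(counts, denominator):
    """Share of each label as a percentage, rounded to one decimal place."""
    return {label: round(100 * n / denominator, 1) for label, n in counts.items()}


labelled = sum(GENDER_COUNTS.values())
print(labelled)                                      # 1582
print(percentages(GENDER_COUNTS, TOTAL_RECORDINGS))  # {'Female': 58.3, 'Male': 41.6}
print(percentages(GENDER_COUNTS, labelled))          # {'Female': 58.3, 'Male': 41.7}
```

Only the 1,583 denominator reproduces the published 41.6% for male speakers.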

Geographic and Linguistic Information

Nationality: All speakers are from Nigeria (1,582 valid recordings)
Language: Nupe (1,582 valid recordings)

File Structure

Each recording consists of:

Audio file (.wav format)
Corresponding transcript file (.txt format)
Metadata entry in Metadata.csv with speaker information, file paths, and audio duration

Metadata Fields

The metadata CSV includes the following fields:

Speaker_ID: Unique identifier for each speaker
Transcript_File_Path: Relative path to the transcript file
Audio_File_Path: Relative path to the audio file
Accent: Nupe accent variety (Bida, Kutigi, or Lapai)
Age_Range: Age range of the speaker
Gender: Gender of the speaker (Male/Female)
Nationality: Country of origin (Nigeria)
Languages: Language spoken (Nupe)
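
The fields above can be read with Python's standard `csv` module. This is a minimal sketch, assuming Metadata.csv is UTF-8, comma-delimited, has a header row that matches the field names listed above, and stores paths relative to the dataset root. The helper names and the `dataset_root` argument are illustrative and not part of the dataset.

```python
import csv
from collections import Counter
from pathlib import Path


def load_metadata(csv_path):
    """Read Metadata.csv into a list of dicts keyed by the header names."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def count_by(rows, field):
    """Count recordings per value of one field, skipping rows where it is empty."""
    return Counter(row[field] for row in rows if row.get(field))


def read_transcript(dataset_root, row):
    """Return the transcript text for one metadata row.

    Assumes Transcript_File_Path is relative to dataset_root (hypothetical layout).
    """
    path = Path(dataset_root) / row["Transcript_File_Path"]
    return path.read_text(encoding="utf-8").strip()
```

With the rows loaded, `count_by(rows, "Accent")` or `count_by(rows, "Speaker_ID")` gives per-value counts that can be compared with the distributions reported earlier in this card.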