Speech Data Collection for The Nupe Language

License:

CC-BY-NC-SA-4.0

Steward:

NaijaVoices (Lanfrica Labs)

Task: NLP

Release Date: 11/27/2025

Format: WAV, TXT

Size: 1.58 GB


Description

This dataset contains audio recordings of the Nupe language. It comprises 1,583 recordings totaling 2 hours, 40 minutes, and 32 seconds of speech, each paired with a transcript. The recordings come from 8 unique speakers representing three distinct Nupe accent varieties: Bida accent, Kutigi accent, and Lapai accent.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

  • Use under the default (non-commercial) license is allowed only for academic, educational, or personal purposes.

  • If you want to use the dataset (or derivatives) commercially, you must obtain a proper commercial waiver from NaijaVoices. Reach out at info@naijavoices.com.

  • Any published work or product using the dataset must give proper attribution to the dataset creators, including the NaijaVoices community (e.g., by citing their paper).

  • You must comply with all applicable data-protection and privacy laws in handling the dataset and metadata (e.g., the regulations relevant under the donor's jurisdiction) and be transparent about your use.

  • Use must be ethical: you cannot use the dataset in a way that perpetuates stereotypes or biases about any group or community.

  • Do not use the dataset in ways that misrepresent, appropriate, or misuse cultural identities or expressions (e.g., avoid misuse that mis-frames cultural content for profit or manipulation).

Forbidden Usage

  • You must not attempt to identify or reveal the real identities of the voice donors (speakers) in the dataset.

  • Voice cloning, i.e. creating high-fidelity replicas of individual speakers, is explicitly prohibited.

  • You may not use the dataset to build or train systems that generate hate speech, discriminatory language, or content that targets groups in harmful ways.

  • You may not use the dataset for surveillance, intrusive monitoring, or any privacy-violating applications.

  • Using the dataset to manipulate political discourse, influence elections, or perform political propaganda is forbidden.

  • It is forbidden to repurpose the dataset (or derivative datasets) to create another dataset that is “substantially similar in content, structure or purpose” for commercial redistribution or sale; i.e., you cannot re-host or resell the dataset or a derived dataset commercially.

  • Using the dataset to generate violent, inciting, or hateful content, or content promoting violence or aggression, is prohibited.

Metadata

Dataset Card for Speech Data Collection for The Nupe Language

Overview

This dataset contains audio recordings of the Nupe language, a Niger-Congo language spoken primarily in Nigeria. The dataset was collected by Umar Baba Umar as part of the 2025 NaijaVoices Micro-Grants Heritage project.

Dataset Statistics

Total Recordings

  • 1,583 audio recordings with corresponding transcript files

  • Total duration: 02:40:32 (2 hours, 40 minutes, 32 seconds)

  • Average duration per recording: 7.87 seconds

Speakers

  • 8 unique speakers contributing to the dataset

  • Speaker distribution:

    • Speaker_id_1: 274 recordings (17.3%)

    • Speaker_id_4: 347 recordings (21.9%)

    • Speaker_id_3: 311 recordings (19.6%)

    • Speaker_id_6: 213 recordings (13.5%)

    • Speaker_id_7: 146 recordings (9.2%)

    • Speaker_id_5: 120 recordings (7.6%)

    • Speaker_id_2: 87 recordings (5.5%)

    • Speaker_id_2 2: 85 recordings (5.4%)

Accent Distribution

The dataset represents three main Nupe accent varieties:

  • Bida accent: 777 recordings (49.1%)

  • Kutigi accent: 487 recordings (30.8%)

  • Lapai accent: 318 recordings (20.1%)

Age Range Distribution

  • 25-30 years: 731 recordings (46.2%)

  • 20-24 years: 346 recordings (21.9%)

  • 35-40 years: 213 recordings (13.5%)

  • 45-50 years: 172 recordings (10.9%)

  • 30-35 years: 120 recordings (7.6%)

Gender Distribution

  • Female speakers: 923 recordings (58.3%)

  • Male speakers: 659 recordings (41.6%)
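The percentages in the distributions above appear to be taken against all 1,583 recordings rather than the 1,582 with valid metadata. That denominator is an inference from the arithmetic, not something the card states. A minimal sketch that recomputes the gender shares under that assumption (the helper name `shares` is hypothetical):

```python
# Recording counts per gender, copied from the card.
gender_counts = {"Female": 923, "Male": 659}

# Assumption: shares are computed against the total number of recordings.
TOTAL_RECORDINGS = 1583


def shares(counts, total):
    """Return each count as a percentage of total, rounded to one decimal."""
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


gender_shares = shares(gender_counts, TOTAL_RECORDINGS)
print(gender_shares)
```

The same helper can be applied to the accent and age-range counts to check them against the listed percentages.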

Geographic and Linguistic Information

  • Nationality: All speakers are from Nigeria (1,582 valid recordings)

  • Language: Nupe (1,582 valid recordings)

File Structure

Each recording consists of:

  • Audio file (.wav format)

  • Corresponding transcript file (.txt format)

  • Metadata entry in Metadata.csv with speaker information, file paths, and audio duration
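As a rough illustration of working with one audio/transcript pair, the sketch below reads a WAV file's duration with Python's standard `wave` module and loads the matching transcript. It assumes uncompressed PCM WAV files and UTF-8 transcripts, neither of which is stated in this card; the function names are hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def load_pair(audio_path, transcript_path):
    """Return (duration_seconds, transcript_text) for one recording."""
    # Assumption: transcripts are UTF-8 encoded plain text.
    with open(transcript_path, encoding="utf-8") as f:
        text = f.read().strip()
    return wav_duration_seconds(audio_path), text
```

Durations computed this way could be compared with the audio duration recorded in Metadata.csv as a sanity check.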

Metadata Fields

The metadata CSV includes the following fields:

  • Speaker_ID: Unique identifier for each speaker

  • Transcript_File_Path: Relative path to the transcript file

  • Audio_File_Path: Relative path to the audio file

  • Accent: Nupe accent variety (Bida, Kutigi, or Lapai)

  • Age_Range: Age range of the speaker

  • Gender: Gender of the speaker (Male/Female)

  • Nationality: Country of origin (Nigeria)

  • Languages: Language spoken (Nupe)
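A minimal sketch of reading the metadata and tallying recordings by a field, using only the Python standard library. The sample rows and file paths below are made up for illustration; only the column names come from the field list above. The card also mentions an audio-duration entry in the metadata, but its column name is not listed here, so it is left out.

```python
import csv
import io
from collections import Counter

# Illustrative sample only: the values and paths are hypothetical.
SAMPLE_CSV = """Speaker_ID,Transcript_File_Path,Audio_File_Path,Accent,Age_Range,Gender,Nationality,Languages
Speaker_id_1,transcripts/a.txt,audio/a.wav,Bida,25-30,Female,Nigeria,Nupe
Speaker_id_3,transcripts/b.txt,audio/b.wav,Kutigi,20-24,Male,Nigeria,Nupe
Speaker_id_1,transcripts/c.txt,audio/c.wav,Bida,25-30,Female,Nigeria,Nupe
"""


def read_metadata(handle):
    """Parse an open CSV handle into a list of row dictionaries."""
    return list(csv.DictReader(handle))


def count_by(rows, field):
    """Count rows per value of `field`, skipping rows where it is empty."""
    return Counter(row[field] for row in rows if row.get(field))


rows = read_metadata(io.StringIO(SAMPLE_CSV))
accent_counts = count_by(rows, "Accent")
print(accent_counts)
```

For the real file, the handle would come from something like `open("Metadata.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.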