Atyap Afwan_: Preserving Tyap Through Community-Driven Speech Data
License: CC-BY-NC-SA-4.0
Steward: NaijaVoices (Lanfrica Labs)
Task: NLP
Release Date: 12/2/2025
Format: WAV, TXT
Size: 251.51 MB
Description
This dataset contains 98 recordings (≈1.16 hours) of everyday Tyap speech from 10 community speakers, each paired with detailed transcripts and English translations.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
Use under the default (non-commercial) license is allowed only for academic, educational, or personal purposes.
To use the dataset (or derivatives) commercially, you must obtain a commercial waiver from NaijaVoices. Reach out at info@naijavoices.com.
Any published work or product using the dataset must give proper attribution to the dataset creators, including the NaijaVoices community (e.g., by citing their paper).
You must comply with all applicable data-protection and privacy laws when handling the dataset and metadata (e.g., the regulations relevant under the donor’s jurisdiction) and be transparent about your use.
Use must be ethical: you cannot use the dataset in a way that perpetuates stereotypes or biases about any group or community.
Do not use the dataset in ways that misrepresent, appropriate, or misuse cultural identities or expressions (e.g., avoid misuse that mis-frames cultural content for profit or manipulation).
Forbidden Usage
You must not attempt to identify or reveal the real identities of the voice donors (speakers) in the dataset.
Voice cloning, i.e. creating high-fidelity replicas of individual speakers, is explicitly prohibited.
You may not use the dataset to build or train systems that generate hate speech, discriminatory language, or content that targets groups in harmful ways.
You may not use the dataset for surveillance, intrusive monitoring, or any privacy-violating applications.
Using the dataset to manipulate political discourse, influence elections, or perform political propaganda is forbidden.
It is forbidden to repurpose the dataset (or derivative datasets) to create another dataset that is “substantially similar in content, structure or purpose” for commercial redistribution or sale; i.e., you cannot re-host or resell the dataset or a derived dataset commercially.
Using the dataset to generate violent, inciting, or hateful content, or content promoting violence or aggression, is prohibited.
Metadata
Overview
This dataset contains audio recordings of the Tyap language, a Niger-Congo language spoken primarily in Nigeria. The dataset was collected by Raphael Musa Afana as part of the NaijaVoices Micro-Grants Heritage project.
Dataset Statistics
Total Recordings
98 audio recordings with corresponding transcript and translation files
Audio files are organized by speaker in directories (TYAP1 through TYAP10)
Each recording has a corresponding transcript file and translation file
Audio Duration
Total duration: ~1 hour 9 minutes 47 seconds (4,186.66 seconds)
Average duration per recording: ~42.3 seconds
Duration range:
Minimum: ~7.7 seconds
Maximum: ~2 minutes 37 seconds (156.9 seconds)
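The relationship between the figures in seconds and the hour/minute figures quoted above can be checked with a short sketch. The helper name `format_duration` is hypothetical and is not part of the dataset; the only inputs are the totals stated in this card.

```python
def format_duration(total_seconds: float) -> str:
    """Convert a duration in seconds to an 'Hh Mm Ss' string (rounded to whole seconds)."""
    whole = round(total_seconds)
    hours, remainder = divmod(whole, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"


TOTAL_SECONDS = 4186.66    # total duration stated in this card
LONGEST_SECONDS = 156.9    # maximum recording duration stated in this card

print(format_duration(TOTAL_SECONDS))     # 1h 9m 47s
print(format_duration(LONGEST_SECONDS))   # 0h 2m 37s
print(round(TOTAL_SECONDS / 3600, 2))     # 1.16 (hours)
```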
Speakers
10 unique speakers contributing to the dataset
Speaker distribution:
TYAP1: 16 recordings (16.3%)
TYAP2: 8 recordings (8.2%)
TYAP3: 9 recordings (9.2%)
TYAP4: 10 recordings (10.2%)
TYAP5: 14 recordings (14.3%)
TYAP6: 11 recordings (11.2%)
TYAP7: 5 recordings (5.1%)
TYAP8: 6 recordings (6.1%)
TYAP9: 10 recordings (10.2%)
TYAP10: 9 recordings (9.2%)
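As a quick consistency check, the percentages above follow from the per-speaker counts and the total of 98 recordings. The sketch below only re-derives the figures listed in this card; the variable names are illustrative.

```python
# Per-speaker recording counts as listed in this card.
recordings_per_speaker = {
    "TYAP1": 16, "TYAP2": 8, "TYAP3": 9, "TYAP4": 10, "TYAP5": 14,
    "TYAP6": 11, "TYAP7": 5, "TYAP8": 6, "TYAP9": 10, "TYAP10": 9,
}

total = sum(recordings_per_speaker.values())

# Share of the dataset contributed by each speaker, in percent (one decimal).
shares = {
    speaker: round(100 * count / total, 1)
    for speaker, count in recordings_per_speaker.items()
}

print(total)             # 98
print(shares["TYAP1"])   # 16.3
print(shares["TYAP7"])   # 5.1
```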
Gender Distribution
Female speakers: 45 recordings (45.9%)
Male speakers: 53 recordings (54.1%)
Age Range Distribution
18-29 years: 50 recordings (51.0%)
30-over years: 39 recordings (39.8%)
Unspecified: 9 recordings (9.2%)
Geographic and Linguistic Information
Country: All recordings are from Nigeria (98 recordings)
Language: Tyap (98 recordings)
File Structure
The dataset is organized in the following directory structure:
├── audios/
│   ├── TYAP1/
│   │   ├── Tyap1_F_001.wav
│   │   ├── Tyap1_F_002.wav
│   │   └── ...
│   ├── TYAP2/
│   │   └── ...
│   └── TYAP{1-10}/
│       └── [audio files organized by speaker]
├── transcripts/
│   ├── TYAP1/
│   │   ├── Tyap1_F_001.txt
│   │   ├── Tyap1_F_002.txt
│   │   └── ...
│   ├── TYAP2/
│   │   └── ...
│   └── TYAP{1-10}/
│       └── [transcript files organized by speaker]
├── translations/
│   └── translation for Tyap language/
│       ├── TYAP1/
│       │   ├── Tyap1_F_001.txt
│       │   └── ...
│       └── TYAP{1-10}/
│           └── [translation files organized by speaker]
├── metadata.csv
└── dataset-card.md
Each recording consists of:
Audio file (.wav format), organized in speaker-specific directories within audios/
Transcript file (.txt format) containing the original Tyap transcription, organized in speaker-specific directories within transcripts/
Translation file (.txt format) containing English translations, organized in speaker-specific directories within translations/translation for Tyap language/
Metadata entry in metadata.csv with speaker information and file references
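A minimal sketch of how the three files belonging to one recording could be located, assuming the directory layout shown above. The function name `paths_for` and the root folder name `data` are hypothetical; only the sub-directory names come from this card.

```python
from pathlib import Path

# Sub-directory under translations/, as shown in the directory structure above.
TRANSLATION_SUBDIR = "translation for Tyap language"


def paths_for(root, speaker_dir, stem):
    """Return the expected audio, transcript, and translation paths for one recording.

    root        -- folder where the dataset was unpacked (hypothetical name)
    speaker_dir -- speaker directory, e.g. "TYAP1"
    stem        -- file name without extension, e.g. "Tyap1_F_001"
    """
    root = Path(root)
    return {
        "audio": root / "audios" / speaker_dir / f"{stem}.wav",
        "transcript": root / "transcripts" / speaker_dir / f"{stem}.txt",
        "translation": root / "translations" / TRANSLATION_SUBDIR / speaker_dir / f"{stem}.txt",
    }


paths = paths_for("data", "TYAP1", "Tyap1_F_001")
for kind, path in paths.items():
    print(kind, path.as_posix())
```

The function only builds paths; it does not check that the files exist, so a caller may want to test `path.exists()` before opening anything.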
Text Encoding
All text files (transcripts, translations, and metadata CSV) are saved as UTF-8 with a BOM (Byte Order Mark). This encoding choice is critical for preserving the linguistic integrity of the dataset:
Diacritics and Non-ASCII Characters: The Tyap language uses diacritics and other non-ASCII characters that are essential for accurate representation of the language. UTF-8 with BOM ensures these characters are preserved without loss of detail.
Software Compatibility: The BOM (Byte Order Mark) allows for proper character display when accessing files with Microsoft Excel and other software that may not automatically detect UTF-8 encoding. This ensures that users working with different tools can access the data without character corruption or loss.
When working with these files, it is recommended to use text editors and software that support UTF-8 with BOM encoding to maintain the full linguistic detail of the transcriptions and translations.
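In Python, one way to handle this is the standard `utf-8-sig` codec, which strips a leading BOM on read (and writes one on save). The sketch below uses a temporary file with placeholder text rather than a real transcript; the helper name `read_text_file` is hypothetical.

```python
import tempfile
from pathlib import Path


def read_text_file(path):
    """Read a UTF-8 text file, dropping a leading BOM if one is present."""
    return Path(path).read_text(encoding="utf-8-sig")


# Demonstration on a temporary file. The text is a placeholder, not Tyap data.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    demo.write_text("placeholder é ñ", encoding="utf-8-sig")  # saved with a BOM

    raw = demo.read_text(encoding="utf-8")  # plain UTF-8 keeps the BOM character
    clean = read_text_file(demo)            # utf-8-sig removes it

print(repr(raw[0]))   # '\ufeff'
print(clean)          # placeholder é ñ
```

Reading with plain `utf-8` does not corrupt the diacritics, but it leaves an invisible `\ufeff` character at the start of the text, which can break string comparisons and CSV header lookups.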
Metadata Fields
The metadata CSV file includes the following fields:
SPEAKER ID: Unique identifier for each speaker (TYAP1 through TYAP10)
GENDER: Gender of the speaker (Male/Female)
AGE RANGE: Age range of the speaker (18-29, 30-over)
COUNTRY: Country where recording was made (Nigeria)
LANGUAGE: Language spoken (Tyap)
AUDIO FILE NAME: Name of the audio file (e.g., Tyap1_F_001.wav)
TRANSCRIPT: Name of the transcript file (e.g., Tyap1_F_001.txt)
TRANSLATION: Name of the translation file (e.g., Tyap1_F_001.txt)
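A minimal sketch of loading the metadata with Python's standard `csv` module, assuming the header row uses exactly the field names listed above (this has not been verified against the real file). The demonstration row is made up for illustration, including its age range, and `load_metadata` and `count_by` are hypothetical helper names.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path

# Field names as listed in this card (assumed to match the CSV header exactly).
FIELDS = [
    "SPEAKER ID", "GENDER", "AGE RANGE", "COUNTRY", "LANGUAGE",
    "AUDIO FILE NAME", "TRANSCRIPT", "TRANSLATION",
]


def load_metadata(path):
    """Load metadata rows as dictionaries; utf-8-sig strips the BOM from the first header."""
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


def count_by(rows, field):
    """Count rows per distinct value of one metadata field."""
    return Counter(row[field] for row in rows)


# Demonstration on a temporary CSV containing one illustrative (not real) row.
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = Path(tmp) / "metadata.csv"
    with open(demo_csv, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerow({
            "SPEAKER ID": "TYAP1", "GENDER": "Female", "AGE RANGE": "18-29",
            "COUNTRY": "Nigeria", "LANGUAGE": "Tyap",
            "AUDIO FILE NAME": "Tyap1_F_001.wav",
            "TRANSCRIPT": "Tyap1_F_001.txt", "TRANSLATION": "Tyap1_F_001.txt",
        })
    rows = load_metadata(demo_csv)

speaker_counts = count_by(rows, "SPEAKER ID")
print(speaker_counts)
```

Opening the file with `utf-8-sig` matters here: with plain `utf-8`, the first column would be keyed as `"\ufeffSPEAKER ID"` and lookups by `"SPEAKER ID"` would fail.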
File Naming Convention
Files follow a consistent naming pattern:
Format: {SpeakerID}_{Gender}_{SequenceNumber}.{extension}
Example: Tyap1_F_001.wav represents the first recording from TYAP1 (Female speaker)
Gender indicators: F for Female, M for Male
Sequence numbers are zero-padded (001, 002, etc.)
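The naming pattern can be parsed with a regular expression, sketched below. Two details are assumptions rather than documented facts: that the speaker part always has the form `Tyap` plus a number (as in the example), and that upper-casing it yields the directory name (`Tyap1` to `TYAP1`). The function name `parse_name` is hypothetical.

```python
import re

# {SpeakerID}_{Gender}_{SequenceNumber}.{extension}, e.g. Tyap1_F_001.wav
FILENAME_RE = re.compile(
    r"^(?P<speaker>[A-Za-z]+\d+)_(?P<gender>[FM])_(?P<seq>\d+)\.(?P<ext>wav|txt)$"
)


def parse_name(file_name):
    """Split a dataset file name into speaker ID, gender, sequence number, and extension."""
    match = FILENAME_RE.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected file name: {file_name!r}")
    return {
        "speaker_id": match["speaker"].upper(),  # assumed to match the directory name
        "gender": "Female" if match["gender"] == "F" else "Male",
        "sequence": int(match["seq"]),           # zero padding is dropped
        "extension": match["ext"],
    }


print(parse_name("Tyap1_F_001.wav"))
```

File names that do not fit the pattern raise a `ValueError` instead of being silently skipped, which makes deviations from the convention easy to spot.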