Atyap Afwan: Preserving Tyap Through Community-Driven Speech Data

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

- Use under the default license is limited to non-commercial purposes: academic, educational, or personal use.
- To use the dataset (or derivatives) commercially, you must obtain a commercial waiver from NaijaVoices. Reach out at info@naijavoices.com.
- Any published work or product using the dataset must give proper attribution to the dataset creators, including the NaijaVoices community (e.g., by citing their paper).
- You must comply with all applicable data-protection and privacy laws when handling the dataset and metadata (e.g., the regulations of the donor's jurisdiction) and be transparent about your use.
- Use must be ethical: you may not use the dataset in a way that perpetuates stereotypes or biases about any group or community.
- Do not use the dataset in ways that misrepresent, appropriate, or misuse cultural identities or expressions, e.g., by framing cultural content misleadingly for profit or manipulation.

Forbidden Usage

- You must not attempt to identify or reveal the real identities of the voice donors (speakers) in the dataset.
- Voice cloning, i.e. creating high-fidelity replicas of individual speakers' voices, is explicitly prohibited.
- You may not use the dataset to build or train systems that generate hate speech, discriminatory language, or content that targets groups in harmful ways.
- You may not use the dataset for surveillance, intrusive monitoring, or any privacy-violating applications.
- Using the dataset to manipulate political discourse, influence elections, or produce political propaganda is forbidden.
- You may not repurpose the dataset (or derivative datasets) to create another dataset that is “substantially similar in content, structure or purpose” for commercial redistribution or sale; that is, you cannot re-host or resell the dataset or a derived dataset commercially.
- Using the dataset to generate violent, inciting, or hateful content, or content promoting violence or aggression, is prohibited.

Metadata

Overview

This dataset contains audio recordings of the Tyap language, a Niger-Congo language spoken primarily in Nigeria. The dataset was collected by Raphael Musa Afana as part of the NaijaVoices Micro-Grants Heritage project.

Dataset Statistics

Total Recordings

98 audio recordings, each with a corresponding transcript file and translation file
Audio files are organized by speaker in directories (TYAP1 through TYAP10)

Audio Duration

Total duration: ~1 hour 9 minutes 47 seconds (4,186.66 seconds)
Average duration per recording: ~42.7 seconds
Duration range:
- Minimum: ~7.7 seconds
- Maximum: ~2 minutes 37 seconds (156.9 seconds)
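
These figures can be recomputed from the audio files. The following is a minimal sketch, assuming the .wav files are standard PCM files readable by Python's built-in wave module and that they sit under the audios/ directory of the dataset. The function names are illustrative and not part of the dataset.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def summarize_durations(audio_root):
    """Collect count, total, mean, min and max duration over all .wav files under audio_root."""
    durations = [
        wav_duration_seconds(p) for p in sorted(Path(audio_root).rglob("*.wav"))
    ]
    if not durations:
        raise ValueError(f"no .wav files found under {audio_root}")
    return {
        "count": len(durations),
        "total": sum(durations),
        "mean": sum(durations) / len(durations),
        "min": min(durations),
        "max": max(durations),
    }


def format_hms(seconds):
    """Format seconds as 'Hh Mm Ss', rounding to the nearest whole second."""
    hours, rest = divmod(round(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"
```

For example, format_hms(4186.66) returns "1h 9m 47s", which matches the total duration stated above.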

Speakers

10 unique speakers contributing to the dataset
Speaker distribution:
- TYAP1: 16 recordings (16.3%)
- TYAP2: 8 recordings (8.2%)
- TYAP3: 9 recordings (9.2%)
- TYAP4: 10 recordings (10.2%)
- TYAP5: 14 recordings (14.3%)
- TYAP6: 11 recordings (11.2%)
- TYAP7: 5 recordings (5.1%)
- TYAP8: 6 recordings (6.1%)
- TYAP9: 10 recordings (10.2%)
- TYAP10: 9 recordings (9.2%)

Gender Distribution

Female speakers: 45 recordings (45.9%)
Male speakers: 53 recordings (54.1%)

Age Range Distribution

18-29 years: 50 recordings (51.0%)
30 years and over: 39 recordings (39.8%)
Unspecified: 9 recordings (9.2%)

Geographic and Linguistic Information

Country: All recordings are from Nigeria (98 recordings)
Language: Tyap (98 recordings)

File Structure

The dataset is organized in the following directory structure:

├── audios/
│   ├── TYAP1/
│   │   ├── Tyap1_F_001.wav
│   │   ├── Tyap1_F_002.wav
│   │   └── ...
│   ├── TYAP2/
│   │   └── ...
│   └── TYAP{1-10}/
│       └── [audio files organized by speaker]
├── transcripts/
│   ├── TYAP1/
│   │   ├── Tyap1_F_001.txt
│   │   ├── Tyap1_F_002.txt
│   │   └── ...
│   ├── TYAP2/
│   │   └── ...
│   └── TYAP{1-10}/
│       └── [transcript files organized by speaker]
├── translations/
│   └── translation for Tyap language/
│       ├── TYAP1/
│       │   ├── Tyap1_F_001.txt
│       │   └── ...
│       └── TYAP{1-10}/
│           └── [translation files organized by speaker]
├── metadata.csv
└── dataset-card.md

Each recording consists of:

Audio file (.wav format) organized in speaker-specific directories within audios/
Transcript file (.txt format) containing the original Tyap transcription, organized in speaker-specific directories within transcripts/
Translation file (.txt format) containing English translations, organized in speaker-specific directories within translations/translation for Tyap language/
Metadata entry in metadata.csv with speaker information and file references
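
The layout above means that the three files for one recording share a file stem and differ only in their top-level directory and extension. A minimal sketch of resolving them, assuming the directory names are exactly as shown in the tree; the helper name recording_paths and the "data" root are hypothetical:

```python
from pathlib import Path

# Directory names copied from the tree above; the translation folder name contains spaces.
TRANSLATION_SUBDIR = Path("translations") / "translation for Tyap language"


def recording_paths(root, speaker_id, audio_file_name):
    """Return the expected audio, transcript and translation paths for one recording."""
    root = Path(root)
    stem = Path(audio_file_name).stem  # e.g. "Tyap1_F_001"
    return {
        "audio": root / "audios" / speaker_id / f"{stem}.wav",
        "transcript": root / "transcripts" / speaker_id / f"{stem}.txt",
        "translation": root / TRANSLATION_SUBDIR / speaker_id / f"{stem}.txt",
    }


if __name__ == "__main__":
    # "data" is a placeholder for wherever the dataset was unpacked.
    for kind, path in recording_paths("data", "TYAP1", "Tyap1_F_001.wav").items():
        print(kind, path, "(exists)" if path.exists() else "(missing)")
```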

Text Encoding

All text files (transcripts, translations, and the metadata CSV) are encoded as UTF-8 with a BOM (Byte Order Mark). This encoding choice is critical for preserving the linguistic integrity of the dataset:

Diacritics and Non-ASCII Characters: The Tyap language uses diacritics and other non-ASCII characters that are essential for accurate representation of the language. UTF-8 can represent all of these characters, so they are preserved without loss of detail.
Software Compatibility: The BOM allows characters to display properly when files are opened with Microsoft Excel and other software that may not automatically detect UTF-8 encoding. This ensures that users working with different tools can access the data without character corruption or loss.

When working with these files, use text editors and software that handle UTF-8 with a BOM, so that the full linguistic detail of the transcriptions and translations is maintained. Tools that do not strip the BOM will see it as an extra U+FEFF character at the start of each file.
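
In Python, for instance, the standard utf-8-sig codec strips a leading BOM when reading and writes one when saving. A minimal sketch; the helper names are illustrative and not part of the dataset:

```python
from pathlib import Path


def read_text_file(path):
    """Read a transcript or translation, dropping a leading BOM if one is present."""
    # "utf-8-sig" decodes UTF-8 and removes a leading U+FEFF; files without a BOM also work.
    return Path(path).read_text(encoding="utf-8-sig")


def write_text_file(path, text):
    """Write text as UTF-8 with a BOM, matching the encoding used in the dataset."""
    Path(path).write_text(text, encoding="utf-8-sig")
```

Reading with plain "utf-8" would also succeed, but the returned string would then begin with the invisible U+FEFF character, which can break exact string comparisons.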

Metadata Fields

The metadata CSV file includes the following fields:

SPEAKER ID: Unique identifier for each speaker (TYAP1 through TYAP10)
GENDER: Gender of the speaker (Male/Female)
AGE RANGE: Age range of the speaker (18-29 or 30-over; unspecified for some recordings)
COUNTRY: Country where the recording was made (Nigeria)
LANGUAGE: Language spoken (Tyap)
AUDIO FILE NAME: Name of the audio file (e.g., Tyap1_F_001.wav)
TRANSCRIPT: Name of the transcript file (e.g., Tyap1_F_001.txt)
TRANSLATION: Name of the translation file (e.g., Tyap1_F_001.txt)
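
A minimal sketch of loading the metadata and tallying recordings per field with the Python standard library. It assumes the CSV header row uses exactly the field names listed above, and that an unspecified value is stored as an empty cell; both are assumptions to check against the actual file. The function names are illustrative.

```python
import csv
from collections import Counter


def load_metadata(csv_path):
    """Read metadata.csv into a list of dicts keyed by column header."""
    # utf-8-sig strips the BOM, so the first header is "SPEAKER ID" rather than "\ufeffSPEAKER ID".
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


def count_by(rows, field):
    """Count rows per distinct value of one field; empty cells are counted as 'Unspecified'."""
    return Counter((row.get(field) or "").strip() or "Unspecified" for row in rows)


if __name__ == "__main__":
    rows = load_metadata("metadata.csv")
    for field in ("SPEAKER ID", "GENDER", "AGE RANGE"):
        for value, count in sorted(count_by(rows, field).items()):
            print(f"{field} = {value}: {count} recordings ({100 * count / len(rows):.1f}%)")
```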

File Naming Convention

Files follow a consistent naming pattern:

Format: {SpeakerID}_{Gender}_{SequenceNumber}.{extension}
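Example: Tyap1_F_001.wav is the first recording from speaker TYAP1, a female speaker. File names write the speaker ID as Tyap1, while directory names and the SPEAKER ID field use TYAP1.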
Gender indicators: F for Female, M for Male
Sequence numbers are zero-padded (001, 002, etc.)
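
The convention can be checked and decoded with a regular expression. A minimal sketch, assuming every file name uses the "Tyap" capitalization from the example, a three-digit sequence number, and a .wav or .txt extension; the pattern and function name are illustrative and not part of the dataset:

```python
import re

# Assumed shape: "Tyap" + speaker number (1-10), gender letter, three-digit sequence, extension.
FILE_NAME_PATTERN = re.compile(
    r"Tyap(?P<number>10|[1-9])_(?P<gender>[FM])_(?P<sequence>\d{3})\.(?P<extension>wav|txt)"
)
GENDERS = {"F": "Female", "M": "Male"}


def parse_file_name(file_name):
    """Split a dataset file name into speaker ID, gender, sequence number and extension."""
    match = FILE_NAME_PATTERN.fullmatch(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    return {
        "speaker_id": f"TYAP{match['number']}",  # upper-case form used by the directories
        "gender": GENDERS[match["gender"]],
        "sequence": int(match["sequence"]),
        "extension": match["extension"],
    }
```

With these assumptions, parse_file_name("Tyap1_F_001.wav") returns speaker_id "TYAP1", gender "Female", sequence 1 and extension "wav", and a name that does not fit the pattern raises ValueError.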