Oro_Word

License: CC0-1.0

Steward: Community

Task: TTS

Release Date: 3/24/2026

Format: .WAV, CSV

Size: 1.28 MB

Description

This dataset contains word-level recordings in Afaan Oromoo collected from native speakers to support the development of open-source speech technologies. The dataset is designed for training and evaluating automatic speech recognition (ASR) and text-to-speech (TTS) systems. Each audio file is paired with its corresponding written word and metadata. Afaan Oromoo is a widely spoken Cushitic language in Ethiopia and neighboring regions, but it remains underrepresented in digital language resources. This contribution aims to expand accessible linguistic data, support research and education, and strengthen the presence of Afaan Oromoo in modern AI technologies.

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

This dataset is released for open use and may be used for research, education, and commercial applications. Users must comply with the dataset license terms and with the restrictions listed under Forbidden Usage. The CC0-1.0 license does not itself require attribution.

Forbidden Usage

Users of this dataset agree to the following restrictions:

You must not attempt to identify or re-identify any individual speaker.

You must not use this dataset to clone voices or create systems that imitate specific speakers.

You must not use this dataset for malicious, deceptive, or harmful purposes.

You must not use this dataset to generate misleading or fraudulent audio content.

Any use that violates privacy rights, human rights, or applicable laws is strictly prohibited.

Processes

Ethical Review

All participants in this dataset were fully informed about the purpose of the data collection, the intended uses of the recordings, and their rights as contributors. Participants voluntarily provided their recordings after giving explicit consent. No personal identifying information was collected, and all data has been anonymized to protect privacy. The data collection process followed ethical standards for research with human subjects, ensuring transparency, voluntary participation, and respect for the dignity and confidentiality of all speakers. Contributors were informed that their recordings would be used solely for developing open-source speech technology and related educational and research purposes.

Intended Use

This dataset is intended for use in developing and evaluating open-source speech technologies for Afaan Oromoo, including automatic speech recognition (ASR) systems, text-to-speech (TTS) systems, and other voice-enabled applications. It is designed to support research, educational purposes, and the creation of accessible digital tools for Afaan Oromoo speakers.

Metadata

The dataset contains isolated word recordings collected using consumer-grade microphones and mobile devices in natural environments. Basic quality checks were performed to remove corrupted or unintelligible audio. File names are mapped to corresponding word transcripts through a metadata table included in the archive. Audio files are provided in WAV format to ensure compatibility with common speech processing tools. This dataset is intended to be expandable, and future versions may include additional speakers, dialect variations, and more vocabulary.
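
The file-name-to-transcript mapping described above could be consumed roughly as follows. This is a minimal sketch, not the dataset's official loader: the column names `file_name` and `word`, and the paths in the usage comment, are assumptions, since the card does not state the CSV header or archive layout. Adjust them to match the metadata table actually shipped in the archive.

```python
import csv
import wave
from pathlib import Path


def load_word_index(metadata_csv, audio_dir, file_col="file_name", word_col="word"):
    """Pair each metadata row with its WAV file and basic audio properties.

    file_col and word_col are ASSUMED column names (not documented in the
    dataset card); change them to match the real CSV header.
    """
    audio_dir = Path(audio_dir)
    entries = []
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            wav_path = audio_dir / row[file_col]
            if not wav_path.is_file():
                # Skip rows whose audio file is missing instead of failing.
                continue
            # The standard-library wave module reads uncompressed PCM WAV.
            with wave.open(str(wav_path), "rb") as w:
                rate = w.getframerate()
                duration = w.getnframes() / rate
            entries.append({
                "path": wav_path,
                "word": row[word_col],
                "sample_rate": rate,
                "duration_s": duration,
            })
    return entries


# Hypothetical usage (paths are placeholders, not the real archive layout):
#   for entry in load_word_index("metadata.csv", "audio"):
#       print(entry["word"], entry["sample_rate"], round(entry["duration_s"], 2))
```

Because the recordings come from consumer-grade microphones and mobile devices, checking the sample rate and duration of each clip in this way can help spot files that need resampling or trimming before training.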