Digital Divide Data

United States
www.digitaldividedata.com/
For Profit

About us

Digital Divide Data is a research and data creation initiative focused on developing high-quality, open-source speech and language datasets for low-resource African languages. Our work supports the development of inclusive technologies, such as Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and machine translation, by creating ethically sourced, community-validated datasets.

Our current efforts concentrate on Kamba, Gusii, Luhya, and Somali, where we are collecting thousands of hours of audio and hundreds of thousands of validated sentences per language. We prioritize transparency, data protection, strong community involvement, and open access to resources that can strengthen innovation within the continent.

Goals for Sharing Data on Mozilla Data Collective

• Promote Open Access: Contribute to a global repository of open datasets that support researchers, developers, and organizations building technology for African languages.
• Strengthen Representation: Increase the availability of high-quality data for languages that are traditionally underrepresented in speech and language technologies.
• Encourage Ethical Data Creation: Align with Mozilla’s values by making ethically sourced, community-driven datasets freely accessible under open licenses.
• Foster Collaboration: Enable partnerships with academic, research, and technology communities working to improve multilingual AI systems.
• Advance Local Innovation: Support African organizations, startups, and researchers by providing foundational datasets for building localized AI tools and digital products.

Datasets

1 Dataset


Luhya ASR data subset 70 hours	CC-BY-4.0	luy	ASR	WAV, XLSX	13.90 GB