Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre. Access over 300 high-quality global datasets, built by and for the community in a transparent and ethical way.

Datasets

Institute of African Digital Humanities

Suundi-TTS-Dataset

The dataset consists of paired audio and text data on Suundi (sdj), a language spoken in Congo. The audio corpus consists of 4,187 clips read by one speaker ...

Task: TTS

Format: WAV, TSV

License: NOODL-1.0

Size: 240.50 MB

Created: 12/11/2025

Locale: sdj

Institute of African Digital Humanities

Mbosi-TTS-Dataset

The dataset consists of paired audio and text data on Mbosi (mdw), a language spoken in Congo. The audio corpus consists of 2,575 clips read by one speaker t...

Task: TTS

Format: WAV, TSV

License: NOODL-1.0

Size: 644.39 MB

Created: 12/11/2025

Locale: mdw

Institute of African Digital Humanities

Beembe-TTS-Dataset

The dataset consists of paired audio and text data on Beembe (beq), a language spoken in Congo. The audio corpus consists of 6,933 clips read by one speaker ...

Task: TTS

Format: WAV, TSV

License: NOODL-1.0

Size: 861.46 MB

Created: 12/11/2025

Locale: beq

Akylai

KyrgyzLLM-Bench: Kyrgyz LLM Evaluation Dataset

KyrgyzLLM-Bench is a comprehensive suite purpose-built to evaluate LLMs’ deep understanding and reasoning in Kyrgyz. It combines natively authored benchmarks...

Task: LLM

Format: PARQUET

License: mixed

Size: 87.20 MB

Created: 12/10/2025

Locale: ky

Institute of African Digital Humanities

Yaka-TTS-Dataset

Paired audio and text data on Yaka (also known as West Teke), a language spoken in Congo. The audio corpus consists of 7,648 clips read by one speaker for a ...

Task: TTS

Format: WAV, TSV

License: NOODL-1.0

Size: 1.26 GB

Created: 12/10/2025

Locale: iyx

Institute of African Digital Humanities

Akoose-ALCAM-MultimodalDataset

This dataset comprises a datasheet of Akoose (bss) lexical entries collected from the 'Western-Bakossi' subgroup. Each entry is accompanied by illustrative s...

Task: NLP

Format: MP3, TSV

License: NOODL-1.0

Size: 16.05 MB

Created: 12/10/2025

Locale: bss

Institute of African Digital Humanities

Kituba-TTS-Dataset

Paired audio and text data on Kituba (mkw), a language spoken in Congo. The audio corpus consists of 8,302 clips read by one speaker, totalling 350 min 11.98...

Task: TTS

Format: WAV, TSV

License: NOODL-1.0

Size: 553.28 MB

Created: 12/10/2025

Locale: mkw

Taraaz

Multilingual Humanitarian Response Eval (MHRE)

This multilingual humanitarian dataset contains 655 annotated datapoints evaluating AI chatbot safety and quality in migration and asylum scenarios across fo...

Task: LLM

Format: CSV

License: CC-BY-NC-SA-4.0

Size: 2.15 MB

Created: 12/8/2025

Locale: mul

Forum for Language Initiatives

Hussain Faizy Indus Kohistani Corpus

The Indus Kohistani corpus contains around 500k tokens of folktales, stories, poetry, biographies, and conversational texts, all transcribed with a consiste...

Task: NLP

Format: TXT

License: CC-BY-SA-4.0

Size: 14.70 MB

Created: 12/8/2025

Locale: mvy

Institute of African Digital Humanities

Ewondo-Yanda-ALCAM-MultimodalDataset

This dataset comprises a datasheet of Ewondo (ewo) lexical entries collected in the Yanda subgroup. Each entry is accompanied by illustrative sentences, word...

Task: NLP

Format: MP3, TSV

License: NOODL-1.0

Size: 18.09 MB

Created: 12/7/2025

Locale: ewo

Open Home Foundation

Flemishguy 1.0

Text-to-speech dataset for Dutch, male speaker, approximately 1 hour of read speech.

Task: TTS

Format: FLAC

License: CC0-1.0

Size: 73.69 MB

Created: 12/6/2025

Locale: nl-BE

Open Home Foundation

Faber 1.0

Text-to-speech dataset for Brazilian Portuguese, male speaker, approximately 1.5 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 30.98 MB

Created: 12/6/2025

Locale: pt-BR

JOIN THE MOVEMENT

Join Mozilla Data Collective

Mozilla Data Collective wants to radically reimagine our data as power. We are anti-extractivism, anti-monopoly and deeply, profoundly pro-people. We are a collective of linguists, technologists, activists, researchers and creatives who want AI to be all it promises to be, not all it threatens to be. Here, you can share your datasets on your own terms.

FAQs

Find answers quickly

What is Mozilla Data Collective?

Mozilla Data Collective is a platform in the truest sense. It’s yours to stand on, and make of it what you will. We have roots in two Mozilla projects: Common Voice, a CC0 public dataset to help tech speak your language, and the Data Futures Lab, an experimental space for instigating new approaches to data stewardship challenges. Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it.


How does it work?

We partner with organizations and individuals to make their data available through Mozilla Data Collective. You can share openly, using existing licenses like Creative Commons, or you can build your own. You can open up your data for everyone or just for some types of downloaders, set custom constraints, and ask for exchange, compensation or recognition. You can govern it as an individual, a co-operative, a trust or something else. After all, it’s your data. The people who access your datasets are authenticated and held to legally binding contracts, and we have a number of dataset protection features. If you are interested in hosting data on Mozilla Data Collective, please reach out to us at mozilladatacollective@mozillafoundation.org.


Who is behind Mozilla Data Collective?

We are backed and stewarded by Mozilla Foundation, the non-profit, movement-building, and philanthropy arm of Mozilla.