Join | Mozilla Data Collective

Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre. Access over 300 high-quality global datasets, built by and for the community in a transparent and ethical way.

Datasets

Mozilla Data Collective

Sermon-Malaysian-English

7 minutes of Malaysian-accented English speech

License: CC-BY-NC-4.0

Locale: en-MY

Task: ASR

Format: MP4, TXT, SRT

Size: 6.63 MB

Christine

Reading Recommendations List

A reading recommendations list of mainly fiction (fantasy, literary, mystery) books read 2022-2025.

License: CC0-1.0

Locale: en-US

Task: OTH

Format: CSV

Size: 16.24 KB

OpenCSG

chinese-cosmopedia

A large-scale high-quality Chinese text dataset developed by OpenCSG, containing ~15 million entries (≈60B tokens) covering multi-domain content (encyclopedia, education, etc.). Cleaned and deduplicated to remove low-quality content, it is optimized for large language model pretraining, text generation, and other Chinese NLP downstream tasks, compatible with mainstream toolchains (Hugging Face Datasets, PyTorch).

License: Apache-2.0

Locale: zh

Task: LLM

Format: parquet

Size: 6.09 GB

OpenCSG

smoltalk-chinese

SmolTalk-Chinese: A multi-task Chinese conversational dataset covering 19 typical dialogue task scenarios.

License: Apache-2.0

Locale: zh

Task: LLM

Format: parquet

Size: 879.81 MB

Common Voice

Common Voice v24 English - en-AU subset for Everything Open 2026

Common Voice v24 English filtered on the `accent` field for Australian-related accents.

License: CC0-1.0

Locale: en-AU

Task: ASR

Format: CSV, MP3

Size: 1.92 GB

Institute of African Digital Humanities

Adamawa Fulfulde - French Parallel Corpus of Narratives 1.0

Version 1.0 of the Adamawa Fulfulde–French Parallel Corpus of Narratives comprises 1,977 lines of Adamawa Fulfulde narratives and their French translations.

License: NOODL-1.0

Locale: fub

Task: MT

Format: TSV

Size: 112.50 KB

Institute of African Digital Humanities

Ewondo_Fong_ALCAM-MultimodalDataset

A multimodal linguistic resource comprising a curated datasheet of example sentences in Ewondo (Fong variety) and their French equivalents, along with their corresponding audio recordings and a sentence–audio alignment file. It is designed to support research, documentation and pedagogy in the field of speech and language technology for under-resourced African languages.

License: NOODL-1.0

Locale: ewo

Task: NLP

Format: MP3, TSV

Size: 16.80 MB

Amnesia

Informes de Actividades InfoCDMX (Ponencia Laura Enríquez)

The dataset comprises the annual activity and results reports of InfoCDMX, which document the performance of the transparency oversight body, its plenary, and the offices (ponencias) of the commissioners. These documents are fundamental accountability instruments that detail institutional management, quasi-jurisdictional activity (resolution of appeals for review and complaints), and the ...

License: CC-BY-4.0

Locale: es-MX

Task: NLP

Format: PDF, XLSX

Size: 275.85 MB

Amnesia

Ficha de Documentación de Datos: Resoluciones InfoNL (Ponencia F. Guajardo)

This dataset documents the rulings issued by the office (ponencia) of Commissioner Francisco Guajardo Martínez within the transparency oversight body of Nuevo León. It covers a significant period of his tenure (preliminarily identified as 2018 to 2025), reflecting disputes between citizens (information requesters) and obligated entities (government).

License: CC-BY-4.0

Locale: es-MX

Task: NLP

Format: PDF, XLSX

Size: 1.07 GB

ComparIA

Compar:IA conversations

French-language conversation and preference dataset with 396K conversations from 50+ LLMs

License: Etalab 2.0

Locale: fr

Task: NLG

Format: PARQUET

Size: 1.81 GB

MEDIAMEN

English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)

This parallel sentence corpus contains 30,405 aligned sentence pairs, totaling approximately 0.62 million tokens, curated from the archival materials of Mediamen (an advertising agency). The sentences were professionally translated from English into Punjabi (Shahmukhi) and are intended to support machine translation, linguistic research, and Punjabi language technology development, particularly for real-world and contemporary language use.

License: CC-BY-NC-4.0

Locale: en-PK, pnb

Task: MT

Format: CSV

Size: 1.08 MB

Fundación Vía Libre

HESEIA Sentence Bias Dataset

This repository contains a dataset collected during the teacher training course HESEIA Sentence Bias (Tools for Exploring Biases and Artificial Intelligence), organized by Vía Libre, the Ministry of Education, and FAMAF-UNC. The course had an initial enrollment of 370 participating teachers, who also involved over 5,000 students in building a dataset that reflects stereotypes present in Argentina.

License: CC-BY-SA-4.0

Locale: es-AR

Task: OTH

Format: CSV

Size: 235.43 KB

JOIN THE MOVEMENT

Join Mozilla Data Collective

Mozilla Data Collective wants to radically reimagine our data as power. We are anti-extractivism, anti-monopoly and deeply, profoundly pro-people. We are a collective of linguists, technologists, activists, researchers and creatives who want AI to be all it promises to be - not all it threatens to be. Here, you can share your datasets on your own terms.

FAQs

Find answers quickly

What is Mozilla Data Collective?

Mozilla Data Collective is a platform in the truest sense. It’s yours to stand on, and make of it what you will. We have roots in two Mozilla projects: Common Voice, a CC0 public dataset to help tech speak your language, and the Data Futures Lab, an experimental space for instigating new approaches to data stewardship challenges. Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it.

How does it work?

We partner with organizations and individuals to make their data available through Mozilla Data Collective. You can share openly, using existing licenses like Creative Commons, or you can build your own. You can open up your data to everyone, or just to some types of downloaders. You can also set custom constraints and ask for exchange, compensation or recognition. You can govern it as an individual, a co-operative, a trust or something else. After all, it’s your data. The people who access your datasets are authenticated and bound by legally binding contracts, and we have a number of dataset protection features. If you are interested in hosting data on Mozilla Data Collective, please reach out to us at mozilladatacollective@mozillafoundation.org.

Who is behind Mozilla Data Collective?

We are backed and stewarded by Mozilla Foundation - the non-profit, movement-building, and philanthropy arm of Mozilla.