Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre. Access 470+ high-quality global datasets, built by and for the community in a transparent and ethical way.

Datasets

Community

Bojonegoro Javanese TTS

This dataset contains synthetic speech data covering a variety of everyday-life topics in the Bojonegoro dialect of Javanese, spoken in East Java, Indonesia.

License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: .tar.gz, WEBM

Size: 469.50 MB

MIT

ATLAS Cross-Lingual Transfer Matrix

The Cross-Lingual Transfer Matrix from "ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining".

License: Apache-2.0

Locale: en-US

Task: NLP

Format: CSV

Size: 2.36 KB

Kaltepetlahtol

Zacatlán Tepetzintla Nahuatl ASR Dataset

A 14-hour ASR dataset of Nahuatl from Zacatlán and Tepetzintla, derived from the field recordings and transcription datasets of Amith et al. (2026).

License: CC-BY-ND-4.0

Locale: nhi

Task: ASR

Format: FLAC, TSV

Size: 789.98 MB

Taruen

Kyrgyz Folklore Text Corpus

A 427k-word Kyrgyz folklore corpus of tales, proverbs, and aphorisms, digitized from 5 Bishkek academic volumes (2016-2017) for NLP tasks.

License: CC0-1.0

Locale: ky

Task: NLP

Format: TXT

Size: 1.28 MB

OpenCSG

Fineweb-Edu-Chinese-v2.2

Fineweb-Edu-Chinese v2.2: an updated Chinese educational web dataset in the Fineweb series. Access via www.opencsg.com.

License: Apache-2.0

Locale: zh

Task: LLM

Format: parquet

Size: 624.68 MB

Community

Manggarai Language for NLP

The dataset consists of responses to various prompts written in the Manggarai language. These responses were subsequently read aloud and recorded.

License: CC-BY-NC-SA-4.0

Locale: mqy

Task: TTS

Format: WEBM, TSV

Size: 287.61 MB

Taruen

World Factbook (JSON)

A machine-readable JSON archive of the CIA World Factbook (Jan 2026 snapshot). Includes both standard developer and raw cache versions with image metadata.

License: CC0-1.0

Locale: en

Task: NLP

Format: JSON

Size: 7.10 MB

Balochi Academy

Eastern Balochi Literature Corpus

A UTF-8 normalized Eastern Balochi literature corpus (~1.9M tokens) covering poetry, folklore, novels, and cultural texts for linguistic research and NLP.

License: CC-BY-NC-4.0

Locale: bgp

Task: NLP

Format: TXT

Size: 949.67 KB

Nick Fox-Gieg

ABC-Draco

A GLTF Draco conversion of the NYU ABC-Dataset.

License: Onshape

Locale: en-US

Task: CV

Format: GLTF with Draco compression

Size: 43.32 GB

Universidad Nacional Autónoma de México, UNAM

Trabajo de Campo - Huave

An annotated audio corpus from the San Mateo del Mar region of Oaxaca, in Huave, a language of Indigenous communities of Mexico.

License: CC-BY-4.0

Locale: huv

Task: ASR

Format: MP3, TSV

Size: 538.25 MB

Forum for Language Initiatives

Gojri Literature Corpus

A curated Gojri (Gujari) text corpus of approximately 60K tokens covering poetry, stories, short stories, and literary prose.

License: CC-BY-NC-4.0

Locale: gju

Task: NLP

Format: TXT

Size: 117.97 KB

Forum for Language Initiatives

Khowar Literature Corpus by FLI

A multi-genre Khowar language corpus designed for linguistic research, NLP applications, and cultural documentation.

License: CC-BY-NC-4.0

Locale: khw

Task: NLP

Format: TXT

Size: 244.85 KB

IT'S EASY TO UPLOAD & CONTROL YOUR DATA

Upload your dataset

Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it. You can share openly, using existing licenses, or you can build your own.

JOIN THE MOVEMENT

Join Mozilla Data Collective

Mozilla Data Collective wants to radically reimagine our data as power. We are anti-extractivism, anti-monopoly and deeply, profoundly pro-people. We are a collective of linguists, technologists, activists, researchers and creatives who want AI to be all it promises to be - not all it threatens to be. Here, you can share your datasets on your own terms.

FAQs

Find answers quickly

What is Mozilla Data Collective?

Mozilla Data Collective is a platform in the truest sense. It’s yours to stand on, and make of it what you will. We have roots in two Mozilla projects: Common Voice, a CC0 public dataset to help tech speak your language, and the Data Futures Lab, an experimental space for instigating new approaches to data stewardship challenges. Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it.


How does it work?

We partner with organizations and individuals to make their data available through Mozilla Data Collective. You can share openly, using existing licenses like Creative Commons, or you can build your own. You can open up your data for everyone or just for some types of downloaders, set custom constraints, and ask for exchange, compensation or recognition. You can govern it as an individual, a co-operative, a trust or something else. After all, it’s your data. The people who access your datasets are authenticated and held to legally binding contracts, and we have a number of dataset protection features. If you are interested in hosting data on Mozilla Data Collective, please reach out to us at mozilladatacollective@mozillafoundation.org.


Who is behind Mozilla Data Collective?

We are backed and stewarded by Mozilla Foundation - the non-profit, movement-building, and philanthropy arm of Mozilla.