Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre. Access over 300 high-quality global datasets, built by and for the community in a transparent and ethical way.

Datasets

Institute of African Digital Humanities

Ewondo-French Parallel Corpus

This dataset is a parallel corpus of Ewondo to French texts. Texts were obtained by transcription of raw audio files. Translations were added to enrich the ori...

Task: MT

Format: TSV

License: NOODL-1.0

Size: 137.84 KB

Created: 11/8/2025

Locale: ewo, fr

Open Home Foundation

Dimitar 1.0

Text to speech dataset for Bulgarian, male speaker, approximately 2 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 109.58 MB

Created: 11/7/2025

Locale: bg-BG

Tamahi Suneha Magazine

Punjabi Literature Corpus

This corpus contains 10,39,430 tokens of Punjabi text in Shahmukhi script.

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 1.83 MB

Created: 11/7/2025

Locale: pa-PK

Sujaak Adbi Sangat

Saraiki Quarterly Magazine Wasson Wehray Corpus

This corpus contains 11,79,200 tokens from the Saraiki quarterly magazine "Wasson Wehray".

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 2.09 MB

Created: 11/7/2025

Locale: skr

Rana Printers Multan

Urdu Literature Corpus

This corpus contains 16,82,700 tokens from multiple Urdu-language books.

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 3.00 MB

Created: 11/7/2025

Locale: ur

Bismillah Graphics Publishers

Urdu Literature Corpus

This corpus contains 16,17,074 tokens from multiple Urdu literature books published by Bismillah Graphics Publishers.

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 2.86 MB

Created: 11/7/2025

Locale: ur

Kaleem Art Press

Urdu Literature Corpus

This corpus contains multiple Urdu Language books of Stories, Short Stories, Novel, Travelogues, Poetry, Biography, Literature, History and other literary da...

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 2.85 MB

Created: 11/6/2025

Locale: ur

Kaleem Art Press

Saraiki Literature Corpus

This corpus contains multiple Saraiki-language books of stories, short stories, novels, travelogues, and sentences, as well as a collection of articles.

Task: OTH

Format: TXT

License: CC-BY-NC-4.0

Size: 1.84 MB

Created: 11/6/2025

Locale: skr

Unknown Organization

Podcast Hari Minggoean (Indonesia)

This dataset is derived from the "Hari Minggoean" podcast, featuring over ten hours of recorded speech from a single, consistent speaker. The content, tailor...

Task: ASR

Format: mp3

License: CC-BY-SA-4.0

Size: 338.92 MB

Created: 11/5/2025

Locale: id-ID

Kaltepetlahtol

Tetelancingo Nahuatl

Audio of Nahuatl from the Sierra Oeste de Puebla, transcribed, orthographically normalized, translated, and tagged.

Task: ASR

Format: .tsv, .wav

License: CC-BY-NC-4.0

Size: 952.98 MB

Created: 11/4/2025

Locale: nhi

Common Voice

Mozilla Common Voice Spontaneous Speech ASR Shared Task Train/Dev Data

This datasheet is for the bundle of Mozilla Common Voice spontaneous speech datasets to be used in the Shared Task on Spontaneous Speech.

Task: ASR

Format: mp3

License: CC0-1.0

Size: 4.30 GB

Created: 9/25/2025

Locale: mul

Common Voice

Common Voice Spontaneous Speech 1.0 - Papantla Totonac

A collection of spontaneous spoken phrases in Papantla Totonac.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 205.70 MB

Created: 9/15/2025

Locale: top

JOIN THE MOVEMENT

Join Mozilla Data Collective

Mozilla Data Collective wants to radically reimagine our data as power. We are anti-extractivism, anti-monopoly and deeply, profoundly pro-people. We are a collective of linguists, technologists, activists, researchers and creatives who want AI to be all it promises to be - not all it threatens to be. Here, you can share your datasets on your own terms.

FAQs

Find answers quickly

What is Mozilla Data Collective?

Mozilla Data Collective is a platform in the truest sense. It’s yours to stand on, and make of it what you will. We have roots in two Mozilla projects: Common Voice, a CC0 public dataset to help tech speak your language, and the Data Futures Lab, an experimental space for instigating new approaches to data stewardship challenges. Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it.


How does it work?

We partner with organizations and individuals to make their data available through Mozilla Data Collective. You can share openly, using existing licenses like Creative Commons, or you can build your own. You can open up your data for everyone, or just for some types of downloaders. You can set custom constraints and ask for exchange, compensation or recognition. You can govern it as an individual, a co-operative, a trust or something else. After all, it’s your data. The people who access your datasets are authenticated and held to legally binding contracts, and we have a number of dataset protection features. If you are interested in hosting data on Mozilla Data Collective, please reach out to us at mozilladatacollective@mozillafoundation.org.


Who is behind Mozilla Data Collective?

We are backed and stewarded by Mozilla Foundation - the non-profit, movement-building, and philanthropy arm of Mozilla.