Datasets

Common Voice

Common Voice Spontaneous Speech 3.0 - Georgian

A collection of spontaneous responses to questions in Georgian.
License: CC0-1.0
Locale: ka
Task: ASR
Format: MP3
Size: 11.61 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Rakhine

A collection of spontaneous responses to questions in Rakhine.
License: CC0-1.0
Locale: rki
Task: ASR
Format: MP3
Size: 11.20 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Bashkir

A collection of spontaneous responses to questions in Bashkir.
License: CC0-1.0
Locale: ba
Task: ASR
Format: MP3
Size: 5.10 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Javanese

A collection of spontaneous responses to questions in Javanese.
License: CC0-1.0
Locale: jv
Task: ASR
Format: MP3
Size: 3.67 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Sinhala

A collection of spontaneous responses to questions in Sinhala.
License: CC0-1.0
Locale: si
Task: ASR
Format: MP3
Size: 2.52 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Dutch

A collection of spontaneous responses to questions in Dutch.
License: CC0-1.0
Locale: nl
Task: ASR
Format: MP3
Size: 2.42 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Shona

A collection of spontaneous responses to questions in Shona.
License: CC0-1.0
Locale: sn
Task: ASR
Format: MP3
Size: 1.53 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Bodo

A collection of spontaneous responses to questions in Bodo.
License: CC0-1.0
Locale: brx
Task: ASR
Format: MP3
Size: 1.30 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Thai

A collection of spontaneous responses to questions in Thai.
License: CC0-1.0
Locale: th
Task: ASR
Format: MP3
Size: 940.22 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Frisian

A collection of spontaneous responses to questions in Frisian.
License: CC0-1.0
Locale: fy-NL
Task: ASR
Format: MP3
Size: 323.25 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Croatian

A collection of spontaneous responses to questions in Croatian.
License: CC0-1.0
Locale: hr
Task: ASR
Format: MP3
Size: 285.11 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Danish

A collection of spontaneous responses to questions in Danish.
License: CC0-1.0
Locale: da
Task: ASR
Format: MP3
Size: 61.80 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Ruuli

A collection of spontaneous responses to questions in Ruuli.
License: CC0-1.0
Locale: ruc
Task: ASR
Format: MP3
Size: 365.95 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Irish

A collection of spontaneous responses to questions in Irish.
License: CC0-1.0
Locale: ga-IE
Task: ASR
Format: MP3
Size: 3.14 MB

EELLAK - GreekFOSS

Istorima

Oral history interviews from the Istorima archive (transcriptions and metadata) in Greek, covering social, cultural, and historical topics.
License: CC-BY-NC-ND-4.0
Locale: gr-GR
Task: NLP
Format: PARQUET
Size: 416.02 MB

UP EEEI - Digital Signal Processing Laboratory

UP - DSP - Philippine Languages Database (UP-DSP-PLD)

A multilingual corpus for ten Philippine languages containing over 454 hours of recordings.
License: CC-BY-NC-4.0
Locale: phi
Task: ASR
Format: WAV, LOG
Size: 45.63 GB

Community

Urdu Multi-Speaker TTS Dataset

An Urdu multi-speaker TTS dataset distributed in 36 zip files, each containing audio files and a TSV mapping file, with approximately 10 hours of speech.
License: CC-BY-NC-4.0
Locale: urd
Task: TTS
Format: WEBM, TSV
Size: 514.54 MB

Balochistan Educational and Cultural Organization

BECO Brahui Literature Corpus

A ~355k-token Brahui literary corpus of short stories, novels, and other creative works for linguistic research and NLP.
License: CC-BY-NC-SA-4.0
Locale: brh
Task: NLP
Format: TXT
Size: 1.19 MB

Community

Malayalam Time-Aligned Speech Corpus

A Malayalam speech dataset containing 100 audio files with time-aligned .srt transcriptions from 5 speakers.
License: CC-BY-NC-4.0
Locale: mal
Task: ASR
Format: WAV, SRT
Size: 1.50 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part3

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.
License: CC-BY-4.0
Locale: som
Task: ASR
Format: WAV, TSV
Size: 1.33 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part2

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.
License: CC-BY-4.0
Locale: som
Task: ASR
Format: WAV, TSV
Size: 8.07 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part1

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.
License: CC-BY-4.0
Locale: som
Task: ASR
Format: WAV, TSV
Size: 7.68 GB

Community

TODa: Tamazight Open Dataset

Welcome to the Tamazight Open Dataset (TODa), an open-source project dedicated to preserving and advancing the Tamazight language. TODa is a collaborative project for Tamazight <=> English translation, designed specifically for Natural Language Processing applications. It combines semantic and syntactic categorization methods, offering a rich representation of words in their various contexts and forms. The dataset includes detailed verb conjugations across different tenses, noun variations, and an extensive compilation of translated expressions that capture the language's nuances. What sets TODa apart is its inclusive approach to Tamazight's writing systems. The dataset incorporates Latin alphabets, acknowledging and preserving the diverse writing traditions practiced across Amazigh communities. This dual-script approach ensures broader accessibility and cultural authenticity. Our vision is to establish TODa as the cornerstone resource for Tamazight Natural Language Processing. Through this curated dataset, we strive to empower developers and researchers to create NLP solutions that authentically serve the Amazigh-speaking community. We take pride in our current progress, yet acknowledge that language documentation is an evolving journey. We actively encourage the Amazigh technology community to contribute their expertise in expanding and refining the dataset. Through collaborative effort, we can create a robust foundation for technological innovations that honor and advance Amazigh linguistic heritage.
License: CC-BY-4.0
Locale: zgh
Task: NLP
Format: CSV
Size: 3.27 MB

Community

TTS Balinese Language

A TTS dataset of Balinese as used in everyday activities.
License: CC-BY-SA-4.0
Locale: ban
Task: TTS
Format: WEBM, TSV
Size: 301.05 MB