Datasets

Common Voice

Common Voice Spontaneous Speech 3.0 - Georgian

A collection of spontaneous responses to questions in Georgian.

License: CC0-1.0

Locale: ka

Task: ASR

Format: MP3

Size: 11.61 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Rakhine

A collection of spontaneous responses to questions in Rakhine.

License: CC0-1.0

Locale: rki

Task: ASR

Format: MP3

Size: 11.20 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Bashkir

A collection of spontaneous responses to questions in Bashkir.

License: CC0-1.0

Locale: ba

Task: ASR

Format: MP3

Size: 5.10 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Javanese

A collection of spontaneous responses to questions in Javanese.

License: CC0-1.0

Locale: jv

Task: ASR

Format: MP3

Size: 3.67 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Sinhala

A collection of spontaneous responses to questions in Sinhala.

License: CC0-1.0

Locale: si

Task: ASR

Format: MP3

Size: 2.52 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Dutch

A collection of spontaneous responses to questions in Dutch.

License: CC0-1.0

Locale: nl

Task: ASR

Format: MP3

Size: 2.42 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Shona

A collection of spontaneous responses to questions in Shona.

License: CC0-1.0

Locale: sn

Task: ASR

Format: MP3

Size: 1.53 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Bodo

A collection of spontaneous responses to questions in Bodo.

License: CC0-1.0

Locale: brx

Task: ASR

Format: MP3

Size: 1.30 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Thai

A collection of spontaneous responses to questions in Thai.

License: CC0-1.0

Locale: th

Task: ASR

Format: MP3

Size: 940.22 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Frisian

A collection of spontaneous responses to questions in Frisian.

License: CC0-1.0

Locale: fy-NL

Task: ASR

Format: MP3

Size: 323.25 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Croatian

A collection of spontaneous responses to questions in Croatian.

License: CC0-1.0

Locale: hr

Task: ASR

Format: MP3

Size: 285.11 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Danish

A collection of spontaneous responses to questions in Danish.

License: CC0-1.0

Locale: da

Task: ASR

Format: MP3

Size: 61.80 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Ruuli

A collection of spontaneous responses to questions in Ruuli.

License: CC0-1.0

Locale: ruc

Task: ASR

Format: MP3

Size: 365.95 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Irish

A collection of spontaneous responses to questions in Irish.

License: CC0-1.0

Locale: ga-IE

Task: ASR

Format: MP3

Size: 3.14 MB

EELLAK - GreekFOSS

Istorima

Oral history interviews from the Istorima archive (transcriptions and metadata) in Greek, covering social, cultural, and historical topics.

License: CC BY-NC-ND 4.0

Locale: gr-GR

Task: NLP

Format: PARQUET

Size: 416.02 MB

UP EEEI - Digital Signal Processing Laboratory

UP - DSP - Philippine Languages Database (UP-DSP-PLD)

A multilingual corpus for ten Philippine languages containing over 454 hours of recordings.

License: CC-BY-NC-4.0

Locale: phi

Task: ASR

Format: WAV, LOG

Size: 45.63 GB

Community

Urdu Multi-Speaker TTS Dataset

An Urdu multi-speaker TTS dataset distributed in 36 zip files, each containing audio files and a TSV mapping file, with approximately 10 hours of speech.

License: CC-BY-NC-4.0

Locale: urd

Task: TTS

Format: WEBM, TSV

Size: 514.54 MB

Balochistan Educational and Cultural Organization

BECO Brahui Literature Corpus

A ~355k-token Brahui literary corpus of short stories, novels, and other creative works for linguistic research and NLP.

License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 1.19 MB

Community

Malayalam Time-Aligned Speech Corpus

A Malayalam speech dataset containing 100 audio files with time-aligned .srt transcriptions from 5 speakers.

License: CC-BY-NC-4.0

Locale: mal

Task: ASR

Format: WAV, SRT

Size: 1.50 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part3

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.

License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 1.33 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part2

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.

License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 8.07 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part1

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.

License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 7.68 GB

Community

TODa: Tamazight Open Dataset

Welcome to the Tamazight Open Dataset (TODa), a groundbreaking open-source project dedicated to preserving and advancing the Tamazight language. With its extensive collection of linguistic data, TODa stands as a pioneering collaborative project for Tamazight <=> English translation, specifically designed for Natural Language Processing applications. TODa's unique approach combines both semantic and syntactic categorization methods, offering a rich representation of words in their various contexts and forms. The dataset encompasses a comprehensive collection of linguistic elements, including detailed verb conjugations across different tenses, noun variations, and an extensive compilation of translated expressions that capture the language's nuances. What sets TODa apart is its inclusive approach to Tamazight's writing systems. The dataset thoughtfully incorporates Latin alphabets, acknowledging and preserving the diverse writing traditions practiced across Amazigh communities. This dual-script approach ensures broader accessibility and cultural authenticity. Our vision is to establish TODa as the cornerstone resource for Tamazight Natural Language Processing. Through this meticulously curated dataset, we strive to empower developers and researchers to create innovative NLP solutions that authentically serve the Amazigh-speaking community. We take pride in our current progress, yet acknowledge that language documentation is an evolving journey. We actively encourage participation from the Amazigh technology community to contribute their expertise in expanding and refining the dataset. Through collaborative effort, we can create a robust foundation for technological innovations that honor and advance Amazigh linguistic heritage.

License: CC-BY-4.0

Locale: zgh

Task: NLP

Format: CSV

Size: 3.27 MB

Community

TTS Balinese Language

This TTS dataset contains Balinese as used in everyday activities.

License: CC-BY-SA-4.0

Locale: ban

Task: TTS

Format: WEBM, TSV

Size: 301.05 MB