Datasets

Search results for “support”
OpenCSG

smoltalk-chinese

SmolTalk-Chinese: A multi-task Chinese conversational dataset covering 19 typical dialogue task scenarios.
License: Apache-2.0
Locale: zh
Task: LLM
Format: parquet
Size: 879.81 MB

Institute of African Digital Humanities

Ewondo_Fong_ALCAM-MultimodalDataset

A multimodal linguistic resource comprising a curated datasheet of example sentences in Ewondo (Fong variety) and their French equivalents, along with their corresponding audio recordings and a sentence–audio alignment file. It is designed to support research, documentation and pedagogy in the field of speech and language technology for under-resourced African languages.
License: NOODL-1.0
Locale: ewo
Task: NLP
Format: MP3, TSV
Size: 16.80 MB

Common Voice

Common Voice Scripted Speech 24.0 - Loja Highland Kichwa

A collection of scripted spoken phrases in Loja Highland Kichwa.
License: CC0-1.0
Locale: qvj
Task: ASR
Format: MP3
Size: 221.72 MB

Collaborative Action For Research & Development (CARD)

Gawri (گاؤری) Magazine Corpus

The Gawri (گاؤری) Magazine Corpus is a curated collection of monthly Gawri magazine text (~67,724 tokens) that supports research and language technology.
License: CC-BY-NC-4.0
Locale: gwc
Task: NLP
Format: TXT
Size: 146.71 KB

Common Voice

Common Voice Scripted Speech 24.0 - Losso

A collection of scripted spoken phrases in Losso.
License: CC0-1.0
Locale: nmz
Task: ASR
Format: MP3
Size: 205.70 MB

MEDIAMEN

English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)

This parallel sentence corpus contains 30,405 aligned sentence pairs totaling approximately 0.62 million tokens, curated from the archival materials of Mediamen (an advertising agency). The sentences were professionally translated from English into Punjabi (Shahmukhi) and are intended to support machine translation, linguistic research, and Punjabi language technology development, particularly for real-world and contemporary language use.
License: CC-BY-NC-4.0
Locale: en-PK, pnb
Task: MT
Format: CSV
Size: 1.08 MB

Common Voice

Common Voice Scripted Speech 24.0 - Xitsonga

A collection of scripted spoken phrases in Xitsonga.
License: CC0-1.0
Locale: ts
Task: ASR
Format: MP3
Size: 1016.43 KB

Institute of African Digital Humanities

Adamawa Fulfulde-French Parallel Corpus of Narratives 1.2

This dataset is an updated version of the 'Adamawa Fulfulde–French Parallel Corpus of Narratives 1.1'.
License: NOODL-1.0
Locale: fub
Task: MT
Format: TSV
Size: 112.17 KB

Common Voice

Common Voice Scripted Speech 24.0 - Cantonese

A collection of scripted spoken phrases in Cantonese.
License: CC0-1.0
Locale: yue
Task: ASR
Format: MP3
Size: 5.98 GB

ComparIA

Compar:IA conversations

French conversation and preference dataset for conversational AI, with 396K conversations from 50+ LLMs.
License: Etalab 2.0
Locale: fr
Task: NLG
Format: PARQUET
Size: 1.81 GB

Common Voice

Common Voice Scripted Speech 24.0 - Kom

A collection of scripted spoken phrases in Kom.
License: CC0-1.0
Locale: bkm
Task: ASR
Format: MP3
Size: 253.86 MB

Universitas Gadjah Mada

Jember Javanese Spontaneous Speech Corpus

A 10-hour spoken dataset of native Javanese speakers from Jember, East Java, representing the Jember dialect and Pandhalungan variety in natural speech.
License: CC-BY-NC-SA-4.0
Locale: jav
Task: ASR
Format: MP3, TSV
Size: 271.65 MB

Common Voice

Mozilla Common Voice Spontaneous Speech ASR Shared Task Train/Dev Data

This datasheet is for the bundle of Mozilla Common Voice spontaneous speech datasets to be used in the Shared Task on Spontaneous Speech.
License: CC0-1.0
Locale: mul
Task: ASR
Format: mp3
Size: 4.30 GB

Balochistan Educational and Cultural Organization

Western Balochi Literature Corpus

A cleaned literary corpus in Western Balochi (Rakhshani), including articles, research, translations, and creative literature (~1.1 million tokens).
License: CC-BY-NC-4.0
Locale: bgn
Task: NLP
Format: TXT
Size: 2.26 MB

Common Voice

Common Voice Scripted Speech 24.0 - Paiwan

A collection of scripted spoken phrases in Paiwan.
License: CC0-1.0
Locale: pwn
Task: ASR
Format: MP3
Size: 280.68 MB

Common Voice

Common Voice 7.0 - Single Word Target Segment

This dataset contains the numbers 0 to 9 and the words "yes" and "no" in 34 languages. It contains 84 validated hours of speech.
License: CC0-1.0
Locale: mul
Task: ASR
Format: TSV, MP3
Size: 3.51 GB

Balochistan Educational and Cultural Organization

Brahui Research Work Corpus

This corpus consists of research works, academic theses, and scholarly papers, comprising approximately 185,000 tokens. It covers a range of academic topics and formal registers, reflecting standardized writing practices and disciplinary conventions. The corpus is intended to support linguistic analysis, discourse studies, and computational research on academic and research-oriented text.
License: CC-BY-NC-SA-4.0
Locale: brh
Task: NLP
Format: TXT
Size: 1.13 MB

Common Voice

Common Voice Scripted Speech 24.0 - Mokpwe

A collection of scripted spoken phrases in Mokpwe.
License: CC0-1.0
Locale: bri
Task: ASR
Format: MP3
Size: 188.52 MB

Forum for Language Initiatives

Kohistani Shina Word List

A UTF-8 encoded Kohistani Shina dictionary and word list corpus with lexical entries, meanings, and annotations for linguistic research and NLP use.
License: CC-BY-NC-4.0
Locale: plk
Task: NLP
Format: TXT
Size: 394.05 KB

Common Voice

Common Voice Scripted Speech 24.0 - Toki Pona

A collection of scripted spoken phrases in Toki Pona.
License: CC0-1.0
Locale: tok
Task: ASR
Format: MP3
Size: 464.92 MB

Forum for Language Initiatives

Khowar Word List

A UTF-8 encoded Khowar lexical corpus containing alphabet definitions and a structured word list for linguistic research and language technology development.
License: CC-BY-NC-4.0
Locale: khw
Task: NLP
Format: TXT
Size: 64.22 KB

Community

Thorsten-Voice Dataset 2021.06 Emotional

German emotional speech dataset (2,400 recordings, 8 emotions), CC0 licensed, 22,050 Hz mono WAV, for TTS and speech research.
License: CC0-1.0
Locale: de-DE
Task: TTS
Format: WAV, CSV
Size: 380.80 MB

Forum for Language Initiatives

Khowar Literature Corpus by FLI

A multi-genre Khowar language corpus designed for linguistic research, NLP applications, and cultural documentation.
License: CC-BY-NC-4.0
Locale: khw
Task: NLP
Format: TXT
Size: 244.85 KB

Rerooted Archive

ReRooted: Speech Corpus of Testimonials from Armenian Refugees and Immigrants

ReRooted is an open-access online corpus of YouTube interviews with Armenian refugees and immigrants. The online corpus currently has over 80 hours of recordings on YouTube, alongside Armenian subtitles from Amara. The current dataset is a 10-hour subset of the corpus that we have curated with corrected and finely annotated transcripts. The dataset is intended for NLP research, and we hope to keep updating it with more curated transcripts.
License: GPL-3.0
Locale: hy
Task: ASR
Format: WAV, TEXTGRID
Size: 3.25 GB