Datasets

Search results for “support”

OpenCSG

smoltalk-chinese

SmolTalk-Chinese: A multi-task Chinese conversational dataset covering 19 typical dialogue task scenarios.

License: Apache-2.0

Locale: zh

Task: LLM

Format: PARQUET

Size: 879.81 MB

Institute of African Digital Humanities

Ewondo_Fong_ALCAM-MultimodalDataset

A multimodal linguistic resource comprising a curated datasheet of example sentences in Ewondo (Fong variety) and their French equivalents, along with their corresponding audio recordings and a sentence–audio alignment file. It is designed to support research, documentation and pedagogy in the field of speech and language technology for under-resourced African languages.

License: NOODL-1.0

Locale: ewo

Task: NLP

Format: MP3, TSV

Size: 16.80 MB

Common Voice

Common Voice Scripted Speech 24.0 - Loja Highland Kichwa

A collection of scripted spoken phrases in Loja Highland Kichwa.

License: CC0-1.0

Locale: qvj

Task: ASR

Format: MP3

Size: 221.72 MB

Collaborative Action For Research & Development (CARD)

Gawri (گاؤری) Magazine Corpus

The Gawri (گاؤری) Magazine Corpus is a curated collection of monthly Gawri magazine text (~67,724 tokens) that supports research and language technology.

License: CC-BY-NC-4.0

Locale: gwc

Task: NLP

Format: TXT

Size: 146.71 KB

Common Voice

Common Voice Scripted Speech 24.0 - Losso

A collection of scripted spoken phrases in Losso.

License: CC0-1.0

Locale: nmz

Task: ASR

Format: MP3

Size: 205.70 MB

MEDIAMEN

English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)

This parallel sentence corpus contains 30,405 aligned sentence pairs, totaling approximately 0.62 million tokens, curated from the archival materials of Mediamen (Advertising Agency). The sentences were professionally translated from English into Punjabi (Shahmukhi) and are intended to support machine translation, linguistic research, and Punjabi language technology development, particularly for real-world and contemporary language use.

License: CC-BY-NC-4.0

Locale: en-PK, pnb

Task: MT

Format: CSV

Size: 1.08 MB

Common Voice

Common Voice Scripted Speech 24.0 - Xitsonga

A collection of scripted spoken phrases in Xitsonga.

License: CC0-1.0

Locale: ts

Task: ASR

Format: MP3

Size: 1016.43 KB

Institute of African Digital Humanities

Adamawa Fulfulde-French Parallel Corpus of Narratives 1.2

This dataset is an updated version of the 'Adamawa Fulfulde–French Parallel Corpus of Narratives 1.1'.

License: NOODL-1.0

Locale: fub

Task: MT

Format: TSV

Size: 112.17 KB

Common Voice

Common Voice Scripted Speech 24.0 - Cantonese

A collection of scripted spoken phrases in Cantonese.

License: CC0-1.0

Locale: yue

Task: ASR

Format: MP3

Size: 5.98 GB

ComparIA

Compar:IA conversations

French conversation and preference dataset for conversational AI, with 396K conversations from 50+ LLMs.

License: Etalab 2.0

Locale: fr

Task: NLG

Format: PARQUET

Size: 1.81 GB

Common Voice

Common Voice Scripted Speech 24.0 - Kom

A collection of scripted spoken phrases in Kom.

License: CC0-1.0

Locale: bkm

Task: ASR

Format: MP3

Size: 253.86 MB

Universitas Gadjah Mada

Jember Javanese Spontaneous Speech Corpus

A 10-hour spoken dataset of native Javanese speakers from Jember, East Java, representing the Jember dialect and Pandhalungan variety in natural speech.

License: CC-BY-NC-SA-4.0

Locale: jav

Task: ASR

Format: MP3, TSV

Size: 271.65 MB

Common Voice

Mozilla Common Voice Spontaneous Speech ASR Shared Task Train/Dev Data

This datasheet is for the bundle of Mozilla Common Voice spontaneous speech datasets to be used in the Shared Task on Spontaneous Speech.

License: CC0-1.0

Locale: mul

Task: ASR

Format: MP3

Size: 4.30 GB

Balochistan Educational and Cultural Organization

Western Balochi Literature Corpus

A cleaned literary corpus in Western Balochi (Rakhshani), including articles, research, translations, and creative literature (~1.1 million tokens).

License: CC-BY-NC-4.0

Locale: bgn

Task: NLP

Format: TXT

Size: 2.26 MB

Common Voice

Common Voice Scripted Speech 24.0 - Paiwan

A collection of scripted spoken phrases in Paiwan.

License: CC0-1.0

Locale: pwn

Task: ASR

Format: MP3

Size: 280.68 MB

Common Voice

Common Voice 7.0 - Single Word Target Segment

This dataset contains the numbers 0 to 9 and the words "yes" and "no" in 34 languages. It contains 84 validated hours of speech.

License: CC0-1.0

Locale: mul

Task: ASR

Format: TSV, MP3

Size: 3.51 GB

Balochistan Educational and Cultural Organization

Brahui Research Work Corpus

This corpus consists of research works, academic theses, and scholarly papers, comprising approximately 185,000 tokens. It covers a range of academic topics and formal registers, reflecting standardized writing practices and disciplinary conventions. The corpus is intended to support linguistic analysis, discourse studies, and computational research on academic and research-oriented text.

License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 1.13 MB

Common Voice

Common Voice Scripted Speech 24.0 - Mokpwe

A collection of scripted spoken phrases in Mokpwe.

License: CC0-1.0

Locale: bri

Task: ASR

Format: MP3

Size: 188.52 MB

Forum for Language Initiatives

Kohistani Shina Word List

A UTF-8 encoded Kohistani Shina dictionary and word list corpus with lexical entries, meanings, and annotations for linguistic research and NLP use.

License: CC-BY-NC-4.0

Locale: plk

Task: NLP

Format: TXT

Size: 394.05 KB

Common Voice

Common Voice Scripted Speech 24.0 - Toki Pona

A collection of scripted spoken phrases in Toki Pona.

License: CC0-1.0

Locale: tok

Task: ASR

Format: MP3

Size: 464.92 MB

Forum for Language Initiatives

Khowar Word List

A UTF-8 encoded Khowar lexical corpus containing alphabet definitions and a structured word list for linguistic research and language technology development.

License: CC-BY-NC-4.0

Locale: khw

Task: NLP

Format: TXT

Size: 64.22 KB

Community

Thorsten-Voice Dataset 2021.06 Emotional

German emotional speech dataset (2,400 recordings, 8 emotions), CC0 licensed, 22,050 Hz mono WAV, for TTS and speech research.

License: CC0-1.0

Locale: de-DE

Task: TTS

Format: WAV, CSV

Size: 380.80 MB

Forum for Language Initiatives

Khowar Literature Corpus by FLI

A multi-genre Khowar language corpus designed for linguistic research, NLP applications, and cultural documentation.

License: CC-BY-NC-4.0

Locale: khw

Task: NLP

Format: TXT

Size: 244.85 KB

Rerooted Archive

ReRooted: Speech Corpus of Testimonials from Armenian Refugees and Immigrants

ReRooted is an open-access online YouTube corpus of interviews with Armenian refugees and immigrants. The online corpus currently has over 80 hours of recordings on YouTube, alongside subtitles in Armenian from Amara. The current dataset is a 10-hour subset of the corpus that we have curated with corrected and finely annotated transcripts. The dataset is intended for NLP research, and we hope to keep updating it with more curated transcripts.

License: GPL-3.0

Locale: hy

Task: ASR

Format: WAV, TEXTGRID

Size: 3.25 GB