Datasets
smoltalk-chinese
License: Apache-2.0
Locale: zh
Task: LLM
Format: PARQUET
Size: 879.81 MB
Common Voice v24 English - en-AU subset for Everything Open 2026
License: CC0-1.0
Locale: en-AU
Task: ASR
Format: CSV, MP3
Size: 1.92 GB
Ewondo_Fong_ALCAM-MultimodalDataset
License: NOODL-1.0
Locale: ewo
Task: NLP
Format: MP3, TSV
Size: 16.80 MB
Informes de Actividades InfoCDMX (Ponencia Laura Enríquez)
License: CC-BY-4.0
Locale: es-MX
Task: NLP
Format: PDF, XLSX
Size: 275.85 MB
Ficha de Documentación de Datos: Resoluciones InfoNL (Ponencia F. Guajardo)
License: CC-BY-4.0
Locale: es-MX
Task: NLP
Format: PDF, XLSX
Size: 1.07 GB
Compar:IA conversations
License: Etalab 2.0
Locale: fr
Task: NLG
Format: PARQUET
Size: 1.81 GB
English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)
License: CC-BY-NC-4.0
Locale: en-PK, pnb
Task: MT
Format: CSV
Size: 1.08 MB
HESEIA Sentence Bias Dataset
License: CC-BY-SA-4.0
Locale: es-AR
Task: OTH
Format: CSV
Size: 235.43 KB
RFE/RL Tatar-Bashkir News Text Corpus
License: CC-BY-NC-SA-4.0
Locale: tt, ba, ru
Task: NLP
Format: TXT
Size: 102.44 MB
ddd-kenya-luhya-70hrs-asr
License: CC-BY-4.0
Locale: luy
Task: ASR
Format: WAV, XLSX, TSV
Size: 13.90 GB
Effect AI Scripted Speech 1.0 - English
License: CC0-1.0
Locale: en
Task: TTS
Format: CSV, MP3
Size: 663.45 MB
DataTrust Africa: Speech Corpus of Public Radio Recordings from Northern Uganda
License: NOODL-1.0
Locale: en-US
Task: NLP
Format: MP3
Size: 179.82 MB
Khmer ASR Cultural Dataset
License: CC-BY-SA-4.0
Locale: khm
Task: ASR
Format: WAV
Size: 12.59 GB
Corpus of Panjebar Semangat Javanese-Language Magazine
License: CC-BY-SA-4.0
Locale: jav
Task: OTH
Format: TXT
Size: 4.31 MB
SI-NLI
License: CC-BY-NC-SA-4.0
Locale: sl
Task: NLU
Format: TSV
Size: 392.44 KB
Vallader Newspaper Corpus
License: CC0-1.0
Locale: rm-vallader
Task: OTH
Format: TSV
Size: 18.71 MB
Multilingual Religious Parallel Corpus (Kaleem Art Press)
License: CC-BY-SA-4.0
Locale: mul
Task: MT
Format: CSV
Size: 2.27 MB
Sindh Line Publishers
License: CC-BY-SA-4.0
Locale: snd
Task: NLP
Format: TXT
Size: 2.22 MB
Spoken-Congolese-French-Dataset
License: NOODL-1.0
Locale: fr-CG
Task: NLP
Format: MP3, WAV, TSV
Size: 3.44 GB
Ewondo_Mbida-Mbani_ALCAM-MultimodalDataset
License: NOODL-1.0
Locale: ewo
Task: NLP
Format: MP3, TSV
Size: 19.25 MB
Balochi Academy Text Corpus
License: CC-BY-NC-SA-4.0
Locale: bgn
Task: NLP
Format: TXT
Size: 1.88 MB
Mada Narratives
License: NOODL-1.0
Locale: mxu
Task: NLP
Format: TXT
Size: 65.04 KB
Surmiran Newspaper Corpus
License: CC0-1.0
Locale: rm-surmiran
Task: OTH
Format: TSV
Size: 11.89 MB
DhoNam: Dholuo Speech dataset
License: NOODL-1.0
Locale: luo
Task: ASR
Format: WEBM
Size: 2.49 GB