Datasets
Institute of African Digital Humanities
Suundi-TTS-Dataset
The dataset consists of paired audio and text data on Suundi (sdj), a language spoken in Congo. The audio corpus consists of 4,187 clips read by one speaker ...
Task: TTS
Format: WAV, TSV
License: NOODL-1.0
Size: 240.50 MB
Created: 12/11/2025
Locale: sdj
Institute of African Digital Humanities
Mbosi-TTS-Dataset
The dataset consists of paired audio and text data on Mbosi (mdw), a language spoken in Congo. The audio corpus consists of 2,575 clips read by one speaker t...
Task: TTS
Format: WAV, TSV
License: NOODL-1.0
Size: 644.39 MB
Created: 12/11/2025
Locale: mdw
Institute of African Digital Humanities
Beembe-TTS-Dataset
The dataset consists of paired audio and text data on Beembe (beq), a language spoken in Congo. The audio corpus consists of 6,933 clips read by one speaker ...
Task: TTS
Format: WAV, TSV
License: NOODL-1.0
Size: 861.46 MB
Created: 12/11/2025
Locale: beq
Akylai
KyrgyzLLM-Bench: Kyrgyz LLM Evaluation Dataset
KyrgyzLLM-Bench is a comprehensive suite purpose-built to evaluate LLMs’ deep understanding and reasoning in Kyrgyz. It combines natively authored benchmarks...
Task: LLM
Format: PARQUET
License: mixed
Size: 87.20 MB
Created: 12/10/2025
Locale: ky
Institute of African Digital Humanities
Yaka-TTS-Dataset
Paired audio and text data on Yaka (also known as West Teke), a language spoken in Congo. The audio corpus consists of 7,648 clips read by one speaker for a ...
Task: TTS
Format: WAV, TSV
License: NOODL-1.0
Size: 1.26 GB
Created: 12/10/2025
Locale: iyx
Institute of African Digital Humanities
Akoose-ALCAM-MultimodalDataset
This dataset comprises a datasheet of Akoose (bss) lexical entries collected from the 'Western-Bakossi' subgroup. Each entry is accompanied by illustrative s...
Task: NLP
Format: MP3, TSV
License: NOODL-1.0
Size: 16.05 MB
Created: 12/10/2025
Locale: bss
Institute of African Digital Humanities
Kituba-TTS-Dataset
Paired audio and text data on Kituba (mkw), a language spoken in Congo. The audio corpus consists of 8,302 clips read by one speaker, totalling 350 min 11.98...
Task: TTS
Format: WAV, TSV
License: NOODL-1.0
Size: 553.28 MB
Created: 12/10/2025
Locale: mkw
Taraaz
Multilingual Humanitarian Response Eval (MHRE)
This multilingual humanitarian dataset contains 655 annotated datapoints evaluating AI chatbot safety and quality in migration and asylum scenarios across fo...
Task: LLM
Format: CSV
License: CC-BY-NC-SA-4.0
Size: 2.15 MB
Created: 12/8/2025
Locale: mul
Forum for Language Initiatives
Hussain Faizy Indus Kohistani Corpus
The Indus Kohistani corpus contains around 500k tokens of folktales, stories, poetry, biographies, and conversational texts, all transcribed with a consiste...
Task: NLP
Format: TXT
License: CC-BY-SA-4.0
Size: 14.70 MB
Created: 12/8/2025
Locale: mvy
Institute of African Digital Humanities
Ewondo-Yanda-ALCAM-MultimodalDataset
This dataset comprises a datasheet of Ewondo (ewo) lexical entries collected in the Yanda subgroup. Each entry is accompanied by illustrative sentences, word...
Task: NLP
Format: MP3, TSV
License: NOODL-1.0
Size: 18.09 MB
Created: 12/7/2025
Locale: ewo
Open Home Foundation
Flemishguy 1.0
Text-to-speech dataset for Dutch, male speaker, approximately 1 hour of read speech.
Task: TTS
Format: FLAC
License: CC0-1.0
Size: 73.69 MB
Created: 12/6/2025
Locale: nl-BE
Open Home Foundation
Faber 1.0
Text-to-speech dataset for Brazilian Portuguese, male speaker, approximately 1.5 hours of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 30.98 MB
Created: 12/6/2025
Locale: pt-BR
Open Home Foundation
Darkman 1.0
Text-to-speech dataset for Polish, male speaker, approximately 2 hours of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 40.42 MB
Created: 12/6/2025
Locale: pl-PL
Open Home Foundation
Jeff 1.0
Text-to-speech dataset for Brazilian Portuguese, male speaker, approximately 1.5 hours of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 90.74 MB
Created: 12/6/2025
Locale: pt-BR
Open Home Foundation
Mihai 1.0
Text-to-speech dataset for Romanian, male speaker, approximately 2 hours of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 66.31 MB
Created: 12/6/2025
Locale: ro-RO
Open Home Foundation
Denis 1.0
Text-to-speech dataset for Russian, male speaker, approximately 2 hours of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 104.52 MB
Created: 12/6/2025
Locale: ru-RU
Open Home Foundation
Dmitri 1.0
Text-to-speech dataset for Russian, male speaker, approximately 2 hours of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 96.63 MB
Created: 12/6/2025
Locale: ru-RU
Open Home Foundation
Lili 1.0
Text-to-speech dataset for Slovak, female speaker, approximately 2 hours of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 72.38 MB
Created: 12/6/2025
Locale: sk-SK
Open Home Foundation
Ronnie 1.0
Text-to-speech dataset for Dutch, male speaker, approximately 2 hours of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 106.23 MB
Created: 12/6/2025
Locale: nl-NL
Open Home Foundation
Cadu 1.0
Text-to-speech dataset for Brazilian Portuguese, male speaker, approximately 1.5 hours of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 30.98 MB
Created: 12/6/2025
Locale: pt-BR
Open Home Foundation
Tugão 1.0
Text-to-speech dataset for Portuguese, male speaker, approximately 1.5 hours of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 61.84 MB
Created: 12/6/2025
Locale: pt-PT
Open Home Foundation
Gosia 1.0
Text-to-speech dataset for Polish, female speaker, approximately 2 hours of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 39.75 MB
Created: 12/6/2025
Locale: pl-PL
Open Home Foundation
Pim 1.0
Text-to-speech dataset for Dutch, male speaker, approximately 2 hours of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 108.08 MB
Created: 12/6/2025
Locale: nl-NL
Open Home Foundation
Nathalie 1.0
Text-to-speech dataset for Dutch, female speaker, approximately 1 hour of read speech.
Task: TTS
Format: WEBM
License: CC0-1.0
Size: 21.87 MB
Created: 12/6/2025
Locale: nl-BE
