Datasets

Institute of African Digital Humanities

Suundi-TTS-Dataset

The dataset consists of paired audio and text data on Suundi (sdj), a language spoken in Congo. The audio corpus consists of 4,187 clips read by one speaker ...

Task: TTS

Format: WAV, TSV

License: NOODL-1.0

Size: 240.50 MB

Created: 12/11/2025

Locale: sdj

Institute of African Digital Humanities

Mbosi-TTS-Dataset

The dataset consists of paired audio and text data on Mbosi (mdw), a language spoken in Congo. The audio corpus consists of 2,575 clips read by one speaker t...

Task: TTS

Format: WAV, TSV

License: NOODL-1.0

Size: 644.39 MB

Created: 12/11/2025

Locale: mdw

Institute of African Digital Humanities

Beembe-TTS-Dataset

The dataset consists of paired audio and text data on Beembe (beq), a language spoken in Congo. The audio corpus consists of 6,933 clips read by one speaker ...

Task: TTS

Format: WAV, TSV

License: NOODL-1.0

Size: 861.46 MB

Created: 12/11/2025

Locale: beq

Akylai

KyrgyzLLM-Bench: Kyrgyz LLM Evaluation Dataset

KyrgyzLLM-Bench is a comprehensive suite purpose-built to evaluate LLMs’ deep understanding and reasoning in Kyrgyz. It combines natively authored benchmarks...

Task: LLM

Format: PARQUET

License: mixed

Size: 87.20 MB

Created: 12/10/2025

Locale: ky

Institute of African Digital Humanities

Yaka-TTS-Dataset

Paired audio and text data on Yaka (also known as West Teke), a language spoken in Congo. The audio corpus consists of 7,648 clips read by one speaker for a ...

Task: TTS

Format: WAV, TSV

License: NOODL-1.0

Size: 1.26 GB

Created: 12/10/2025

Locale: iyx

Institute of African Digital Humanities

Akoose-ALCAM-MultimodalDataset

This dataset comprises a datasheet of Akoose (bss) lexical entries collected from the 'Western-Bakossi' subgroup. Each entry is accompanied by illustrative s...

Task: NLP

Format: MP3, TSV

License: NOODL-1.0

Size: 16.05 MB

Created: 12/10/2025

Locale: bss

Institute of African Digital Humanities

Kituba-TTS-Dataset

Paired audio and text data on Kituba (mkw), a language spoken in Congo. The audio corpus consists of 8,302 clips read by one speaker, totalling 350 min 11.98...

Task: TTS

Format: WAV, TSV

License: NOODL-1.0

Size: 553.28 MB

Created: 12/10/2025

Locale: mkw

Taraaz

Multilingual Humanitarian Response Eval (MHRE)

This multilingual humanitarian dataset contains 655 annotated datapoints evaluating AI chatbot safety and quality in migration and asylum scenarios across fo...

Task: LLM

Format: CSV

License: CC-BY-NC-SA-4.0

Size: 2.15 MB

Created: 12/8/2025

Locale: mul

Forum for Language Initiatives

Hussain Faizy Indus Kohistani Corpus

The Indus Kohistani corpus contains around 500k tokens of folktales, stories, poetry, biographies, and conversational texts, all transcribed with a consiste...

Task: NLP

Format: TXT

License: CC-BY-SA-4.0

Size: 14.70 MB

Created: 12/8/2025

Locale: mvy

Institute of African Digital Humanities

Ewondo-Yanda-ALCAM-MultimodalDataset

This dataset comprises a datasheet of Ewondo (ewo) lexical entries collected in the Yanda subgroup. Each entry is accompanied by illustrative sentences, word...

Task: NLP

Format: MP3, TSV

License: NOODL-1.0

Size: 18.09 MB

Created: 12/7/2025

Locale: ewo

Open Home Foundation

Flemishguy 1.0

Text-to-speech dataset for Dutch, male speaker, approximately 1 hour of read speech.

Task: TTS

Format: FLAC

License: CC0-1.0

Size: 73.69 MB

Created: 12/6/2025

Locale: nl-BE

Open Home Foundation

Faber 1.0

Text-to-speech dataset for Brazilian Portuguese, male speaker, approximately 1.5 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 30.98 MB

Created: 12/6/2025

Locale: pt-BR

Open Home Foundation

Darkman 1.0

Text-to-speech dataset for Polish, male speaker, approximately 2 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 40.42 MB

Created: 12/6/2025

Locale: pl-PL

Open Home Foundation

Jeff 1.0

Text-to-speech dataset for Brazilian Portuguese, male speaker, approximately 1.5 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 90.74 MB

Created: 12/6/2025

Locale: pt-BR

Open Home Foundation

Mihai 1.0

Text-to-speech dataset for Romanian, male speaker, approximately 2 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 66.31 MB

Created: 12/6/2025

Locale: ro-RO

Open Home Foundation

Denis 1.0

Text-to-speech dataset for Russian, male speaker, approximately 2 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 104.52 MB

Created: 12/6/2025

Locale: ru-RU

Open Home Foundation

Dmitri 1.0

Text-to-speech dataset for Russian, male speaker, approximately 2 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 96.63 MB

Created: 12/6/2025

Locale: ru-RU

Open Home Foundation

Lili 1.0

Text-to-speech dataset for Slovak, female speaker, approximately 2 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 72.38 MB

Created: 12/6/2025

Locale: sk-SK

Open Home Foundation

Ronnie 1.0

Text-to-speech dataset for Dutch, male speaker, approximately 2 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 106.23 MB

Created: 12/6/2025

Locale: nl-NL

Open Home Foundation

Cadu 1.0

Text-to-speech dataset for Brazilian Portuguese, male speaker, approximately 1.5 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 30.98 MB

Created: 12/6/2025

Locale: pt-BR

Open Home Foundation

Tugão 1.0

Text-to-speech dataset for Portuguese, male speaker, approximately 1.5 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 61.84 MB

Created: 12/6/2025

Locale: pt-PT

Open Home Foundation

Gosia 1.0

Text-to-speech dataset for Polish, female speaker, approximately 2 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 39.75 MB

Created: 12/6/2025

Locale: pl-PL

Open Home Foundation

Pim 1.0

Text-to-speech dataset for Dutch, male speaker, approximately 2 hours of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 108.08 MB

Created: 12/6/2025

Locale: nl-NL

Open Home Foundation

Nathalie 1.0

Text-to-speech dataset for Dutch, female speaker, approximately 1 hour of read speech.

Task: TTS

Format: WEBM

License: CC0-1.0

Size: 21.87 MB

Created: 12/6/2025

Locale: nl-BE