Datasets

Mbosi-TTS-Dataset

The dataset consists of paired audio and text data on Mbosi (mdw), a language spoken in Congo. The audio corpus consists of 2,575 clips read by one speaker totaling 275 min 48.35 sec. The dataset also contains a mapping file of audio and text with 2,597 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt. This dataset is suitable for TTS tasks.

License: NOODL-1.0

Locale: mdw

Task: TTS

Format: WAV, TSV

Size: 644.39 MB
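
The mapping file layout described above (audio file name, a tab, then the text excerpt) can be read with a few lines of Python. This is a minimal sketch, not the publisher's tooling: it assumes UTF-8 text with no header row, and the sample file names and excerpts are hypothetical placeholders. Because the listing reports 2,597 mapping lines against 2,575 clips, the sketch also includes a helper that lists mapping entries with no matching audio file.

```python
from typing import Iterable, List, Tuple


def parse_mapping(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each mapping line into (audio file name, text excerpt)."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so any later tabs stay in the text.
        name, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"no tab separator in line: {line!r}")
        pairs.append((name, text))
    return pairs


def unmatched_entries(pairs: List[Tuple[str, str]],
                      audio_names: Iterable[str]) -> List[str]:
    """Return mapping file names that have no corresponding audio file."""
    available = set(audio_names)
    return [name for name, _ in pairs if name not in available]


# Hypothetical sample lines; real file names and text will differ.
sample = ["clip_0001.wav\tfirst excerpt\n", "clip_0002.wav\tsecond excerpt\n"]
pairs = parse_mapping(sample)
missing = unmatched_entries(pairs, ["clip_0001.wav"])
```

With a real download, the lines would come from opening the TSV file and the audio names from listing the clip directory; both paths depend on how the archive is unpacked.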

Institute of African Digital Humanities

Beembe-TTS-Dataset

The dataset consists of paired audio and text data on Beembe (beq), a language spoken in Congo. The audio corpus consists of 6,933 clips read by one speaker totaling 275 min 48.35 sec. The dataset also contains a mapping file of audio and text with 4,422 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt. This dataset is suitable for TTS tasks.

License: NOODL-1.0

Locale: beq

Task: TTS

Format: WAV, TSV

Size: 861.46 MB

Akylai

KyrgyzLLM-Bench: Kyrgyz LLM Evaluation Dataset

KyrgyzLLM-Bench is a comprehensive suite purpose-built to evaluate LLMs’ deep understanding and reasoning in Kyrgyz. It combines natively authored benchmarks with carefully translated and post-edited international tasks to provide broad and culturally grounded coverage.

License: mixed

Locale: ky

Task: LLM

Format: PARQUET

Size: 87.20 MB

Institute of African Digital Humanities

Yaka-TTS-Dataset

Paired audio and text data on Yaka (also known as West Teke), a language spoken in Congo. The audio corpus consists of 7,648 clips read by one speaker for a total duration of 344 min 40.48 sec. The dataset also contains a mapping file of audio and text with 7,648 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt. This dataset is suitable for TTS tasks.

License: NOODL-1.0

Locale: iyx

Task: TTS

Format: WAV, TSV

Size: 1.26 GB
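
As a quick sanity check on a TTS corpus, the clip count and total duration reported above give the average clip length. The sketch below only restates that arithmetic; the function names are illustrative and not part of any dataset tooling.

```python
def total_seconds(minutes: int, seconds: float) -> float:
    """Convert a 'M min S sec' duration to seconds."""
    return minutes * 60 + seconds


def mean_clip_seconds(clips: int, minutes: int, seconds: float) -> float:
    """Average clip length in seconds for a corpus of `clips` recordings."""
    return total_seconds(minutes, seconds) / clips


# Figures from the Yaka entry: 7,648 clips, 344 min 40.48 sec in total.
yaka_mean = mean_clip_seconds(7648, 344, 40.48)
print(f"{yaka_mean:.2f} s per clip")  # about 2.70 s
```

Clips averaging under three seconds suggest short phrases or single sentences, which may matter when choosing a TTS training recipe.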

Institute of African Digital Humanities

Akoose-ALCAM-MultimodalDataset

This dataset comprises a datasheet of Akoose (bss) lexical entries collected from the 'Western-Bakossi' subgroup. Each entry is accompanied by illustrative sentences, French translations and a word-by-word breakdown of the Akoose sentences, as well as an equivalent breakdown in English. The resource is enriched with aligned audio recordings, making it ideal for linguistic analysis and the development of speech technology.

License: NOODL-1.0

Locale: bss

Task: NLP

Format: MP3, TSV

Size: 16.05 MB

Institute of African Digital Humanities

Kituba-TTS-Dataset

Paired audio and text data on Kituba (mkw), a language spoken in Congo. The audio corpus consists of 8,302 clips read by one speaker, totalling 350 min 11.98 sec. The dataset also contains a mapping file of audio and text with 8,173 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt. This dataset is suitable for TTS tasks.

License: NOODL-1.0

Locale: mkw

Task: TTS

Format: WAV, TSV

Size: 553.28 MB

Taraaz

Multilingual Humanitarian Response Eval (MHRE)

This multilingual humanitarian dataset contains 655 annotated datapoints evaluating AI chatbot safety and quality in migration and asylum scenarios across four language pairs: English paired with Farsi (Iranian Persian), Arabic, Kurdish (Sorani), and Pashto. Built from 120 expert prompts, it includes outputs from GPT-4o, Gemini 2.5 Flash, and Mistral Small. The dataset provides both human evaluations from Respond Crisis Translation native-speaker evaluators and LLM-as-judge assessments (Gemini 2.5 Flash).

License: CC-BY-NC-SA-4.0

Locale: mul

Task: LLM

Format: CSV

Size: 2.15 MB

Forum for Language Initiatives

Hussain Faizy Indus Kohistani Corpus

The Indus Kohistani corpus contains around 500k tokens of folktales, stories, poetry, biographies, and conversational texts, all transcribed with a consistent community orthography. Reviewed by native speakers, the corpus offers a representative snapshot of the language’s vocabulary and grammar for linguistic and computational research.

License: CC-BY-SA-4.0

Locale: mvy

Task: NLP

Format: TXT

Size: 14.70 MB

Institute of African Digital Humanities

Ewondo-Yanda-ALCAM-MultimodalDataset

This dataset comprises a datasheet of Ewondo (ewo) lexical entries collected in the Yanda subgroup. Each entry is accompanied by illustrative sentences, word-by-word glosses and French translations. The resource is enriched with aligned audio recordings, making it suitable for linguistic analysis and speech technology development.

License: NOODL-1.0

Locale: ewo

Task: NLP

Format: MP3, TSV

Size: 18.09 MB

Open Home Foundation

Flemishguy 1.0

Text-to-speech dataset for Dutch, male speaker, approximately 1 hour of read speech.

License: CC0-1.0

Locale: nl-BE

Task: TTS

Format: FLAC

Size: 73.69 MB

Open Home Foundation

Faber 1.0

Text-to-speech dataset for Brazilian Portuguese, male speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: pt-BR

Task: TTS

Format: WEBM

Size: 30.98 MB

Open Home Foundation

Darkman 1.0

Text-to-speech dataset for Polish, male speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: pl-PL

Task: TTS

Format: WEBM

Size: 40.42 MB

Open Home Foundation

Jeff 1.0

Text-to-speech dataset for Brazilian Portuguese, male speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: pt-BR

Task: TTS

Format: WEBM

Size: 90.74 MB

Open Home Foundation

Mihai 1.0

Text-to-speech dataset for Romanian, male speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: ro-RO

Task: TTS

Format: WEBM

Size: 66.31 MB

Open Home Foundation

Denis 1.0

Text-to-speech dataset for Russian, male speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: ru-RU

Task: TTS

Format: WEBM

Size: 104.52 MB

Open Home Foundation

Dmitri 1.0

Text-to-speech dataset for Russian, male speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: ru-RU

Task: TTS

Format: WEBM

Size: 96.63 MB

Open Home Foundation

Lili 1.0

Text-to-speech dataset for Slovak, female speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: sk-SK

Task: TTS

Format: WEBM

Size: 72.38 MB

Open Home Foundation

Ronnie 1.0

Text-to-speech dataset for Dutch, male speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: nl-NL

Task: TTS

Format: WEBM

Size: 106.23 MB

Open Home Foundation

Cadu 1.0

Text-to-speech dataset for Brazilian Portuguese, male speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: pt-BR

Task: TTS

Format: WEBM

Size: 30.98 MB

Open Home Foundation

Tugão 1.0

Text-to-speech dataset for Portuguese, male speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: pt-PT

Task: TTS

Format: WEBM

Size: 61.84 MB

Open Home Foundation

Gosia 1.0

Text-to-speech dataset for Polish, female speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: pl-PL

Task: TTS

Format: WEBM

Size: 39.75 MB

Open Home Foundation

Pim 1.0

Text-to-speech dataset for Dutch, male speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: nl-NL

Task: TTS

Format: WEBM

Size: 108.08 MB

Open Home Foundation

Nathalie 1.0

Text-to-speech dataset for Dutch, female speaker, approximately 1 hour of read speech.

License: CC0-1.0

Locale: nl-BE

Task: TTS

Format: WEBM

Size: 21.87 MB

Open Home Foundation

Chitwan 1.0

Text-to-speech dataset for Nepali, male speaker, approximately 1 hour of read speech.

License: CC0-1.0

Locale: ne-NP

Task: TTS

Format: WEBM

Size: 61.68 MB