Datasets

Archivo de la Comisionada María de los Ángeles Guzmán García (COTAI Nuevo León / InfoNL)

This archive preserves the institutional and academic memory of Dr. María de los Ángeles Guzmán García's tenure as Commissioner of the Comisión de Transparencia y Acceso a la Información del Estado de Nuevo León (COTAI / INFONL) from 2018 to 2025. The dataset consolidates the documentary legacy of one of the most technical and academic profiles in the Sistema Nacional de Transparencia. A Doctor of Constitutional Law from the Universidad Complutense de Madrid, Commissioner Guzmán García

License: CC-BY-4.0

Locale: es-MX

Task: NLP

Format: ZIP, PDF, CSV, XLSX

Size: 866.15 MB

IFIT

Improving AI Conflict Resolution Capacities: A Prompts-Based Evaluation

Findings of a follow-up study assessing how leading free-access LLMs perform when their prompts include instructions to apply basic conflict-resolution practices.

License: CC-BY-4.0

Locale: mul

Task: NLP

Format: CSV, PDF

Size: 1.46 MB

IFIT

AI on the Frontline: Evaluating Large Language Models in Real-World Conflict Resolution

Findings of an experimental evaluation of how leading free-access LLMs perform when asked to respond to realistic conflict resolution scenarios.

License: CC-BY-4.0

Locale: en-US

Task: NLP

Format: CSV, PDF

Size: 2.36 MB

Institute of African Digital Humanities

Teke-Laali-TTS-Dataset

The dataset contains paired audio and text resources for Teke-Laali, a Bantu language spoken in the Congo. It consists of seven folders containing a total of 9,069 audio clips from raw audio recordings, with a total duration of 7:01:50.126 (HH:MM:SS.mmm). Additionally, there are seven audio/text mapping files containing a total of 9,069 lines. The dataset is suitable for TTS tasks.

License: NOODL-1.0

Locale: lli

Task: TTS

Format: WAV, TSV

Size: 635.61 MB
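The Teke-Laali entry above reports its total duration in HH:MM:SS.mmm form. As a rough illustration only, the sketch below converts that figure to seconds and derives an average clip length from the stated clip count. The helper name `hms_to_seconds` is hypothetical and not part of the dataset, and the average is a back-of-the-envelope figure, not a value published with the dataset.

```python
def hms_to_seconds(stamp: str) -> float:
    """Convert an 'H:MM:SS.mmm' duration string to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Figures quoted in the Teke-Laali-TTS-Dataset description.
total_seconds = hms_to_seconds("7:01:50.126")
clip_count = 9069

# Average clip length, assuming the total duration covers exactly these clips.
mean_clip_seconds = total_seconds / clip_count
print(f"{total_seconds:.3f} s total, {mean_clip_seconds:.2f} s per clip on average")
```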

Institute of African Digital Humanities

Bati-MultiDialectalASR-Dataset

This dataset contains paired audio and text resources for three Bati dialects (Kelleng, Mbougue, and Nyambat), which belong to the Yambasa group of Bantu languages found in Cameroon. It contains 13,344 audio clips totalling 6 hours, 8 minutes and 12.286 seconds, along with 44 audio/text mapping files totalling 13,346 lines. Due to its cross-dialectal nature, the dataset is suitable for multilingual automatic speech recognition tasks.

License: NOODL-1.0

Locale: btc

Task: ASR

Format: WAV, TSV

Size: 3.27 GB

Common Voice

Mozilla Common Voice Text Language Identification dataset

A text-based language identification dataset of 19 million sentences in over 300 languages, taken from the Mozilla Common Voice scripted (v23) and spontaneous (v1) speech projects.

License: CC0-1.0

Locale: mul

Task: NLP

Format: TSV

Size: 950.41 MB

The University of Melbourne

Central Kurdish TTS dataset 1.0

This dataset contains high-quality single-speaker audio recordings in Central Kurdish (ckb), intended for building Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems. The dataset comprises 2 hours and 18 minutes of aligned audio and text data.

License: CC-BY-4.0

Locale: ckb

Task: TTS

Format: WAV

Size: 293.45 MB

Institute of African Digital Humanities

Laari-TTS-Dataset

The dataset contains audio and text resources on Laari, a Bantu language spoken in the Congo. The resources, which are suitable for TTS tasks and possibly ASR tasks, consist of the following:
- 6,311 audio clips totalling 241 minutes and 44.97 seconds;
- an audio mapping file with 5,321 lines, each beginning with the name of an audio file, followed by a tab and then the corresponding text excerpt;
- two raw audio files totalling 120 minutes and 54.90 seconds;
- two long audio files with their original, non-split transcription files, for a total duration of 120 minutes and 41.90 seconds.

License: NOODL-1.0

Locale: ldi

Task: ASR

Format: WAV, TRJS, TSV

Size: 568.26 MB

Institute of African Digital Humanities

Bomitaba-TTS-Dataset

The dataset comprises three components: audio clips, an audio mapping file, and raw audio of Bomitaba, a Bantu language spoken in the Congo. Each audio clip is paired with its corresponding transcription. There are 2,613 transcribed audio clips, totalling 182 minutes and 4 seconds. There are two raw audio files totalling 121 minutes and 14.24 seconds. The audio mapping file contains 2,610 lines. Each line begins with the name of an audio file, followed by a tab, then the corresponding text excerpt.

License: NOODL-1.0

Locale: zmx

Task: TTS

Format: WAV, TSV

Size: 1.00 GB

Institute of African Digital Humanities

Suundi-TTS-Dataset

The dataset consists of paired audio and text data on Suundi (sdj), a language spoken in Congo. The audio corpus consists of 4,187 clips read by one speaker totaling 188 min 22.68 sec. The dataset also contains a mapping file of audio and text with 4,185 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt. This dataset is suitable for TTS tasks.

License: NOODL-1.0

Locale: sdj

Task: TTS

Format: WAV, TSV

Size: 240.50 MB
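Several of the TTS entries, including Suundi-TTS-Dataset above, describe a mapping file in which each line holds an audio file name, a tab, and the corresponding text excerpt. The sketch below shows one way such a file might be read, assuming UTF-8 text with no header row. The function name `read_mapping` and the sample file names and sentences are hypothetical; the real files may differ.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_mapping(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (audio_file_name, text) pairs from tab-separated mapping lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so any tabs inside the text survive.
        name, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"no tab separator in line: {line!r}")
        yield name, text


# Hypothetical sample content standing in for a real mapping file.
sample = io.StringIO(
    "clip_0001.wav\tfirst sentence\n"
    "clip_0002.wav\tsecond sentence\n"
)
pairs = list(read_mapping(sample))
print(pairs)
```

With a real file, passing `open(path, encoding="utf-8")` in place of `sample` should work the same way. Comparing `len(pairs)` with the number of audio files on disk is a reasonable first check, since some entries list different clip and line counts (for Suundi, 4,187 clips against 4,185 mapping lines).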

Institute of African Digital Humanities

Mbosi-TTS-Dataset

The dataset consists of paired audio and text data on Mbosi (mdw), a language spoken in Congo. The audio corpus consists of 2,575 clips read by one speaker totaling 275 min 48.35 sec. The dataset also contains a mapping file of audio and text with 2,597 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt. This dataset is suitable for TTS tasks.

License: NOODL-1.0

Locale: mdw

Task: TTS

Format: WAV, TSV

Size: 644.39 MB

Institute of African Digital Humanities

Beembe-TTS-Dataset

The dataset consists of paired audio and text data on Beembe (beq), a language spoken in Congo. The audio corpus consists of 6,933 clips read by one speaker totaling 275 min 48.35 sec. The dataset also contains a mapping file of audio and text with 4,422 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt. This dataset is suitable for TTS tasks.

License: NOODL-1.0

Locale: beq

Task: TTS

Format: WAV, TSV

Size: 861.46 MB

Akylai

KyrgyzLLM-Bench: Kyrgyz LLM Evaluation Dataset

KyrgyzLLM-Bench is a comprehensive suite purpose-built to evaluate LLMs’ deep understanding and reasoning in Kyrgyz. It combines natively authored benchmarks with carefully translated and post-edited international tasks to provide broad and culturally grounded coverage.

License: mixed

Locale: ky

Task: LLM

Format: PARQUET

Size: 87.20 MB

Institute of African Digital Humanities

Yaka-TTS-Dataset

Paired audio and text data on Yaka (also known as West Teke), a language spoken in Congo. The audio corpus consists of 7,648 clips read by one speaker for a total duration of 344 min 40.48 sec. The dataset also contains a mapping file of audio and text with 7,648 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt. This dataset is suitable for TTS tasks.

License: NOODL-1.0

Locale: iyx

Task: TTS

Format: WAV, TSV

Size: 1.26 GB

Institute of African Digital Humanities

Akoose-ALCAM-MultimodalDataset

This dataset comprises a datasheet of Akoose (bss) lexical entries collected from the 'Western-Bakossi' subgroup. Each entry is accompanied by illustrative sentences, French translations and a word-by-word breakdown of the Akoose sentences, as well as an equivalent breakdown in English. The resource is enriched with aligned audio recordings, making it ideal for linguistic analysis and the development of speech technology.

License: NOODL-1.0

Locale: bss

Task: NLP

Format: MP3, TSV

Size: 16.05 MB

Institute of African Digital Humanities

Kituba-TTS-Dataset

Paired audio and text data on Kituba (mkw), a language spoken in Congo. The audio corpus consists of 8,302 clips read by one speaker, totalling 350 min 11.98 sec. The dataset also contains a mapping file of audio and text with 8,173 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt. This dataset is suitable for TTS tasks.

License: NOODL-1.0

Locale: mkw

Task: TTS

Format: WAV, TSV

Size: 553.28 MB

Taraaz

Multilingual Humanitarian Response Eval (MHRE)

This multilingual humanitarian dataset contains 655 annotated datapoints evaluating AI chatbot safety and quality in migration and asylum scenarios across four language pairs: English paired with Farsi (Iranian Persian), Arabic, Kurdish (Sorani), and Pashto. Built from 120 expert prompts, it includes outputs from GPT-4o, Gemini 2.5 Flash, and Mistral Small. The dataset provides both human evaluations from Respond Crisis Translation native-speaker evaluators and LLM-as-judge assessments (Gemini 2.5 Flash).

License: CC-BY-NC-SA-4.0

Locale: mul

Task: LLM

Format: CSV

Size: 2.15 MB

Forum for Language Initiatives

Hussain Faizy Indus Kohistani Corpus

The Indus Kohistani corpus contains around 500k tokens of folktales, stories, poetry, biographies, and conversational texts, all transcribed with a consistent community orthography. Reviewed by native speakers, the corpus offers a representative snapshot of the language’s vocabulary and grammar for linguistic and computational research.

License: CC-BY-SA-4.0

Locale: mvy

Task: NLP

Format: TXT

Size: 14.70 MB

Institute of African Digital Humanities

Ewondo-Yanda-ALCAM-MultimodalDataset

This dataset comprises a datasheet of Ewondo (ewo) lexical entries collected in the Yanda subgroup. Each entry is accompanied by illustrative sentences, word-by-word glosses and French translations. The resource is enriched with aligned audio recordings, making it suitable for linguistic analysis and speech technology development.

License: NOODL-1.0

Locale: ewo

Task: NLP

Format: MP3, TSV

Size: 18.09 MB

Open Home Foundation

Flemishguy 1.0

Text-to-speech dataset for Dutch, male speaker, approximately 1 hour of read speech.

License: CC0-1.0

Locale: nl-BE

Task: TTS

Format: FLAC

Size: 73.69 MB

Open Home Foundation

Faber 1.0

Text-to-speech dataset for Brazilian Portuguese, male speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: pt-BR

Task: TTS

Format: WEBM

Size: 30.98 MB

Open Home Foundation

Darkman 1.0

Text-to-speech dataset for Polish, male speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: pl-PL

Task: TTS

Format: WEBM

Size: 40.42 MB

Open Home Foundation

Jeff 1.0

Text-to-speech dataset for Brazilian Portuguese, male speaker, approximately 1.5 hours of read speech.

License: CC0-1.0

Locale: pt-BR

Task: TTS

Format: WEBM

Size: 90.74 MB

Open Home Foundation

Mihai 1.0

Text-to-speech dataset for Romanian, male speaker, approximately 2 hours of read speech.

License: CC0-1.0

Locale: ro-RO

Task: TTS

Format: WEBM

Size: 66.31 MB