Datasets

Effect AI

Effect AI Scripted Speech 1.0 - English

A collection of scripted spoken phrases in English.

License: CC0-1.0
Locale: en
Task: TTS
Format: CSV, MP3
Size: 663.45 MB

Amara Hub

DataTrust Africa: Speech Corpus of Public Radio Recordings from Northern Uganda

This is an open-access corpus of short clips of public radio content from Mega 100 FM, Q FM, Radio Pacis and Radio Rupiny in Northern Uganda. The online corpus currently holds over 350 clips of English-language recordings, and we hope to add finely annotated transcripts to them. The dataset is intended for NLP research and other non-commercial use. Upcoming Amara Hub datasets to look out for include public radio recordings in other languages spoken in the region, such as Acholi, Lango, Lugbara and Akaramajong.

License: NOODL-1.0
Locale: en-US
Task: NLP
Format: MP3
Size: 179.82 MB

Digital Divide Data

Khmer ASR Cultural Dataset

37.62 hours of manually curated speech-text pairs recorded by native speakers of Khmer on Cambodian cultural topics. Recordings average 8.29 seconds in length, with a standard deviation of 3.87. Speaker metadata (gender, age group, and origin city) is provided.

License: CC-BY-SA-4.0
Locale: khm
Task: ASR
Format: WAV
Size: 12.59 GB

PT Pancaran Semangat Jaya

Corpus of Panjebar Semangat Javanese-Language Magazine

This dataset is a TXT-format collection of popular articles published over three years in the Javanese-language weekly magazine Panjebar Semangat. It brings together widely read, non-academic Javanese texts that reflect contemporary themes and language use.

License: CC-BY-SA-4.0
Locale: Jav
Task: OTH
Format: TXT
Size: 4.31 MB

Center za jezikovne vire in tehnologije Univerze v Ljubljani

SI-NLI

SI-NLI (Slovene Natural Language Inference Dataset) contains 5,937 human-created Slovene sentence pairs (premise and hypothesis) that are manually labeled with the labels "entailment", "contradiction", and "neutral". We created the dataset using sentences that appear in the Slovenian reference corpus ccKres (http://hdl.handle.net/11356/1034). Annotators were tasked to modify the hypothesis in a candidate pair in a way that reflects one of the labels. The dataset is balanced since the annotators created three modifications (entailment, contradiction, neutral) for each candidate sentence pair. The dataset is split into train, validation, and test sets, with sizes of 4,392, 547, and 998. We used Slovenian pre-trained language models to create splits, thereby ensuring that difficult and easy instances are evenly distributed in all three subsets. The dataset is released in a tabular TSV format. The README.txt file contains a description of the attributes. Only the hypothesis and premise are given in the test set (i.e. no annotations) since SI-NLI is integrated into the Slovene evaluation framework SloBENCH (https://slobench.cjvt.si/). If you use the dataset to train your models, please consider submitting the test set predictions to SloBENCH to get the evaluation score and see how it compares to others.

License: CC-BY-NC-SA-4.0
Locale: sl
Task: NLU
Format: TSV
Size: 392.44 KB
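
As a quick sanity check on the split sizes quoted in the SI-NLI description, the following Python sketch sums them and computes each split's share of the 5,937 sentence pairs. The dictionary keys are illustrative labels only and are not assumed to match file names in the dataset.

```python
# Split sizes as quoted in the SI-NLI description.
splits = {"train": 4392, "validation": 547, "test": 998}

# The three splits should add up to the stated 5,937 sentence pairs.
total = sum(splits.values())

# Fraction of the whole dataset that falls into each split.
shares = {name: size / total for name, size in splits.items()}

for name, share in shares.items():
    print(f"{name}: {splits[name]} pairs ({share:.1%})")
print(f"total: {total} pairs")
```

The sizes do sum to 5,937, which corresponds to roughly a 74/9/17 split across train, validation, and test.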

Pro Svizra Rumantscha

Vallader Newspaper Corpus

6.2 million tokens in the Vallader variety of Romansh from the daily newspaper “La Quotidiana”.

License: CC0-1.0
Locale: rm-vallader
Task: OTH
Format: TSV
Size: 18.71 MB

Kaleem Art Press

Multilingual Religious Parallel Corpus (Kaleem Art Press)

This dataset is a multilingual parallel-sentence corpus containing 6,465 aligned sentence units with approximately 0.98 million words, curated from the Kaleem Art Press archives. It includes parallel religious text data in Arabic, Urdu, Saraiki (standard and dialectal), Punjabi (Shahmukhi), and English, supporting research in machine translation, comparative linguistics, digital humanities, and low-resource language studies.

License: CC-BY-SA-4.0
Locale: mul
Task: MT
Format: CSV
Size: 2.27 MB

Sindh Line Publishers

Sindh Line Publishers

The corpus contains 1.029 million tokens from Sindh Line, a Sindhi-language newspaper, drawn from issues published in 2024-2025. The text comprises the complete newspaper content, including headlines, editorials, finance news, and advertisements. The newspaper is published daily in Karachi, Pakistan.

License: CC-BY-SA-4.0
Locale: snd
Task: NLP
Format: TXT
Size: 2.22 MB

Institute of African Digital Humanities

Spoken-Congolese-French-Dataset

The dataset consists of paired audio and text resources on spoken French from the Republic of the Congo. The audio files were extracted from longer recordings of semi-guided interviews conducted in Brazzaville, and orthographic transcriptions were added. The long audio recordings were clipped automatically, together with their corresponding TRJS transcription files. The dataset comprises ten folders containing audio files and ten audio/text mapping files.

License: NOODL-1.0
Locale: fr-CG
Task: NLP
Format: MP3, WAV, TSV
Size: 3.44 GB

Institute of African Digital Humanities

Ewondo_Mbida-Mbani_ALCAM-MultimodalDataset

This dataset comprises a datasheet of Ewondo (Ewo) lexical entries collected in the speech area known as Mbida Mbani. Each entry is accompanied by illustrative sentences, word-by-word glosses and French translations. The resource is enriched with aligned audio recordings, making it suitable for linguistic analysis and speech technology development.

License: NOODL-1.0
Locale: ewo
Task: NLP
Format: MP3, TSV
Size: 19.25 MB

Balochi Academy

Balochi Academy Text Corpus

This corpus contains approximately 500k tokens of text from novels, poetry, articles, riddles, and proverbs, covering both literary and traditional genres. It is intended for linguistic research, NLP tasks (e.g., language modeling and text analysis), and cultural documentation.

License: CC-BY-NC-SA-4.0
Locale: bgn
Task: NLP
Format: TXT
Size: 1.88 MB

Institute of African Digital Humanities

Mada Narratives

This dataset contains 17 transcribed oral narratives in Mada (mxu), a language belonging to the Afro-Asiatic family that is spoken in Cameroon. The texts, derived from audio recordings of oral literature, reflect natural spoken discourse. This dataset can be used for language modelling, text analysis and other natural language processing (NLP) tasks.

License: NOODL-1.0
Locale: mxu
Task: NLP
Format: TXT
Size: 65.04 KB

Pro Svizra Rumantscha

Surmiran Newspaper Corpus

2.9 million tokens in the Surmiran variety of Romansh from the daily newspaper “La Quotidiana”.

License: CC0-1.0
Locale: rm-surmiran
Task: OTH
Format: TSV
Size: 11.89 MB

Maseno Centre for Applied Artificial Intelligence (MCAAI)

DhoNam: Dholuo Speech dataset

DhoNam: Dholuo Speech dataset is a speech corpus designed to supercharge Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of Kenya’s major indigenous languages. This dataset contains native-speaker audio recordings collected through a platform where users read aloud a displayed sentence. The dataset includes the audio recordings and the corresponding prompt/sentence that was read.

License: NOODL-1.0
Locale: Luo
Task: ASR
Format: WEBM
Size: 2.49 GB

Amnesia

Archivo de la Comisionada María de los Ángeles Guzmán García (COTAI Nuevo León / InfoNL)

This archive preserves the institutional and academic record of Dr. María de los Ángeles Guzmán García's tenure as Commissioner of the Comisión de Transparencia y Acceso a la Información del Estado de Nuevo León (COTAI / INFONL) during the period 2018-2025. The dataset consolidates the documentary legacy of one of the most technical and academic profiles in the Sistema Nacional de Transparencia (National Transparency System). A Doctor of Constitutional Law from the Universidad Complutense de Madrid, Commissioner Guzmán García

License: CC-BY-4.0
Locale: es-MX
Task: NLP
Format: ZIP, PDF, CSV, XLSX
Size: 866.15 MB

IFIT

Improving AI Conflict Resolution Capacities: A Prompts-Based Evaluation

Findings of a follow-up study assessing how leading free-access LLMs perform when given added instructions directing them to apply basic conflict-resolution practices.

License: CC-BY-4.0
Locale: mul
Task: NLP
Format: CSV, PDF
Size: 1.46 MB

IFIT

AI on the Frontline: Evaluating Large Language Models in Real-World Conflict Resolution

Findings of an experimental evaluation to assess how leading free-access LLMs perform when asked to respond to realistic conflict resolution scenarios.

License: CC-BY-4.0
Locale: en-US
Task: NLP
Format: CSV, PDF
Size: 2.36 MB

Institute of African Digital Humanities

Teke-Laali-TTS-Dataset

The dataset contains paired audio and text resources for Teke-Laali, a Bantu language spoken in the Congo. It consists of seven folders containing a total of 9,069 audio clips from raw audio recordings, with a total duration of 7:01:50.126 (HH:MM:SS.mmm). Additionally, there are seven audio/text mapping files containing a total of 9,069 lines. The dataset is suitable for TTS tasks.

License: NOODL-1.0
Locale: lli
Task: TTS
Format: WAV, TSV
Size: 635.61 MB
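
The Teke-Laali description gives its total duration in HH:MM:SS.mmm form. A minimal Python sketch for converting such a stamp to seconds and deriving the mean clip length from the stated clip count might look like this. The helper name `duration_to_seconds` is an illustrative assumption and not part of any dataset tooling.

```python
def duration_to_seconds(stamp):
    """Convert an 'H:MM:SS.mmm' duration string to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Figures quoted in the Teke-Laali description.
total_seconds = duration_to_seconds("7:01:50.126")
clip_count = 9069

# Average clip length implied by the two figures above.
mean_clip_seconds = total_seconds / clip_count

print(f"{total_seconds:.3f} s in total")
print(f"{mean_clip_seconds:.2f} s per clip on average")
```

The quoted figures work out to about 25,310 seconds of audio, or a little under 2.8 seconds per clip.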

Institute of African Digital Humanities

Bati-MultiDialectalASR-Dataset

This dataset contains paired audio and text resources for three Bati dialects (Kelleng, Mbougue, and Nyambat), which belong to the Yambasa group of Bantu languages found in Cameroon. It contains 13,344 audio clips totalling 6 hours, 8 minutes and 12.286 seconds, along with 44 audio/text mapping files totalling 13,346 lines. Due to its cross-dialectal nature, the dataset is suitable for multilingual automatic speech recognition tasks.

License: NOODL-1.0
Locale: btc
Task: ASR
Format: WAV, TSV
Size: 3.27 GB

Common Voice

Mozilla Common Voice Text Language Identification dataset

A dataset for text-based language identification of 19 million sentences from over 300 languages taken from Mozilla Common Voice scripted (v23) and spontaneous (v1) speech projects.

License: CC0-1.0
Locale: mul
Task: NLP
Format: TSV
Size: 950.41 MB

The University of Melbourne

Central Kurdish TTS dataset 1.0

This dataset contains high-quality single-speaker audio recordings in Central Kurdish (ckb), intended for building Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems. The dataset comprises 2 hours and 18 minutes of aligned audio and text data.

License: CC-BY-4.0
Locale: ckb
Task: TTS
Format: WAV
Size: 293.45 MB

Institute of African Digital Humanities

Laari-TTS-Dataset

The dataset contains audio and text resources on Laari, a Bantu language spoken in the Congo. The resources, which are suitable for TTS tasks and possibly ASR tasks, consist of the following:
- 6,311 audio clips totalling 241 minutes and 44.97 seconds;
- an audio mapping file with 5,321 lines, each beginning with the name of an audio file, followed by a tab and then the corresponding text excerpt;
- two raw audio files totalling 120 minutes and 54.90 seconds;
- two long audio files with their original, non-split transcription files, for a total duration of 120 minutes and 41.90 seconds.

License: NOODL-1.0
Locale: ldi
Task: ASR
Format: WAV, TRJS, TSV
Size: 568.26 MB

Institute of African Digital Humanities

Bomitaba-TTS-Dataset

The dataset comprises three components: audio clips, an audio mapping file, and raw audio of Bomitaba, a Bantu language spoken in the Congo. Each audio clip is paired with its corresponding transcription. There are 2,613 transcribed audio clips, totalling 182 minutes and 4 seconds. There are two raw audio files totalling 121 minutes and 14.24 seconds. The audio mapping file contains 2,610 lines. Each line begins with the name of an audio file, followed by a tab, then the corresponding text exce

License: NOODL-1.0
Locale: zmx
Task: TTS
Format: WAV, TSV
Size: 1.00 GB

Institute of African Digital Humanities

Suundi-TTS-Dataset

The dataset consists of paired audio and text data on Suundi (sdj), a language spoken in Congo. The audio corpus consists of 4,187 clips read by one speaker totaling 188 min 22.68 sec. The dataset also contains a mapping file of audio and text with 4,185 lines. Each line begins with the name of an audio file, followed by a tab and then the corresponding text excerpt. This dataset is suitable for TTS tasks.

License: NOODL-1.0
Locale: sdj
Task: TTS
Format: WAV, TSV
Size: 240.50 MB
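
Several entries above, including Suundi, describe a mapping file in which each line holds an audio file name, a tab, and the corresponding text excerpt. The Python sketch below shows one way such a file could be read, assuming UTF-8 text and exactly that layout. The function names and sample file names are hypothetical placeholders, not taken from the datasets.

```python
def parse_mapping(lines):
    """Return (audio_file_name, text) pairs from tab-separated mapping lines."""
    pairs = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the text are preserved.
        name, sep, text = line.partition("\t")
        if not sep:
            continue  # skip lines that lack a tab separator
        pairs.append((name.strip(), text.strip()))
    return pairs


def unmapped_clips(clip_names, pairs):
    """List clip names that have no entry in the mapping."""
    mapped = {name for name, _ in pairs}
    return sorted(set(clip_names) - mapped)


# Placeholder lines standing in for a real mapping file.
sample_lines = [
    "clip_0001.wav\tfirst placeholder sentence\n",
    "clip_0002.wav\tsecond placeholder sentence\n",
    "\n",
]
pairs = parse_mapping(sample_lines)
missing = unmapped_clips(
    ["clip_0001.wav", "clip_0002.wav", "clip_0003.wav"], pairs
)
print(pairs)
print(missing)
```

With a real file, `parse_mapping(open(path, encoding="utf-8"))` would serve the same purpose. Because the Suundi description reports 4,187 clips but 4,185 mapping lines, a check along the lines of `unmapped_clips` may help identify clips without a transcript before training.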