Datasets

Search results for “english”
Kaleem Art Press

Saraiki-English Parallel Corpus

English–Saraiki parallel corpus: 51,447 aligned sentence pairs (~0.89M words), translated by Kaleem Art Press for MT and Saraiki NLP research.
License: CC-BY-NC-4.0
Locale: mul
Task: MT
Format: CSV
Size: 1.92 MB

Effect AI

Effect AI Scripted Speech 1.0 - English

A collection of scripted spoken phrases in English.
License: CC0-1.0
Locale: en
Task: TTS
Format: CSV, MP3
Size: 663.45 MB

MEDIAMEN

English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)

This parallel sentence corpus contains 30,405 aligned sentence pairs totaling approximately 0.62 million tokens, curated from the archival materials of Mediamen (Advertising Agency). The sentences were professionally translated from English into Punjabi (Shahmukhi) and are intended to support machine translation, linguistic research, and Punjabi language technology development, particularly for real-world and contemporary language use.
License: CC-BY-NC-4.0
Locale: en-PK, pnb
Task: MT
Format: CSV
Size: 1.08 MB

Common Voice

Common Voice Scripted Speech 24.0 - English

A collection of scripted spoken phrases in English.
License: CC0-1.0
Locale: en
Task: ASR
Format: MP3
Size: 87.74 GB

MDC Community Concierge

Bangor Miami Spanish-English Corpus

Spanish-English bilingual speech corpus with 35 hours of recorded audio and 240,000 words.
License: GPL-3.0
Locale: es-US, en-US
Task: ASR
Format: MP3, CHA, TSV
Size: 1.12 GB

Common Voice

Common Voice Spontaneous Speech 1.0 - English

A collection of spontaneous spoken phrases in English.
License: CC0-1.0
Locale: en
Task: ASR
Format: MP3
Size: 128.69 MB

Common Voice

Common Voice v24 English - en-AU subset for Everything Open 2026

Common Voice v24 English filtered on the `accent` field for Australian-related accents.
License: CC0-1.0
Locale: en-AU
Task: ASR
Format: CSV, MP3
Size: 1.92 GB
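
As a rough illustration of the kind of filtering described for this subset, the sketch below selects rows whose `accent` value mentions Australia. The sample rows, the `path` and `sentence` column names, the tab delimiter, and the accent label strings are all assumptions made for the example; the actual files in the dataset may be laid out differently.

```python
import csv
import io

# Hypothetical sample rows; real column names and accent labels may differ.
SAMPLE_TSV = (
    "path\tsentence\taccent\n"
    "clip_001.mp3\tGood morning.\tAustralian English\n"
    "clip_002.mp3\tHow are you?\tUnited States English\n"
    "clip_003.mp3\tSee you soon.\t\n"
    "clip_004.mp3\tNo worries.\tAustralian English,New Zealand English\n"
)


def filter_by_accent(tsv_text, keyword="australia"):
    """Return rows whose `accent` value contains `keyword` (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Treat a missing or empty accent value as "no match".
    return [row for row in reader if keyword in (row.get("accent") or "").lower()]


au_rows = filter_by_accent(SAMPLE_TSV)
au_paths = [row["path"] for row in au_rows]
print(au_paths)
```

A substring match is used here so that rows carrying several accent labels are still picked up; a stricter filter could compare against an explicit list of labels instead.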

Common Voice

Common Voice Scripted Speech 24.0 - Nigerian Pidgin English

A collection of scripted spoken phrases in Nigerian Pidgin English.
License: CC0-1.0
Locale: pcm
Task: ASR
Format: MP3
Size: 294.05 MB

MDC Community Concierge

Bangor Siarad Welsh-English Corpus

Welsh-English bilingual speech corpus with 40 hours of recorded audio and transcriptions totaling 450,000 words.
License: GPL-3.0
Locale: cym
Task: ASR
Format: MP3, CHA, TSV
Size: 2.13 GB

Open Home Foundation

Kathleen 1.0

Text to speech dataset for English, female speaker, approximately 1 hour of read speech.
License: CC0-1.0
Locale: en-US
Task: TTS
Format: FLAC
Size: 211.96 MB

Collaborative Action For Research & Development (CARD)

IBT Torwali Wordlist

The IBT Torwali Wordlist contains approximately 20,000 unique entries in Torwali (ISO 639-3: trw), an under-documented Indo-Aryan language spoken in northern Pakistan. The dataset comprises standardized lexical entries covering core vocabulary, function words, and culturally salient terms, with consistent orthography and normalization suitable for linguistic and computational use. Entries are aligned with English and Urdu glosses and include part-of-speech tags.
License: CC-BY-SA-4.0
Locale: trw
Task: NLP
Format: CSV
Size: 312.87 KB

Mozilla Data Collective

Sermon-Malaysian-English

Seven minutes of Malaysian-accented English speech.
License: CC-BY-NC-4.0
Locale: en-MY
Task: ASR
Format: MP4, TXT, SRT
Size: 6.63 MB

Open Home Foundation

Joe 1.0

Text to speech dataset for English, male speaker, approximately 1 hour of read speech.
License: CC0-1.0
Locale: en-US
Task: TTS
Format: WEBM
Size: 75.78 MB

Kaleem Art Press

Multilingual Religious Parallel Corpus (Kaleem Art Press)

This dataset is a multilingual parallel sentence corpus containing 6,465 aligned sentence units (approximately 0.98 million words), curated from the Kaleem Art Press archives. It includes parallel religious text in Arabic, Urdu, Saraiki (standard and dialectal), Punjabi (Shahmukhi), and English, supporting research in machine translation, comparative linguistics, digital humanities, and low-resource language studies.
License: CC-BY-SA-4.0
Locale: mul
Task: MT
Format: CSV
Size: 2.27 MB

MDC Curators

Finance Sentences - North American Spanish

A public domain corpus of approximately 80,000 sentences (1.3M tokens) of North American Spanish in the finance domain.
License: CC0-1.0
Locale: es-US
Task: NLP
Format: TSV, JSON
Size: 18.35 MB

Community

TTS Central Javanese

This dataset consists of audio recordings and textual data in Central Javanese (Semarang dialect) including Indonesian and English code-switching.
License: CC-BY-SA-4.0
Locale: jav
Task: TTS
Format: WEBM, TSV
Size: 440.11 MB

Balochi Academy

Balochi Academy Text Corpus

This corpus contains approximately 500k tokens of text from novels, poetry, articles, riddles, and proverbs, covering both literary and traditional genres. It is intended for linguistic research, NLP tasks (e.g., language modeling and text analysis), and cultural documentation.
License: CC-BY-NC-SA-4.0
Locale: bgn
Task: NLP
Format: TXT
Size: 1.88 MB

NaijaVoices (Lanfrica Labs)

Atyap Afwan: Preserving Tyap Through Community-Driven Speech Data

This dataset contains 98 recordings (≈1.16 hours) of everyday Tyap speech from 10 community speakers, each paired with detailed transcripts and English translations.
License: CC-BY-NC-SA-4.0
Locale: kcg
Task: NLP
Format: WAV, TXT
Size: 251.51 MB

Common Voice

Common Voice Scripted Speech 24.0 - Mina

A collection of scripted spoken phrases in Mina.
License: CC0-1.0
Locale: gej
Task: ASR
Format: MP3
Size: 215.45 MB

Balochistan Educational and Cultural Organization

Brahui Research Work Corpus

This corpus consists of research works, academic theses, and scholarly papers, comprising approximately 185,000 tokens. It covers a range of academic topics and formal registers, reflecting standardized writing practices and disciplinary conventions. The corpus is intended to support linguistic analysis, discourse studies, and computational research on academic and research-oriented text.
License: CC-BY-NC-SA-4.0
Locale: brh
Task: NLP
Format: TXT
Size: 1.13 MB

Community

Podcast Homostoria (Indonesia)

This dataset features discussions on modern media—including film, podcasts, and social media—and its connection to local customs and traditions. The conversations are primarily in Indonesian, with frequent code-switching between English and Javanese.
License: CC-BY-SA-4.0
Locale: id
Task: ASR
Format: MP3
Size: 302.97 MB

Collaborative Action For Research & Development (CARD)

Gawri (گاؤری) Magazine Corpus

The Gawri (گاؤری) Magazine Corpus is a curated collection of monthly Gawri magazine text (~67,724 tokens) that supports research and language technology.
License: CC-BY-NC-4.0
Locale: gwc
Task: NLP
Format: TXT
Size: 146.71 KB

Universitas Gadjah Mada

Jember Javanese Spontaneous Speech Corpus

A 10-hour spoken dataset of native Javanese speakers from Jember, East Java, representing the Jember dialect and Pandhalungan variety in natural speech.
License: CC-BY-NC-SA-4.0
Locale: jav
Task: ASR
Format: MP3, TSV
Size: 271.65 MB

Baloch Publishers Multan

Baloch Publishers Saraiki Literature Corpus

This corpus is a collection of one million tokens of Saraiki-language text. The data was produced by Baloch Publishers over the last ten years. The corpus contains works of literature, including dictionaries, short stories, novels, fiction, non-fiction, and travelogues. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.
License: CC-BY-NC-4.0
Locale: skr
Task: NLP
Format: TXT
Size: 2.04 MB