Datasets

Search results for “english”

Saraiki-English Parallel Corpus

English–Saraiki parallel corpus: 51,447 aligned sentence pairs (~0.89M words), translated by Kaleem Art Press for MT and Saraiki NLP research.

License: CC-BY-NC-4.0

Locale: mul

Task: MT

Format: CSV

Size: 1.92 MB

Effect AI

Effect AI Scripted Speech 1.0 - English

A collection of scripted spoken phrases in English.

License: CC0-1.0

Locale: en

Task: TTS

Format: CSV, MP3

Size: 663.45 MB

MEDIAMEN

English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)

This parallel sentence corpus contains 30,405 aligned sentence pairs, approximately 0.62 million tokens in total, curated from the archival materials of Mediamen (Advertising Agency). The sentences were professionally translated from English into Punjabi (Shahmukhi) and are intended to support machine translation, linguistic research, and Punjabi language technology development, particularly for real-world and contemporary language use.

License: CC-BY-NC-4.0

Locale: en-PK, pnb

Task: MT

Format: CSV

Size: 1.08 MB

Common Voice

Common Voice Scripted Speech 24.0 - English

A collection of scripted spoken phrases in English.

License: CC0-1.0

Locale: en

Task: ASR

Format: MP3

Size: 87.74 GB

MDC Community Concierge

Bangor Miami Spanish-English Corpus

Spanish-English bilingual speech corpus with 35 hours of recorded audio and 240,000 words.

License: GPL-3.0

Locale: es-US, en-US

Task: ASR

Format: MP3, CHA, TSV

Size: 1.12 GB

Common Voice

Common Voice Spontaneous Speech 1.0 - English

A collection of spontaneous spoken phrases in English.

License: CC0-1.0

Locale: en

Task: ASR

Format: MP3

Size: 128.69 MB

Common Voice

Common Voice v24 English - en-AU subset for Everything Open 2026

Common Voice v24 English filtered on the `accent` field for Australian-related accents.

License: CC0-1.0

Locale: en-AU

Task: ASR

Format: CSV, MP3

Size: 1.92 GB

Common Voice

Common Voice Scripted Speech 24.0 - Nigerian Pidgin English

A collection of scripted spoken phrases in Nigerian Pidgin English.

License: CC0-1.0

Locale: pcm

Task: ASR

Format: MP3

Size: 294.05 MB

MDC Community Concierge

Bangor Siarad Welsh-English Corpus

Welsh-English bilingual speech corpus with 40 hours of recorded audio and transcriptions totaling 450,000 words.

License: GPL-3.0

Locale: cym

Task: ASR

Format: MP3, CHA, TSV

Size: 2.13 GB

Open Home Foundation

Kathleen 1.0

Text to speech dataset for English, female speaker, approximately 1 hour of read speech.

License: CC0-1.0

Locale: en-US

Task: TTS

Format: FLAC

Size: 211.96 MB

Collaborative Action For Research & Development (CARD)

IBT Torwali Wordlist

The IBT Torwali Wordlist contains approximately 20,000 unique entries in Torwali (ISO 639-3: trw), an under-documented Indo-Aryan language spoken in northern Pakistan. The dataset comprises standardized lexical entries covering core vocabulary, function words, and culturally salient terms, with consistent orthography and normalization suitable for linguistic and computational use. Entries are aligned with English and Urdu glosses and include part-of-speech tags.

License: CC-BY-SA-4.0

Locale: trw

Task: NLP

Format: CSV

Size: 312.87 KB

Mozilla Data Collective

Sermon-Malaysian-English

7 minutes of Malaysian-accented English speech.

License: CC-BY-NC-4.0

Locale: en-MY

Task: ASR

Format: MP4, TXT, SRT

Size: 6.63 MB

Open Home Foundation

Joe 1.0

Text to speech dataset for English, male speaker, approximately 1 hour of read speech.

License: CC0-1.0

Locale: en-US

Task: TTS

Format: WEBM

Size: 75.78 MB

Kaleem Art Press

Multilingual Religious Parallel Corpus (Kaleem Art Press)

This dataset is a multilingual parallel sentences corpus containing 6,465 aligned sentence units with approximately 0.98 million words, curated from Kaleem Art Press archives. It includes parallel religious text data in Arabic, Urdu, Saraiki (standard and dialectal), Punjabi (Shahmukhi), and English, supporting research in machine translation, comparative linguistics, digital humanities, and low-resource language studies.

License: CC-BY-SA-4.0

Locale: mul

Task: MT

Format: CSV

Size: 2.27 MB

MDC Curators

Finance Sentences - North American Spanish

A public domain corpus of approximately 80,000 sentences (1.3M tokens) of North American Spanish in the finance domain.

License: CC0-1.0

Locale: es-US

Task: NLP

Format: TSV, JSON

Size: 18.35 MB

Community

TTS Central Javanese

This dataset consists of audio recordings and textual data in Central Javanese (Semarang dialect) including Indonesian and English code-switching.

License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: WEBM, TSV

Size: 440.11 MB

Balochi Academy

Balochi Academy Text Corpus

This corpus contains approximately 500k tokens of text from novels, poetry, articles, riddles, and proverbs, covering both literary and traditional genres. It is intended for linguistic research, NLP tasks (e.g., language modeling and text analysis), and cultural documentation.

License: CC-BY-NC-SA-4.0

Locale: bgn

Task: NLP

Format: TXT

Size: 1.88 MB

NaijaVoices (Lanfrica Labs)

Atyap Afwan_: Preserving Tyap Through Community-Driven Speech Data

This dataset contains 98 recordings (≈1.16 hours) of everyday Tyap speech from 10 community speakers, each paired with detailed transcripts and English translations.

License: CC-BY-NC-SA-4.0

Locale: kcg

Task: NLP

Format: WAV, TXT

Size: 251.51 MB

Common Voice

Common Voice Scripted Speech 24.0 - Mina

A collection of scripted spoken phrases in Mina.

License: CC0-1.0

Locale: gej

Task: ASR

Format: MP3

Size: 215.45 MB

Balochistan Educational and Cultural Organization

Brahui Research Work Corpus

This corpus consists of research works, academic theses, and scholarly papers, comprising approximately 185,000 tokens. It covers a range of academic topics and formal registers, reflecting standardized writing practices and disciplinary conventions. The corpus is intended to support linguistic analysis, discourse studies, and computational research on academic and research-oriented text.

License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 1.13 MB

Community

Podcast Homostoria (Indonesia)

This dataset features discussions on modern media (including film, podcasts, and social media) and its connection to local customs and traditions. The conversations are primarily in Indonesian, with frequent code-switching into English and Javanese.

License: CC-BY-SA-4.0

Locale: id

Task: ASR

Format: MP3

Size: 302.97 MB

Collaborative Action For Research & Development (CARD)

Gawri (گاؤری) Magazine Corpus

The Gawri (گاؤری) Magazine Corpus is a curated collection of monthly Gawri magazine text (~67,724 tokens) that supports research and language technology.

License: CC-BY-NC-4.0

Locale: gwc

Task: NLP

Format: TXT

Size: 146.71 KB

Universitas Gadjah Mada

Jember Javanese Spontaneous Speech Corpus

A 10-hour spoken dataset of native Javanese speakers from Jember, East Java, representing the Jember dialect and Pandhalungan variety in natural speech.

License: CC-BY-NC-SA-4.0

Locale: jav

Task: ASR

Format: MP3, TSV

Size: 271.65 MB

Baloch Publishers Multan

Baloch Publishers Saraiki Literature Corpus

This corpus is a collection of one million tokens of Saraiki-language text. The data was produced under Baloch Publishers over the last ten years. The corpus contains works of literature, including dictionaries, short stories, novels, fiction, non-fiction, and travelogues. The data is shared with the approval of the authors. It aims to support linguistic research, language technology development, and cultural preservation.

License: CC-BY-NC-4.0

Locale: skr

Task: NLP

Format: TXT

Size: 2.04 MB