Heroes English-Spanish Dubbed Movie Speech Corpus

License:

CC-BY-SA-4.0

Steward:

Community

Task: NLP

Release Date: 3/23/2026

Format: wav, csv, txt

Size: 1.68 GB

Description

The Heroes corpus contains parallel English and Spanish speech segments from the TV series Heroes. It comprises 7,000 single-speaker speech segments extracted from the original and Spanish-dubbed versions of 21 episodes. Audio segments are accompanied by subtitle transcriptions and word-level prosodic and paralinguistic information. Each episode directory contains word-level and segment-level information for the whole episode, along with the parallel samples, which are stored under the segments_eng and segments_spa subdirectories. Each sample is stored as a WAV audio file, a text file, and a CSV file containing word timing information and word-level paralinguistic and prosodic features (speaker ID, mean f0, mean intensity).
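The layout described above can be read with a few lines of standard-library Python. The following is a minimal sketch, not part of the corpus tooling: it assumes that the three files of a sample share a file stem and use the .wav, .txt and .csv extensions, and it makes no assumption about the CSV column layout. The function names collect_samples and load_sample are hypothetical.

```python
import csv
import wave
from pathlib import Path


def collect_samples(segment_dir):
    """Group the files of one segments directory into samples keyed by stem.

    Assumption: the .wav, .txt and .csv files of a sample share a file stem.
    """
    samples = {}
    for path in sorted(Path(segment_dir).iterdir()):
        suffix = path.suffix.lower()
        if suffix in {".wav", ".txt", ".csv"}:
            samples.setdefault(path.stem, {})[suffix.lstrip(".")] = path
    # Keep only samples for which all three files are present.
    return {stem: files for stem, files in samples.items() if len(files) == 3}


def load_sample(files):
    """Read the transcript, word-level rows and audio duration of one sample."""
    text = files["txt"].read_text(encoding="utf-8").strip()
    with files["csv"].open(newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))  # raw rows; column layout not assumed
    with wave.open(str(files["wav"]), "rb") as audio:
        duration = audio.getnframes() / audio.getframerate()
    return {"text": text, "word_rows": rows, "duration_s": duration}
```

Calling collect_samples on a segments_eng or segments_spa directory returns one entry per complete sample, and load_sample turns an entry into its transcript, word-level rows and duration in seconds. How English and Spanish samples are matched to each other depends on the corpus file naming, which is not specified here, so pairing is left out of the sketch.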

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is intended solely for research and scientific purposes.

Forbidden Usage

No forbidden usages.

Processes

Ethical Review

The Heroes corpus consists of short, isolated speech segments (averaging about 2.5 seconds each) extracted from dubbed TV episodes and annotated with prosodic and linguistic features for computational linguistics research. The segments are decontextualized and cannot be used to reconstruct the original audiovisual content; they serve a fundamentally different, transformative purpose from the original work. The corpus was created at Universitat Pompeu Fabra and published through its institutional repository under a CC BY-SA 4.0 license. The methodology and fair use justification are described in the peer-reviewed paper "Corpora compilation for prosody-informed speech processing" (Öktem, Farrús & Bonafonte, 2021, Language Resources & Evaluation, Springer). Under EU Directive 2019/790 (Article 3), text and data mining for scientific research purposes by research organizations is permitted, which applies here. The dataset is made available for research purposes only. Because the extracted segments are short, fragmented, prosodically annotated, and represent a small fraction of the source material, there is no risk of substitution for the original copyrighted work.

Intended Use

Research on dubbing, automatic dubbing, and spoken machine translation.

Metadata

This dataset (referred to as the "Heroes Corpus") contains short audio and text excerpts (2.44 seconds on average) from the TV series "Heroes" (Copyright Universal Media Studios, 2006-2007, 2007-2008, 2008-2009). It was compiled and is used only for research purposes. Creation of this dataset was partially financed by the UPF DTIC-Maria de Maeztu Strategic Program.

This dataset was created with the automated toolkit movie2parallelDB. The authors provide it as is and cannot be held responsible for possible errors.

For more information and citation:

Öktem A, Farrús M, Bonafonte A. Bilingual Prosodic Dataset Compilation for Spoken Language Translation. IberSPEECH 2018; 2018 Nov 21-23; Barcelona, Spain.

Öktem A, Farrús M, Bonafonte A. Corpora compilation for prosody-informed speech processing. Language Resources & Evaluation. 2021;55:925–946. https://doi.org/10.1007/s10579-021-09556-2