Bangor Miami Spanish-English Corpus

License: GFDL-1.3

Steward: MDC Community Concierge

Task: ASR

Release Date: 3/4/2026

Format: MP3, CHA, TSV

Size: 1.12 GB


Description

The Bangor Miami Corpus of Spanish-English bilingual speech contains around 240,000 words over 35 hours of recorded audio conversations. The dataset includes the audio recordings, transcriptions and glosses in CHAT format, and word-level analyses of the transcriptions in .tsv files.

Specifics

Licensing

GNU Free Documentation License v1.3 only (GFDL-1.3)

https://spdx.org/licenses/GFDL-1.3-or-later.html

Considerations

Restrictions/Special Constraints

The corpus is being made available under the GNU General Public License version 3 or later (http://gnu.org/copyleft/gpl.html). Researchers who use it are requested to subscribe to the TalkBank Code of Ethics (http://talkbank.org/share/ethics.html) and acknowledge the corpus as set out below. We request that a copy of any publications that make use of this corpus be sent to us at the address specified in the Metadata section. In line with the GPLv3 licence, note that permission is NOT granted to use any of the material on this website to train an AI large language model UNLESS all the training data for that LLM is made publicly available.

Forbidden Usage

N/A

Metadata

The Bangor Miami Spanish-English corpus contains around 35 hours of recorded speech and 240,000 words of transcriptions.

Overview

This corpus, along with the Bangor Siarad and Bangor Patagonia corpora, was assembled by the former ESRC Centre for Research on Bilingualism in Theory and Practice at Bangor University by the following researchers: Prof Margaret Deuchar, Dr Diana Carter, Dr Peredur Davies, Dr Kevin Donnelly, Dr Jon Herring, Dr María del Carmen Parafita Couto, Dr Jonathan Stammers, Fraibet Aveledo, Marika Fusser, Lowri Jones, Siân Lloyd-Williams, Myfyr Prys, Elen Robert.

For detailed information about the dataset, see Miami_doc.pdf.

Citation

Please refer to the corpus as the Bangor Miami corpus, and provide a link to the website by which you accessed the corpus (http://www.talkbank.org, http://bangortalk.org.uk, or https://datacollective.mozillafoundation.org). We request that a copy of any publications that make use of this corpus be sent to us at:

Margaret Deuchar, ESRC Centre for Research on Bilingualism, Bangor University, Bangor, Gwynedd LL57 2DG, United Kingdom (m.deuchar@bangor.ac.uk)

Please also cite:

Deuchar, M., P. Davies, J. Herring, M. Parafita Couto, and D. Carter (2014). Building bilingual corpora. In: E. M. Thomas and I. Mennen (Eds.), Advances in the Study of Bilingualism, pp. 93–111. Bristol: Multilingual Matters.

Processing

The audio files were transcribed using CHAT conventions, and have both a gloss tier and a translation tier.
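As a rough illustration of how the tiers can be read, the sketch below groups each speaker line with the dependent tiers that follow it. It relies only on the general CHAT layout (headers start with "@", speaker lines with "*", dependent tiers with "%", and continuation lines with a tab). The function name parse_chat, the speaker code MAR, and the tier codes gls and eng in the sample are placeholders chosen for the example, not names confirmed by this dataset; check the corpus documentation for the tier codes actually used.

```python
def parse_chat(text):
    """Group CHAT speaker lines with the dependent tiers that follow them."""
    utterances = []
    current = None   # utterance currently being filled
    last_key = None  # field that a tab-indented continuation line extends
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if raw.startswith("\t"):
            # Continuation of the previous speaker line or dependent tier.
            if current is not None and last_key == "text":
                current["text"] += " " + raw.strip()
            elif current is not None and last_key is not None:
                current["tiers"][last_key] += " " + raw.strip()
        elif raw.startswith("*"):
            speaker, _, content = raw[1:].partition(":")
            current = {"speaker": speaker, "text": content.strip(), "tiers": {}}
            utterances.append(current)
            last_key = "text"
        elif raw.startswith("%") and current is not None:
            code, _, content = raw[1:].partition(":")
            current["tiers"][code] = content.strip()
            last_key = code
        else:
            # Header lines (starting with "@") and anything else are skipped.
            last_key = None
    return utterances


# Invented sample; the speaker and tier codes are hypothetical placeholders.
SAMPLE = (
    "@Begin\n"
    "*MAR:\thello cómo estás ?\n"
    "%gls:\thello how be.2S ?\n"
    "%eng:\thello how are you ?\n"
    "@End\n"
)

for utt in parse_chat(SAMPLE):
    print(utt["speaker"], utt["text"], utt["tiers"])
```

Keeping the tiers in a dictionary keyed by whatever code appears after "%" means the sketch does not need to know the tier names in advance.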

Format

The dataset contains the following directories:

  • audio/ contains the mp3 files

  • chat/ contains CHAT files with transcriptions and glosses of the audio

  • word_level_tsvs/ contains the transcriptions and glosses corresponding to each audio/CHAT file, one word per line.

  • metadata/ contains speaker metadata and the speaker questionnaires.

The Miami_doc.pdf file contains the complete documentation for the corpus.
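The directory layout above lends itself to matching each recording with its transcript and word-level table. The following is a minimal sketch under two assumptions that the listing does not state outright: that corresponding files share the same base name across audio/, chat/, and word_level_tsvs/, and that they use the lowercase extensions .mp3, .cha, and .tsv. The helper names pair_recordings and read_tsv_rows are made up for this example, and the TSV rows are returned as plain lists because the column names are not given here.

```python
import csv
from pathlib import Path


def pair_recordings(root):
    """Map each audio file's base name to its CHAT and TSV counterparts."""
    root = Path(root)

    def by_stem(subdir, suffix):
        # Assumed: one file per recording, named <stem><suffix>.
        return {p.stem: p for p in sorted((root / subdir).glob(f"*{suffix}"))}

    audio = by_stem("audio", ".mp3")
    chat = by_stem("chat", ".cha")
    tsv = by_stem("word_level_tsvs", ".tsv")
    return {
        stem: {"audio": path, "chat": chat.get(stem), "tsv": tsv.get(stem)}
        for stem, path in audio.items()
    }


def read_tsv_rows(path):
    """Read a tab-separated file into a list of rows (one word per row)."""
    # QUOTE_NONE keeps quotation marks in transcribed speech untouched.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))
```

A recording with no matching transcript or table gets None in that slot, which makes gaps easy to spot before any further processing.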

Funding

The researchers gratefully acknowledge the support of the Arts and Humanities Research Council (AHRC), the Economic and Social Research Council (ESRC), the Higher Education Funding Council for Wales (HEFCW) and the Welsh Government.

For further information, please contact Peredur Webb-Davies.

More details about the corpora can be found in:

  • The book chapter "Building bilingual corpora" (Margaret Deuchar, Peredur Davies, Jon Russell Herring, M. Carmen Parafita Couto and Diana Carter). In: E Môn Thomas and I Mennen (Eds.), Advances in the Study of Bilingualism (2014) pp. 93–110. Multilingual Matters.

  • The paper "Using constraint grammar in the Bangor Autoglosser to disambiguate multilingual spoken text" (Kevin Donnelly and Margaret Deuchar). In: Constraint Grammar Applications: Proceedings of the NODALIDA 2011 Workshop, Riga, Latvia. NEALT Proceedings Series, Tartu.