Bangor Patagonia Welsh-Spanish Corpus

License:

GPL-3.0

Steward:

MDC Community Concierge

Task: ASR

Release Date: 3/4/2026

Format: MP3, CHA, TSV

Size: 988.02 MB


Description

The Patagonia Welsh-Spanish corpus contains around 195,000 words: 78% Welsh, 17% Spanish, 5% indeterminate (i.e. the relevant word appears in the dictionaries of both main languages). The dataset includes the audios, transcriptions and glosses in CHAT format, and word-level analyses of the transcriptions in .tsv files.

Specifics

Licensing

GNU General Public License v3.0 or later (GPL-3.0-or-later)

https://spdx.org/licenses/GPL-3.0-or-later.html

Considerations

Restrictions/Special Constraints

The corpus is being made available under the GNU General Public License version 3 or later (http://gnu.org/copyleft/gpl.html). Researchers who use it are requested to subscribe to the TalkBank Code of Ethics (http://talkbank.org/share/ethics.html) and acknowledge the corpus as set out below. We request that a copy of any publications that make use of this corpus be sent to us at the address specified in the Metadata section.

Forbidden Usage

N/A

Processes

Intended Use

Research on bilingualism and language contact, code-switching ASR

Metadata

Overview

This corpus, along with the Bangor Siarad and Bangor Miami corpora, was assembled by the former ESRC Centre for Research on Bilingualism in Theory and Practice at Bangor University by the following researchers: Prof Margaret Deuchar, Dr Diana Carter, Dr Peredur Davies, Dr Kevin Donnelly, Dr Jon Herring, Dr María del Carmen Parafita Couto, Dr Jonathan Stammers, Fraibet Aveledo, Marika Fusser, Lowri Jones, Siân Lloyd-Williams, Myfyr Prys, Elen Robert.

For detailed information about the dataset, see Patagonia_doc.pdf.

Citation

Please refer to the corpus as the Bangor Patagonia corpus, and provide a link to the website through which you accessed it: http://www.talkbank.org, http://bangortalk.org.uk, or https://datacollective.mozillafoundation.org. We request that a copy of any publications that make use of this corpus be sent to us at:

Margaret Deuchar

ESRC Centre for Research on Bilingualism

Bangor University

Bangor

Gwynedd LL57 2DG

United Kingdom

m.deuchar@bangor.ac.uk

Please also cite:

Deuchar, M., P. Davies, J. Herring, M. Parafita Couto, and D. Carter (2014). Building bilingual corpora. In: E. M. Thomas and I. Mennen (Eds.), Advances in the Study of Bilingualism, pp. 93–111. Bristol: Multilingual Matters.

Processing

The audio files were transcribed using CHAT conventions, and have both a gloss tier and a translation tier.
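As a rough illustration of how such a transcript could be read, the sketch below relies only on general CHAT conventions: header lines start with "@", speaker (main) tiers with "*", dependent tiers such as glosses and translations with "%", and wrapped lines continue with a leading tab. The tier names "xgl" and "xtr" and the sample text are hypothetical placeholders, not taken from this corpus; check Patagonia_doc.pdf for the actual tier names.

```python
def parse_chat(text):
    """Group a CHAT transcript into utterances with their dependent tiers."""
    # CHAT wraps long lines; a continuation line begins with a tab.
    logical = []
    for raw in text.splitlines():
        if raw.startswith("\t") and logical:
            logical[-1] += " " + raw.strip()
        elif raw.strip():
            logical.append(raw.rstrip())

    utterances = []
    for line in logical:
        if line.startswith("*"):
            # Main tier: "*SPK:<tab>utterance"
            speaker, _, content = line[1:].partition(":")
            utterances.append(
                {"speaker": speaker, "text": content.strip(), "tiers": {}}
            )
        elif line.startswith("%") and utterances:
            # Dependent tier: attach it to the most recent utterance.
            name, _, content = line[1:].partition(":")
            utterances[-1]["tiers"][name] = content.strip()
        # Lines starting with "@" are file headers and are skipped here.
    return utterances


# Placeholder transcript; speaker codes and tier names are made up.
sample = (
    "@Begin\n"
    "*AAA:\thello world .\n"
    "%xgl:\tgloss one\n"
    "\tgloss two\n"
    "%xtr:\ta translation\n"
    "*BBB:\tsecond line .\n"
    "@End\n"
)

utts = parse_chat(sample)
for u in utts:
    print(u["speaker"], "|", u["text"], "|", u["tiers"])
```

Keeping the dependent tiers in a dictionary keyed by tier name means the same function works whatever the gloss and translation tiers are called. For serious work a dedicated CHAT reader is likely a better choice than this sketch.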

Format

The dataset contains the following directories:

  • audio/ contains the mp3 files

  • chat/ contains CHAT files with transcriptions and glosses of the audio

  • word_level_tsvs/ contains the transcriptions and glosses corresponding to each audio/CHAT file, one word per line.

  • metadata/ contains speaker metadata and the speaker questionnaires.

The Patagonia_doc.pdf file contains the complete documentation for the corpus.
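For ASR work it is usually convenient to line up each recording with its transcript and word-level file. The sketch below walks the directories listed above and groups files by base name. It assumes, without confirmation from the documentation, that the three files belonging to one recording share a file stem and use the extensions .mp3, .cha and .tsv; the function names are illustrative only.

```python
from pathlib import Path

# Directory name and expected extension for each kind of file.
LAYOUT = {
    "audio": ("audio", ".mp3"),
    "chat": ("chat", ".cha"),
    "tsv": ("word_level_tsvs", ".tsv"),
}


def index_recordings(root):
    """Map each file stem to the audio, CHAT and TSV paths found for it."""
    root = Path(root)
    index = {}
    for kind, (subdir, suffix) in LAYOUT.items():
        folder = root / subdir
        if not folder.is_dir():
            continue  # tolerate a partially downloaded dataset
        for path in sorted(folder.iterdir()):
            if path.suffix.lower() == suffix:
                index.setdefault(path.stem, {})[kind] = path
    return index


def complete_recordings(index):
    """Keep only the stems that have all three file kinds."""
    return {
        stem: files for stem, files in index.items() if len(files) == len(LAYOUT)
    }
```

A call such as complete_recordings(index_recordings("path/to/corpus")) would then return only the recordings for which audio, transcript and word-level analysis are all present. If the stems turn out not to match across directories, the grouping key would need to be adapted.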

Funding

The researchers gratefully acknowledge the support of the Arts and Humanities Research Council (AHRC), the Economic and Social Research Council (ESRC), the Higher Education Funding Council for Wales (HEFCW) and the Welsh Government.

For further information, please contact Peredur Webb-Davies.

More details about the corpora can be found in:

  • The book chapter "Building bilingual corpora" (Margaret Deuchar, Peredur Davies, Jon Russell Herring, M. Carmen Parafita Couto and Diana Carter). In: E Môn Thomas and I Mennen (Eds.), Advances in the Study of Bilingualism (2014) pp. 93–110. Multilingual Matters.

  • The paper "Using constraint grammar in the Bangor Autoglosser to disambiguate multilingual spoken text" (Kevin Donnelly and Margaret Deuchar). In: Constraint Grammar Applications: Proceedings of the NODALIDA 2011 Workshop, Riga, Latvia. NEALT Proceedings Series, Tartu.