Bangor Patagonia Welsh-Spanish Corpus
License:
GPL-3.0
Steward:
MDC Community Concierge
Task: ASR
Release Date: 3/4/2026
Format: MP3, CHA, TSV
Size: 988.02 MB
Description
The Patagonia Welsh-Spanish corpus contains around 195,000 words: 78% Welsh, 17% Spanish, 5% indeterminate (i.e. the relevant word appears in the dictionaries of both main languages). The dataset includes the audio recordings, transcriptions and glosses in CHAT format, and word-level analyses of the transcriptions in .tsv files.
Specifics
Licensing
GNU General Public License v3.0 or later (GPL-3.0-or-later)
https://spdx.org/licenses/GPL-3.0-or-later.html
Considerations
Restrictions/Special Constraints
The corpus is being made available under the GNU General Public License version 3 or later (http://gnu.org/copyleft/gpl.html). Researchers who use it are requested to subscribe to the TalkBank Code of Ethics (http://talkbank.org/share/ethics.html) and acknowledge the corpus as set out below. We request that a copy of any publications that make use of this corpus be sent to us at the address specified in the Metadata section.
Forbidden Usage
N/A
Processes
Intended Use
Research on bilingualism and language contact, code-switching ASR
Metadata
The Patagonia Welsh-Spanish corpus contains around 195,000 words: 78% Welsh, 17% Spanish, 5% indeterminate (i.e. the relevant word appears in the dictionaries of both main languages).
Overview
This corpus, along with the Bangor Siarad and Bangor Miami corpora, was assembled by the former ESRC Centre for Research on Bilingualism in Theory and Practice at Bangor University by the following researchers: Prof Margaret Deuchar, Dr Diana Carter, Dr Peredur Davies, Dr Kevin Donnelly, Dr Jon Herring, Dr María del Carmen Parafita Couto, Dr Jonathan Stammers, Fraibet Aveledo, Marika Fusser, Lowri Jones, Siân Lloyd-Williams, Myfyr Prys, Elen Robert.
For detailed information about the dataset, see Patagonia_doc.pdf.
Citation
Please refer to the corpus as the Bangor Patagonia corpus, and provide a link to the website through which you accessed it: http://www.talkbank.org, http://bangortalk.org.uk, or https://datacollective.mozillafoundation.org. We request that a copy of any publications that make use of this corpus be sent to us at:
Margaret Deuchar
ESRC Centre for Research on Bilingualism
Bangor University
Bangor
Gwynedd LL57 2DG
United Kingdom
Please also cite:
Deuchar, M., P. Davies, J. Herring, M. Parafita Couto, and D. Carter (2014). Building bilingual corpora. In: E. M. Thomas and I. Mennen (Eds.), Advances in the Study of Bilingualism, pp. 93–111. Bristol: Multilingual Matters.
Processing
The audio files were transcribed using CHAT conventions, and have both a gloss tier and a translation tier.
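As an illustration of how such tiers can be read, the following minimal sketch groups the lines of a CHAT transcript into utterances. It relies only on general CHAT conventions (header lines start with `@`, main speaker lines with `*`, dependent tiers with `%`, and continuation lines with a tab). The function name `parse_chat`, the sample text, and the tier names `gls` and `eng` are hypothetical; check Patagonia_doc.pdf for the tier names actually used in this corpus.

```python
def parse_chat(text):
    """Group CHAT lines into utterances, each with its dependent tiers."""
    utterances = []
    current = None   # utterance currently being built
    last = None      # what a tab-indented continuation line extends
    for line in text.splitlines():
        if not line.strip() or line.startswith("@"):
            last = None  # header lines are skipped in this sketch
            continue
        if line.startswith("\t"):
            # Continuation of the previous main line or dependent tier.
            if current is not None and last == "main":
                current["text"] += " " + line.strip()
            elif current is not None and last is not None:
                current["tiers"][last] += " " + line.strip()
            continue
        if line.startswith("*"):
            speaker, _, rest = line[1:].partition(":")
            current = {"speaker": speaker, "text": rest.strip(), "tiers": {}}
            utterances.append(current)
            last = "main"
        elif line.startswith("%") and current is not None:
            name, _, rest = line[1:].partition(":")
            current["tiers"][name] = rest.strip()
            last = name
    return utterances


# Hypothetical sample; speaker codes and tier names are placeholders.
sample = (
    "@Begin\n"
    "*ABC:\tbore da .\n"
    "%gls:\tmorning good .\n"
    "%eng:\tgood morning .\n"
    "*DEF:\tsí .\n"
    "@End\n"
)
utterances = parse_chat(sample)
```

Each returned item holds the speaker code, the transcribed text, and a dictionary of whatever dependent tiers follow that utterance, so the sketch does not need to know the tier names in advance.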
Format
The dataset contains the following directories:
audio/ contains the mp3 files
chat/ contains CHAT files with transcriptions and glosses of the audio
word_level_tsvs/ contains the transcriptions and glosses corresponding to each audio/CHAT file, one word per line
metadata/ contains speaker metadata and the speaker questionnaires
The Patagonia_doc.pdf file contains the complete documentation for the corpus.
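For ASR work it is usually convenient to line up each recording with its transcript and word-level file. The sketch below does this under two assumptions that are not stated in this description and should be verified against the downloaded data: that the files for one recording share a base filename across the three directories, and that they use the lowercase extensions `.mp3`, `.cha` and `.tsv`. The function names `index_corpus` and `complete_recordings` are hypothetical.

```python
from pathlib import Path

# Directory names come from the dataset description; the glob patterns
# (lowercase extensions) are an assumption.
CORPUS_DIRS = {
    "audio": "*.mp3",
    "chat": "*.cha",
    "word_level_tsvs": "*.tsv",
}


def index_corpus(root):
    """Map each base filename to the files found for it in each directory."""
    root = Path(root)
    index = {}
    for dirname, pattern in CORPUS_DIRS.items():
        for path in sorted((root / dirname).glob(pattern)):
            # Assumes matching files share the same stem across directories.
            index.setdefault(path.stem, {})[dirname] = path
    return index


def complete_recordings(index):
    """Keep only recordings that have audio, CHAT and TSV files."""
    return {
        stem: files
        for stem, files in index.items()
        if len(files) == len(CORPUS_DIRS)
    }
```

Calling `complete_recordings(index_corpus(path_to_dataset))` returns a dictionary keyed by base filename. Anything present in `index_corpus` but missing from the filtered result points to a file that lacks a counterpart, or to a naming convention that differs from the assumption above.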
Funding
The researchers gratefully acknowledge the support of the Arts and Humanities Research Council (AHRC), the Economic and Social Research Council (ESRC), the Higher Education Funding Council for Wales (HEFCW) and the Welsh Government.
For further information, please contact Peredur Webb-Davies.
More details about the corpora can be found in:
The book chapter "Building bilingual corpora" (Margaret Deuchar, Peredur Davies, Jon Russell Herring, M. Carmen Parafita Couto and Diana Carter). In: E Môn Thomas and I Mennen (Eds.), Advances in the Study of Bilingualism (2014) pp. 93–110. Multilingual Matters.
The paper "Using constraint grammar in the Bangor Autoglosser to disambiguate multilingual spoken text" (Kevin Donnelly and Margaret Deuchar). In: Constraint Grammar Applications: Proceedings of the NODALIDA 2011 Workshop, Riga, Latvia. NEALT Proceedings Series, Tartu.