Sentence translation difficulty in Spanish - BOUQuET
License:
CC-BY-SA-4.0
Steward:
MDC Curators
Task: MT
Release Date: 4/1/2026
Format: TSV
Size: 81.48 KB
Description
This dataset is a collection of Spanish sentences from the BOUQuET benchmark (1990 sentences in total) that have been annotated with sentence translation difficulty scores on a Likert scale. The annotators are speakers of three Indigenous languages of Mexico and Guatemala, and they scored the sentences as part of the work of translating the benchmark into their languages.
Specifics
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Considerations
Restrictions/Special Constraints
By agreeing you accept to share your contact information (email and username) with the repository authors.
- I agree not to re-host BOUQuET in places where it could be picked up by web crawlers
- If I evaluate using BOUQuET, I will ensure that its contents are not in the training data
Processes
Intended Use
This dataset is intended for evaluating models that assess sentence difficulty in translation and in language learning and teaching.
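One way such an evaluation could look is to compare a model's predicted difficulty with the annotators' Average score using a rank correlation. The sketch below is an assumption about how the data might be used, not a metric prescribed by the dataset authors. It implements Spearman's rank correlation in plain Python, and the score lists in the usage lines are made-up placeholders, not values from the dataset.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted, reference):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(predicted), average_ranks(reference))


# Placeholder numbers for illustration only (not taken from the dataset).
model_scores = [0.2, 0.9, 0.5, 0.7]
human_average = [1.0, 3.0, 1.67, 2.33]
rho = spearman(model_scores, human_average)
```

A rank correlation is used here because the Likert scores are ordinal; other choices (for example, classification accuracy against rounded scores) would be equally defensible.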
Metadata
Scores
We asked translators to score each sentence as they were translating, on a scale of:
Fácil - Easy (1)
Mediano - Medium (2)
Difícil - Hard (3)
They were asked to take the whole translation process into account, considering both the difficulty of the language structures and the difficulty of the concepts or terminology.
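If the scores need to be handled as labels rather than numbers, the scale above can be mapped as in the following minimal sketch. Whether the TSV stores the numeric values or the Spanish labels is not stated here, so the helper name `likert_to_int` and the label-to-number lookup are illustrative assumptions.

```python
import unicodedata

# Scale as listed above: Fácil (1), Mediano (2), Difícil (3).
_LIKERT = {"fácil": 1, "mediano": 2, "difícil": 3}
# Normalize the keys so composed and decomposed accents compare equal.
LIKERT = {unicodedata.normalize("NFC", k): v for k, v in _LIKERT.items()}


def likert_to_int(label):
    """Convert a Spanish difficulty label to its numeric score (hypothetical helper)."""
    key = unicodedata.normalize("NFC", label).strip().casefold()
    return LIKERT[key]  # raises KeyError for labels outside the scale
```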
Annotators
Dificultad [azz]: Speaker of Highland Puebla Nahuatl (azz), variety from Tetela de Ocampo, bilingual in Spanish. Secondary school education (pre-bachiler).
Dificultad [cux]: Speakers of Cuicatec (cux), variety from Santos Reyes Pápalo, bilingual in Spanish. University education.
Dificultad [cak]: Speaker of Kaqchikel (cak), bilingual in Spanish. University education.
Columns:
Sent-ID: The sentence ID
Lang-ID: The language ID, in this case spa_Latn (Spanish in Latin script)
Domain: The domain of the sentence
Source sentence: The source sentence in Spanish (may be a translation)
Dificultad [azz]: Difficulty scores in Highland Puebla Nahuatl
Dificultad [cux]: Difficulty scores in Cuicatec from Santos Reyes Pápalo
Dificultad [cak]: Difficulty scores in Kaqchikel
Average: The average score
Stdev: Standard deviation of the scores
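The column layout above could be read with Python's standard `csv` module, as in the sketch below. It makes several assumptions that the card does not confirm: that the TSV has a header row with exactly these column names, that the score cells hold the numbers 1 to 3, and that Stdev is a population standard deviation (it may be a sample standard deviation instead). The sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
import statistics

SCORE_COLUMNS = ["Dificultad [azz]", "Dificultad [cux]", "Dificultad [cak]"]

# Invented rows for illustration only; Average and Stdev are left empty.
SAMPLE_TSV = (
    "Sent-ID\tLang-ID\tDomain\tSource sentence\t"
    "Dificultad [azz]\tDificultad [cux]\tDificultad [cak]\tAverage\tStdev\n"
    "ex-001\tspa_Latn\texample\tFrase de ejemplo uno.\t1\t2\t3\t\t\n"
    "ex-002\tspa_Latn\texample\tFrase de ejemplo dos.\t2\t2\t2\t\t\n"
)


def read_rows(handle):
    """Parse the TSV and recompute the mean and spread of the three scores."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        # Skip empty or missing cells so partially annotated rows still load.
        scores = [
            float(row[col])
            for col in SCORE_COLUMNS
            if (row.get(col) or "").strip()
        ]
        rows.append(
            {
                "sent_id": row["Sent-ID"],
                "scores": scores,
                "mean": statistics.mean(scores) if scores else None,
                # Assumption: population standard deviation.
                "stdev": statistics.pstdev(scores) if scores else None,
            }
        )
    return rows


rows = read_rows(io.StringIO(SAMPLE_TSV))
```

For the real file, the same function could be given `open(path, encoding="utf-8", newline="")`, where `path` is wherever the downloaded TSV was saved. Comparing the recomputed values with the Average and Stdev columns would show which standard deviation convention the file uses.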
Usage and restrictions:
This repository is publicly accessible, but you have to accept the conditions to access its files and content.
By agreeing you accept to share your contact information (email and username) with the repository authors.
I agree not to re-host BOUQuET in places where it could be picked up by web crawlers
If I evaluate using BOUQuET, I will ensure that its contents are not in the training data