Sentence translation difficulty in Spanish - BOUQuET

License:

CC-BY-SA-4.0

Steward:

MDC Curators

Task: MT

Release Date: 4/1/2026

Format: TSV

Size: 81.48 KB

Description

This dataset is a collection of Spanish sentences from the BOUQuET benchmark (1990 sentences in total), annotated with sentence translation difficulty scores on a Likert scale. The annotators are speakers of three Indigenous languages of Mexico and Guatemala, and they scored the sentences as part of the work on translating the benchmark into their languages.

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Considerations

Restrictions/Special Constraints

By agreeing you accept to share your contact information (email and username) with the repository authors.

  • I agree not to re-host BOUQuET in places where it could be picked up by web crawlers

  • If I evaluate using BOUQuET, I will ensure that its contents are not in the training data

Forbidden Usage

By agreeing you accept to share your contact information (email and username) with the repository authors.

  • I agree not to re-host BOUQuET in places where it could be picked up by web crawlers

  • If I evaluate using BOUQuET, I will ensure that its contents are not in the training data

Processes

Intended Use

This dataset is intended for evaluating models that estimate sentence translation difficulty, and for use in language learning and teaching.
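As an illustration of this kind of evaluation, the sketch below computes a Spearman rank correlation between human difficulty scores and a model's predicted difficulty. The choice of metric, the function names, and the sample numbers are assumptions made for this example; the dataset does not prescribe an evaluation protocol.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(gold, predicted):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(gold), average_ranks(predicted))


# Hypothetical values: averaged human scores and a model's difficulty estimates.
gold_scores = [1.0, 2.0, 2.0, 3.0, 1.0]
model_scores = [0.2, 0.6, 0.5, 0.9, 0.1]
print(round(spearman(gold_scores, model_scores), 3))
```

A rank-based measure is used here because the human scores are ordinal and heavily tied; other metrics may suit a given study better.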

Metadata

This dataset is a collection of Spanish sentences from the BOUQuET benchmark (1990 sentences in total), annotated with sentence translation difficulty scores on a Likert scale. The annotators are speakers of three Indigenous languages of Mexico and Guatemala, and they scored the sentences as part of the work on translating the benchmark into their languages.

Scores

We asked translators to score each sentence as they translated it, on a three-point scale:

  • Fácil - Easy (1)

  • Mediano - Medium (2)

  • Difícil - Hard (3)

They were asked to take into account the whole translation process, considering both the difficulty of the language structures and the difficulty of the concepts or terminology.
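The scale above can be written as a small lookup table. The sketch below is only an illustration: the names `LIKERT_SCORES` and `to_score` are hypothetical, and whether the file stores the Spanish labels or the numeric values is not stated here, so the conversion may be unnecessary.

```python
import unicodedata

# Scale taken from the list above: Spanish label -> numeric score.
LIKERT_SCORES = {"Fácil": 1, "Mediano": 2, "Difícil": 3}


def to_score(label):
    """Convert a Spanish difficulty label to its numeric score.

    The label is stripped and normalized to NFC so that accented
    characters typed as base letter + combining accent still match.
    """
    key = unicodedata.normalize("NFC", label.strip())
    return LIKERT_SCORES[key]


print(to_score("Mediano"))
```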

Annotators

  • Dificultad [azz]: Speaker of Highland Puebla Nahuatl (azz) variety from Tetela de Ocampo, bilingual in Spanish. Secondary school education (pre-bachiler).

  • Dificultad [cux]: Speakers of Cuicatec (cux) variety from Santos Reyes Pápalo, bilingual in Spanish. University education.

  • Dificultad [cak]: Speaker of Kaqchikel (cak), bilingual in Spanish. University education.

Columns:

  • Sent-ID: The sentence ID

  • Lang-ID: The language ID, in this case spa_Latn (Spanish in Latin script)

  • Domain: The domain of the sentence

  • Source sentence: The source sentence in Spanish (may be a translation)

  • Dificultad [azz]: Difficulty score given by the Highland Puebla Nahuatl translator

  • Dificultad [cux]: Difficulty score given by the translators of Cuicatec from Santos Reyes Pápalo

  • Dificultad [cak]: Difficulty score given by the Kaqchikel translator

  • Average: The average of the difficulty scores for the sentence

  • Stdev: The standard deviation of the difficulty scores for the sentence
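As a rough sketch of how these columns might be read, the example below parses tab-separated text with Python's standard `csv` module and recomputes the mean and standard deviation of the three annotator scores. Several things are assumptions: the header names are copied from the list above and may not match the file exactly, the score cells are assumed to be numeric, the sample rows are invented placeholders and not real dataset content, and the sample standard deviation is used because the description does not say whether Stdev is the sample or population value.

```python
import csv
import io
import statistics

# Column names as listed above (assumed to match the file's header row).
SCORE_COLUMNS = ["Dificultad [azz]", "Dificultad [cux]", "Dificultad [cak]"]


def read_rows(tsv_text):
    """Parse TSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


def summarize(row):
    """Return (mean, sample standard deviation) of one row's three scores."""
    scores = [float(row[col]) for col in SCORE_COLUMNS]
    return statistics.mean(scores), statistics.stdev(scores)


# Invented placeholder rows, for illustration only.
SAMPLE = (
    "Sent-ID\tLang-ID\tDomain\tSource sentence\t"
    "Dificultad [azz]\tDificultad [cux]\tDificultad [cak]\n"
    "example-1\tspa_Latn\texample\tOración de ejemplo.\t1\t2\t3\n"
    "example-2\tspa_Latn\texample\tOtra oración de ejemplo.\t2\t2\t2\n"
)

rows = read_rows(SAMPLE)
for row in rows:
    mean, stdev = summarize(row)
    print(row["Sent-ID"], round(mean, 2), round(stdev, 2))
```

Comparing the recomputed values against the file's own Average and Stdev columns is one way to check which standard deviation convention the file uses.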

Usage and restrictions:

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

By agreeing you accept to share your contact information (email and username) with the repository authors.

  • I agree not to re-host BOUQuET in places where it could be picked up by web crawlers

  • If I evaluate using BOUQuET, I will ensure that its contents are not in the training data