Sentence translation difficulty in English - BOUQuET

License:

CC-BY-SA-4.0

Steward:

MDC Curators

Task: NLP

Release Date: 4/3/2026

Format: TSV

Size: 85.61 KB


Description

This dataset is a collection of English sentences from the BOUQuET benchmark (1990 sentences in total), each annotated with a translation difficulty score on a Likert scale. The annotators are speakers of six Indigenous languages of Pakistan, who scored the sentences while translating the benchmark into their languages.

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Considerations

Restrictions/Special Constraints

By agreeing, you consent to sharing your contact information (email and username) with the repository authors.

  • I agree not to re-host BOUQuET in places where it could be picked up by web crawlers

  • If I evaluate using BOUQuET, I will ensure that its contents are not in the training data

Forbidden Usage

By agreeing, you consent to sharing your contact information (email and username) with the repository authors.

  • I agree not to re-host BOUQuET in places where it could be picked up by web crawlers

  • If I evaluate using BOUQuET, I will ensure that its contents are not in the training data

Processes

Intended Use

This dataset is intended for evaluating models that estimate sentence translation difficulty, and for use in language learning and teaching.

Metadata

### Scores

We asked translators to score each sentence as they were translating, on a scale of:

  • Very Easy (1)

  • Easy (2)

  • Moderate (3)

  • Difficult (4)

  • Very Difficult (5)

They were asked to consider the whole translation process, covering both the difficulty of the language structures and the difficulty of the concepts or terminology.
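For working with the scores in code, the scale can be written as a small lookup table. This is a minimal sketch: the names `DIFFICULTY_LABELS` and `label_for` are illustrative and not part of the dataset.

```python
# Labels for the 1-5 difficulty scale described in this section.
DIFFICULTY_LABELS = {
    1: "Very Easy",
    2: "Easy",
    3: "Moderate",
    4: "Difficult",
    5: "Very Difficult",
}


def label_for(score: int) -> str:
    """Return the label for a single annotator score (hypothetical helper)."""
    if score not in DIFFICULTY_LABELS:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return DIFFICULTY_LABELS[score]
```

The helper only applies to individual annotator scores; averaged scores are generally fractional and do not map directly onto a single label.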

### Annotators

  • kxp - Speaker of Wadiyara Koli

  • bsh - Speaker of Kateviri

  • kls - Speaker of Kalasha

  • ydg - Speaker of Yagdha

  • bft - Speaker of Balti

  • skr - Speaker of Saraiki

### Columns

  • Sent-ID: The sentence ID

  • Lang-ID: The language ID, in this case eng_Latn (English in Latin script)

  • Domain: The domain of the sentence

  • Source sentence: The source sentence in English (may be a translation)

  • kxp: Difficulty scores from speaker of Wadiyara Koli

  • bsh: Difficulty scores from speaker of Kateviri

  • kls: Difficulty scores from speaker of Kalasha

  • ydg: Difficulty scores from speaker of Yagdha

  • bft: Difficulty scores from speaker of Balti

  • skr: Difficulty scores from speaker of Saraiki

  • Average: The average score

  • Stdev: Standard deviation of the scores
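The sketch below shows one way to read rows in this layout and recompute the per-sentence statistics with the Python standard library. Several things here are assumptions: that the TSV header row uses the column names exactly as listed above, that the sample rows (invented for illustration and trimmed to the columns the sketch reads) resemble the real file, and that `Stdev` is a sample standard deviation. The card does not say whether the sample or population formula was used, so compare against the published values before relying on either.

```python
import csv
import io
import statistics

ANNOTATOR_COLUMNS = ["kxp", "bsh", "kls", "ydg", "bft", "skr"]

# Invented rows for illustration only; they are not taken from the dataset.
SAMPLE_TSV = (
    "Sent-ID\tSource sentence\tkxp\tbsh\tkls\tydg\tbft\tskr\n"
    "ex-001\tThis is a made-up sentence.\t1\t2\t1\t2\t1\t2\n"
    "ex-002\tThis is another made-up sentence.\t3\t5\t4\t2\t4\t3\n"
)


def read_rows(handle):
    """Yield one dict per sentence, with annotator scores parsed as ints."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        scores = [int(row[col]) for col in ANNOTATOR_COLUMNS]
        yield {
            "sent_id": row["Sent-ID"],
            "sentence": row["Source sentence"],
            "scores": scores,
            "mean": statistics.mean(scores),
            # Sample standard deviation (n - 1); an assumption, not confirmed.
            "stdev": statistics.stdev(scores),
        }


rows = list(read_rows(io.StringIO(SAMPLE_TSV)))
for r in rows:
    print(r["sent_id"], r["scores"], round(r["mean"], 2), round(r["stdev"], 2))
```

To run this against the real file, replace the `io.StringIO(SAMPLE_TSV)` argument with a handle opened as `open(path, encoding="utf-8", newline="")`. If the recomputed values do not match the `Stdev` column, `statistics.pstdev` (population formula) is the other candidate to try.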

### Usage and restrictions

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

By agreeing, you consent to sharing your contact information (email and username) with the repository authors.

  • I agree not to re-host BOUQuET in places where it could be picked up by web crawlers

  • If I evaluate using BOUQuET, I will ensure that its contents are not in the training data
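One way to approach the training-data condition is to scan a training corpus for benchmark sentences before evaluating. The following is a rough sketch, not a procedure prescribed by the dataset authors: the function names are hypothetical, the example strings are invented, and the check only catches near-verbatim copies (after case, punctuation, and whitespace normalization). Paraphrases and translations would need a different method.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC and casefolding, then collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def find_overlap(benchmark_sentences, training_texts):
    """Return benchmark sentences that occur inside any training text."""
    # Pad with spaces so matches align with word boundaries.
    needles = {s: f" {normalize(s)} " for s in benchmark_sentences}
    hits = set()
    for text in training_texts:
        padded = f" {normalize(text)} "
        for sentence, needle in needles.items():
            if needle.strip() and needle in padded:
                hits.add(sentence)
    return sorted(hits)


# Invented strings for illustration; not taken from BOUQuET.
benchmark = ["This is a made-up sentence.", "Another invented example!"]
training = [
    "Intro text. this is a MADE-UP sentence, quoted here.",
    "Unrelated text.",
]
leaked = find_overlap(benchmark, training)
print(leaked)
```

The nested loop is quadratic in the number of sentences and documents, which is acceptable for a benchmark of this size against a small corpus; larger corpora would call for an index or n-gram hashing.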