Sentence translation difficulty in English - BOUQuET

License:

CC-BY-SA-4.0

Steward:

MDC Curators

Task: NLP

Release Date: 4/3/2026

Format: TSV

Size: 85.61 KB

Description

This dataset is a collection of 1990 English sentences from the BOUQuET benchmark, each annotated with translation difficulty scores on a five-point Likert scale. The annotators are speakers of six Indigenous languages of Pakistan, who scored the sentences while translating the benchmark into their languages.

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Considerations

Restrictions/Special Constraints

By accepting, you agree to share your contact information (email and username) with the repository authors.

- I agree not to re-host BOUQuET in places where it could be picked up by web crawlers.
- If I evaluate using BOUQuET, I will ensure that its contents are not in the training data.

Forbidden Usage

Processes

Intended Use

This dataset is intended for evaluating models that estimate sentence difficulty, both for translation and for language learning and teaching.
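One possible way to run such an evaluation is to compare a model's predicted difficulty with the human scores using a rank correlation. The card does not prescribe a metric, so the choice of Spearman correlation here is an assumption, as are the function names and the toy numbers, which are invented for illustration and not taken from the dataset.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the group
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted, reference):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(predicted) != len(reference):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(predicted), average_ranks(reference))


# Invented example: model outputs versus human average scores.
model_scores = [0.1, 0.4, 0.3, 0.9]
human_average = [1.0, 2.5, 2.0, 4.5]
rho = spearman(model_scores, human_average)
```

A rank-based metric only asks the model to order sentences the way the annotators did, so the model's outputs do not need to be on the 1 to 5 scale.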

Metadata

### Scores

We asked translators to score each sentence as they were translating, on a scale of:

Very Easy (1)
Easy (2)
Moderate (3)
Difficult (4)
Very Difficult (5)

They were asked to consider the whole translation process, covering both the difficulty of the language structures and the difficulty of the concepts or terminology.
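For code that consumes the scores, the scale can be represented as an enumeration. This is a minimal sketch that assumes the score cells hold the integers 1 to 5 rather than the text labels; `Difficulty`, `parse_score`, and `label` are hypothetical names introduced for illustration.

```python
from enum import IntEnum


class Difficulty(IntEnum):
    """The five-point difficulty scale used by the annotators."""
    VERY_EASY = 1
    EASY = 2
    MODERATE = 3
    DIFFICULT = 4
    VERY_DIFFICULT = 5


def parse_score(raw):
    """Convert a raw cell such as "3" to a Difficulty.

    Raises ValueError for values outside the 1 to 5 scale.
    """
    return Difficulty(int(raw))


def label(score):
    """Return the human-readable label, e.g. "Very Easy"."""
    return score.name.replace("_", " ").title()
```

Because `IntEnum` members behave like integers, parsed scores can still be averaged or compared directly.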

### Annotators

kxp - Speaker of Wadiyara Koli
bsh - Speaker of Kateviri
kls - Speaker of Kalasha
ydg - Speaker of Yidgha
bft - Speaker of Balti
skr - Speaker of Saraiki

### Columns:

Sent-ID: The sentence ID
Lang-ID: The language ID, in this case eng_Latn (English in Latin script)
Domain: The domain of the sentence
Source sentence: The source sentence in English (may be a translation)
kxp: Difficulty scores from speaker of Wadiyara Koli
bsh: Difficulty scores from speaker of Kateviri
kls: Difficulty scores from speaker of Kalasha
ydg: Difficulty scores from speaker of Yidgha
bft: Difficulty scores from speaker of Balti
skr: Difficulty scores from speaker of Saraiki
Average: The average score
Stdev: Standard deviation of the scores
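The column layout above can be read with Python's standard library. The sketch below assumes the TSV has a header row whose names match the column names listed here, and that score cells are numeric. The sample row is invented for illustration and is not taken from the dataset. The card does not say whether Stdev is the sample or the population standard deviation, so the sketch computes both.

```python
import csv
import io
import statistics

ANNOTATOR_COLUMNS = ["kxp", "bsh", "kls", "ydg", "bft", "skr"]

# Invented row for illustration only; Average and Stdev are left empty
# because they are recomputed below.
SAMPLE_TSV = (
    "Sent-ID\tLang-ID\tDomain\tSource sentence\t"
    "kxp\tbsh\tkls\tydg\tbft\tskr\tAverage\tStdev\n"
    "example-1\tLANG\texample\tAn invented sentence.\t"
    "1\t2\t2\t3\t1\t3\t\t\n"
)


def summarize(handle):
    """Recompute per-sentence statistics from the six annotator columns."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    results = []
    for row in reader:
        scores = [float(row[col]) for col in ANNOTATOR_COLUMNS]
        results.append({
            "Sent-ID": row["Sent-ID"],
            "mean": statistics.mean(scores),
            "sample_stdev": statistics.stdev(scores),
            "population_stdev": statistics.pstdev(scores),
        })
    return results


rows = summarize(io.StringIO(SAMPLE_TSV))
```

To run this on the real file, pass an open file handle instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the TSV was saved and UTF-8 is an assumed encoding. Comparing the recomputed values with the Stdev column should show which definition the dataset uses.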

### Usage and restrictions:

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

By accepting, you agree to share your contact information (email and username) with the repository authors.

- I agree not to re-host BOUQuET in places where it could be picked up by web crawlers.
- If I evaluate using BOUQuET, I will ensure that its contents are not in the training data.