SI-NLI

License: CC-BY-NC-SA-4.0

Steward: Center za jezikovne vire in tehnologije Univerze v Ljubljani

Task: NLU

Release Date: 1/9/2026

Format: TSV

Size: 392.44 KB


Description

SI-NLI (Slovene Natural Language Inference Dataset) contains 5,937 human-created Slovene sentence pairs (premise and hypothesis), each manually labeled as "entailment", "contradiction", or "neutral". We created the dataset using sentences that appear in the Slovenian reference corpus ccKres (http://hdl.handle.net/11356/1034). Annotators were asked to modify the hypothesis in a candidate pair so that it reflects one of the labels. The dataset is balanced because the annotators created three modifications (entailment, contradiction, neutral) for each candidate sentence pair. The dataset is split into train, validation, and test sets containing 4,392, 547, and 998 pairs, respectively. We used Slovenian pre-trained language models to create the splits, thereby ensuring that difficult and easy instances are evenly distributed across all three subsets. The dataset is released in tabular TSV format. The README.txt file contains a description of the attributes. The test set contains only the premise and hypothesis (i.e. no annotations) because SI-NLI is integrated into the Slovene evaluation framework SloBENCH (https://slobench.cjvt.si/). If you use the dataset to train your models, please consider submitting your test set predictions to SloBENCH to obtain an evaluation score and see how it compares to others.
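As a rough illustration of working with the TSV format and checking the label balance described above, the following Python sketch reads tab-separated rows and counts labels. The column names `premise`, `hypothesis`, and `label`, and the English sample rows, are assumptions made for this example only; the actual attribute names are documented in README.txt.

```python
import csv
import io
from collections import Counter

# The three labels named in the dataset description.
LABELS = {"entailment", "contradiction", "neutral"}


def read_pairs(handle):
    """Read tab-separated rows (with a header line) into a list of dicts."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def label_counts(rows, label_column="label"):
    """Count rows per label; 'label' is an assumed column name."""
    return Counter(row[label_column] for row in rows if row.get(label_column))


# Tiny in-memory sample standing in for a real split file (hypothetical rows).
sample = (
    "premise\thypothesis\tlabel\n"
    "A man is reading a book.\tA man is reading.\tentailment\n"
    "A man is reading a book.\tA man is asleep.\tcontradiction\n"
    "A man is reading a book.\tThe book is a novel.\tneutral\n"
)

rows = read_pairs(io.StringIO(sample))
counts = label_counts(rows)

# Flag any label that is not one of the three expected values.
unexpected = set(counts) - LABELS
```

With a real split file, the same functions could be applied to a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` points to whichever file name the download actually uses. Since the test set is distributed without annotations, `label_counts` would return an empty counter for it if the label column is absent or blank.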

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

Due to the source of the text pairs (ccKres), the dataset is licensed under CC-BY-NC-SA and is intended for non-commercial use only. The uploaded dataset contains an unlabeled test set. The test set labels are hidden to reduce the chance of their being included in the training data of large language models. Models can be tested via SloBENCH, the evaluation platform for Slovenian NLP tasks (https://slobench.cjvt.si/).

Forbidden Usage

Commercial applications.

Processes

Intended Use

The dataset is intended for training and evaluating natural language inference models.

Metadata