Improving AI Conflict Resolution Capacities: A Prompts-Based Evaluation

License: CC-BY-4.0

Steward: IFIT

Task: NLP

Release Date: 12/16/2025

Format: CSV, PDF

Size: 1.46 MB



Description

Findings of a follow-up study assessing how leading free-access LLMs perform when given added instructions directing them to apply basic conflict-resolution practices.

Specifics

Licensing

Creative Commons Attribution 4.0 International (CC-BY-4.0)

https://spdx.org/licenses/CC-BY-4.0.html

Processes

Intended Use

This dataset is intended to enable researchers, practitioners, and developers to use, mix, and build upon the data to advance the responsible use of AI in conflict prevention and resolution. The goal is to foster collaboration across disciplines so that large language models (LLMs) can evolve to meet the ethical and practical standards required in real-world mediation contexts, shaping AI as an effective and trustworthy tool for preventing and resolving political crises and armed conflicts. Because LLMs are already being used to provide actionable advice in conflict zones, there is a need to identify and address key blind spots and to encourage LLM providers to strengthen their system prompts and safeguards through shared, transparent research.

Metadata

This dataset presents the findings of a re-test of two scenario prompts reflecting real-world situations in Syria and Mexico. These prompts were submitted—with an added instruction paragraph—to the same six free-access LLMs evaluated in the original study: ChatGPT (GPT-4o), Claude (Sonnet), Google Gemini (2.5 Flash), Grok (3), Mistral (Le Chat), and DeepSeek (v3). The added paragraph, referred to throughout the report as the “user prompt addition”, instructed models to follow a series of basic conflict resolution practices before offering advice.

The twelve responses were anonymized, randomized, and evaluated by IFIT reviewers using a ten-part rubric grounded in basic conflict resolution standards and practices.

Each response was rated on each rubric criterion using a four-point scale: strong yes, yes, no, or strong no. Responses were then matched back to their originating model, and cumulative scores were calculated. A strong yes corresponded to 10 points, yes to 5 points, no to 0 points, and strong no to -5 points, allowing scores to range from a minimum of -50 to a maximum of 100.
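The scoring arithmetic above can be sketched as follows (a minimal illustration of the described scheme; the rating labels, point values, and function name are taken from or invented for this sketch, not from the dataset files themselves):

```python
# Point values for each rubric rating, as described in the methodology.
POINTS = {"strong yes": 10, "yes": 5, "no": 0, "strong no": -5}

def cumulative_score(ratings):
    """Sum rubric points for one LLM response (ten ratings, one per criterion)."""
    return sum(POINTS[r] for r in ratings)

# "Strong yes" on all ten criteria yields the maximum score:
print(cumulative_score(["strong yes"] * 10))  # 100
# "Strong no" on all ten criteria yields the minimum score:
print(cumulative_score(["strong no"] * 10))   # -50
```

This matches the stated range of -50 to 100 for a ten-part rubric.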

The dataset includes the scoring dimensions, the quantitative and qualitative findings, the scenario prompt language, and all LLM responses.

The DOI registration ID for the report is: https://doi.org/10.5281/zenodo.16810663