AI on the Frontline: Evaluating Large Language Models in Real-World Conflict Resolution
License:
CC-BY-4.0
Steward:
IFIT
Task: NLP
Release Date: 12/16/2025
Format: CSV, PDF
Size: 2.36 MB
Description
Findings of an experimental evaluation to assess how leading free-access LLMs perform when asked to respond to realistic conflict resolution scenarios.
Specifics
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Intended Use
This dataset is intended to enable researchers, practitioners, and developers to use, mix, and build upon the data to advance the responsible use of AI in conflict prevention and resolution. The goal is to foster collaboration across disciplines so that large language models (LLMs) can evolve to meet the ethical and practical standards required in real-world mediation contexts, shaping AI as an effective and trustworthy tool for preventing and resolving political crises and armed conflicts. As LLMs are already being used to provide actionable advice in conflict zones, there is a need to identify and address key blind spots and to encourage LLM providers to strengthen their system prompts and safeguards through shared, transparent research.
Metadata
This dataset presents findings from an experimental evaluation conducted by the Institute for Integrated Transitions (IFIT) to assess how leading free-access LLMs perform when asked to respond to realistic conflict resolution scenarios. Drawing from IFIT project work in Sudan, Mexico, and Syria, IFIT designed three user prompts and submitted them to ChatGPT (GPT-4o), Claude (Sonnet), Google Gemini (2.5 Flash), Grok (3), Mistral (Le Chat), and DeepSeek (v3).
The eighteen responses were anonymized, randomized, and evaluated by IFIT reviewers using a ten-part rubric grounded in basic conflict resolution standards and practices.
Each response was scored on each predefined criterion using a four-point scale: strong yes / yes / no / strong no. Responses were then matched back to their originating model, and cumulative scores were calculated. A strong yes corresponded to 10 points, yes to 5 points, no to 0 points, and strong no to -5 points; with ten criteria, cumulative scores could therefore range from a minimum of -50 to a maximum of 100.
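The scoring arithmetic above can be sketched as follows; this is an illustrative reconstruction, and the function and label names are not taken from the dataset itself:

```python
# Point values for the four rating options in the rubric.
POINTS = {"strong yes": 10, "yes": 5, "no": 0, "strong no": -5}

def cumulative_score(ratings):
    """Sum the point values across one model's ten rubric ratings."""
    return sum(POINTS[r] for r in ratings)

# With a ten-part rubric, the extremes confirm the stated range:
print(cumulative_score(["strong no"] * 10))   # -50 (minimum)
print(cumulative_score(["strong yes"] * 10))  # 100 (maximum)
```

A mixed set of ratings falls between these bounds, e.g. five "yes" and five "no" answers yield 25 points.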
A Sounding Board composed of nine senior IFIT experts convened alongside the first- and second-round reviewers and the project team to reflect critically on the findings, analyze broader implications, and consider recommendations for both LLM developers and end-users operating in conflict-affected settings.
The dataset includes the scoring dimensions, the quantitative and qualitative findings, the scenario prompt language, and all LLM responses.
The DOI registration ID for the report is: https://doi.org/10.5281/zenodo.16598073