AI on the Frontline: Evaluating Large Language Models in Real-World Conflict Resolution
License:
CC-BY-4.0
Steward:
IFIT
Task: NLP
Release Date: 12/16/2025
Format: CSV, PDF
Size: 2.36 MB
Description
Findings of an experimental evaluation to assess how leading free-access LLMs perform when asked to respond to realistic conflict resolution scenarios.
Specifics
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Intended Use
This dataset is intended to enable researchers, practitioners, and developers to use, mix, and build upon the data to advance the responsible use of AI in conflict prevention and resolution. The goal is to foster collaboration across disciplines so that large language models (LLMs) can evolve to meet the ethical and practical standards required in real-world mediation contexts, shaping AI as an effective and trustworthy tool for preventing and resolving political crises and armed conflicts. As LLMs are already being used to provide actionable advice in conflict zones, there is a need to identify and address key blind spots and to encourage LLM providers to strengthen their system prompts and safeguards through shared, transparent research.
Metadata
This dataset presents findings from an experimental evaluation conducted by the Institute for Integrated Transitions (IFIT) to assess how leading free-access LLMs perform when asked to respond to realistic conflict resolution scenarios. Drawing from IFIT project work in Sudan, Mexico, and Syria, IFIT designed three user prompts and submitted them to ChatGPT (GPT-4o), Claude (Sonnet), Google Gemini (2.5 Flash), Grok (3), Mistral (Le Chat), and DeepSeek (v3).
The eighteen responses were anonymized, randomized, and evaluated by IFIT reviewers using a ten-part rubric grounded in basic conflict resolution standards and practices.
Each response was scored on each predefined criterion using a four-point scale: strong yes / yes / no / strong no. Responses were then matched back to their originating model, and cumulative scores were calculated. A strong yes corresponded to 10 points, yes to 5 points, no to 0 points, and strong no to -5 points; with ten criteria, cumulative scores could therefore range from a minimum of -50 to a maximum of 100.
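The scoring arithmetic above can be sketched as follows; this is an illustrative reconstruction, and the function and label names are not taken from the dataset itself:

```python
# Point values for the four rating options in the rubric.
POINTS = {"strong yes": 10, "yes": 5, "no": 0, "strong no": -5}

def cumulative_score(ratings):
    """Sum the point values across one model's ten rubric ratings."""
    return sum(POINTS[r] for r in ratings)

# With a ten-part rubric, the extremes confirm the stated range:
print(cumulative_score(["strong no"] * 10))   # -50 (minimum)
print(cumulative_score(["strong yes"] * 10))  # 100 (maximum)
```

A mixed set of ratings falls between these bounds, e.g. five "yes" and five "no" answers yield 25 points.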
A Sounding Board composed of nine senior IFIT experts convened alongside the first- and second-round reviewers and the project team to reflect critically on the findings, analyze broader implications, and consider recommendations for both LLM developers and end-users operating in conflict-affected settings.
The dataset includes the scoring dimensions, the quantitative and qualitative findings, the scenario prompt language, and all LLM responses.
The DOI registration ID for the report is: https://doi.org/10.5281/zenodo.16598073