Multilingual Humanitarian Response Eval (MHRE)
License: CC-BY-NC-SA-4.0
Steward: Taraaz
Task: LLM
Release Date: 12/8/2025
Format: csv
Size: 2.15 MB
Description
This multilingual humanitarian dataset contains 655 annotated datapoints evaluating AI chatbot safety and quality in migration and asylum scenarios across four language pairs (English–Farsi (Iranian Persian), English–Arabic, English–Kurdish (Sorani), English–Pashto). Built from 120 expert prompts, it includes outputs from GPT-4o, Gemini 2.5 Flash, and Mistral Small. The dataset provides both human evaluations from Respond Crisis Translation native-speaker evaluators and LLM-as-judge assessments (Gemini 2.5 Flash).
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
- The dataset is provided exclusively for **non-commercial research purposes**, and the human evaluation data must **not** be used by major AI labs (companies with annual revenue above ten million USD) for training, fine-tuning, or RLHF.
- Any research, publication, model evaluation, or dataset derivative that uses this data must include full attribution to:
  - [Multilingual AI Lab](https://www.multilingualailab.com/)
  - [Respond Crisis Translation](https://respondcrisistranslation.org/)
- Users of this dataset must acknowledge that:
  - Human interpretation and translation remain essential in humanitarian, refugee, and asylum contexts.
  - AI systems evaluated or trained using this dataset must **not** be used to replace, eliminate, or diminish the critical role of human interpreters, cultural mediators, or language specialists who support refugees, asylum seekers, and migrants.
- The dataset must **not** be used to develop automated systems intended to remove humans from life-critical interpretation workflows, including:
  - asylum interviews
  - refugee case processing
  - emergency or crisis translation
  - legal, medical, or other high-stakes humanitarian communication
- By using this dataset, you agree **not** to use it in ways that undermine, replace, or devalue the work of community interpreters and translators, especially those serving displaced and linguistically marginalized populations.
Forbidden Usage
- The dataset is provided exclusively for **non-commercial research purposes**, and the **human** evaluation data must **not** be used by major AI labs (companies with annual revenue above ten million USD) for training, fine-tuning, or RLHF.
- The dataset must **not** be used to develop automated systems intended to remove humans from life-critical interpretation workflows, including:
  - asylum interviews
  - refugee case processing
  - emergency or crisis translation
  - legal, medical, or other high-stakes humanitarian communication
- By using this dataset, you agree **not** to use it in ways that undermine, replace, or devalue the work of community interpreters and translators, especially those serving displaced and linguistically marginalized populations.
Processes
Intended Use
This dataset can be used by anyone conducting research on multilingual inconsistencies in large language model responses, and especially for:
- Helping close linguistic accessibility gaps in AI-enabled services
- Pinpointing harms and vulnerabilities affecting refugees and asylum seekers
- Holding AI system developers accountable across languages
- Strengthening the role of human interpreters, not replacing them
- Advancing research on the strengths and limitations of LLM-as-a-judge systems and human evaluation methods

Some potential research directions include:
- Conducting LLM model-to-model comparisons of responses
- Analyzing human vs. LLM-as-a-judge scoring patterns (see the sketch following this list)
- Developing benchmarks based on the original scenarios
- Replicating the study for languages not covered in this work
- Performing context-level analysis to assess which criteria or domains (e.g., health questions, asylum process questions, financial advice, digital security) receive higher or lower scores
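As a starting point for the human vs. LLM-as-a-judge direction above, here is a minimal sketch in Python with pandas. The filename `mhre.csv` and the column names (`model`, `language_pair`, `human_actionability`, `judge_actionability`) are illustrative assumptions, not the dataset's documented schema; match them against the actual CSV header before use.

```python
# Minimal sketch: compare human and LLM-as-a-judge scores.
# All column names below are hypothetical placeholders; check the
# dataset's 86-column header and substitute the real names.
import pandas as pd

df = pd.read_csv("mhre.csv")  # hypothetical filename

# Mean human vs. judge score per model and language pair, for one
# example dimension (actionability).
summary = (
    df.groupby(["model", "language_pair"])[
        ["human_actionability", "judge_actionability"]
    ].mean()
)
print(summary.round(2))

# Simple agreement signal: rank correlation between human and judge scores.
agreement = df["human_actionability"].corr(
    df["judge_actionability"], method="spearman"
)
print(f"Spearman correlation (human vs. judge): {agreement:.2f}")
```

Rank correlation is used here rather than a raw mean difference because human and LLM judges may use the scoring scale differently while still ordering responses consistently.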
Metadata
This multilingual humanitarian evaluation dataset contains 655 annotated datapoints assessing AI chatbot safety and quality in migration and asylum scenarios. Built from 120 expert-crafted prompts based on real-world information needs, it evaluates outputs from GPT-4o, Gemini 2.5 Flash, and Mistral Small across four language pairs (English–Farsi (Iranian Persian), English–Arabic, English–Central Kurdish (Sorani), English–Pashto).
The dataset provides both human evaluations from native-speaker evaluators at Respond Crisis Translation and LLM-as-judge assessments (powered by Gemini 2.5 Flash). It includes 86 columns capturing six evaluation dimensions per language (actionability, factual accuracy, tone and empathy, non-discrimination, safety and privacy, refusals and freedom of access to information), along with cross-lingual disparity scores, response times, and word counts. This supports systematic research on LLM performance, multilingual consistency, and automated evaluation reliability in humanitarian contexts.
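As a hedged illustration of working with the per-dimension columns, the sketch below averages assumed human scores per language pair across all six dimensions; a gap between language pairs on a dimension is one way to surface the cross-lingual disparities the dataset tracks. Every column name here is an assumption to be checked against the real header.

```python
# Hedged sketch: per-language-pair means across the six evaluation
# dimensions. Column names follow an assumed "human_<dimension>" pattern
# and are not the dataset's documented schema.
import pandas as pd

DIMENSIONS = [
    "actionability",
    "factual_accuracy",
    "tone_empathy",
    "non_discrimination",
    "safety_privacy",
    "refusals_access",
]

df = pd.read_csv("mhre.csv")  # hypothetical filename

per_pair = df.groupby("language_pair")[[f"human_{d}" for d in DIMENSIONS]].mean()
print(per_pair.round(2))
```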
Notes about the factuality assessment: Human evaluators had access to Google Search for verifying NGO names, contact information, referenced laws, acronyms, etc., as required for assessing the factuality score of the LLM responses. In some cases, human evaluators did not fully verify these entities, laws, and related details; the columns associated with entity checks may therefore not be entirely reliable. The LLM-as-a-judge, by contrast, did not have access to Google Search, so its factuality scores are not based on actual verification of entities or other checkable information.
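Given this caveat, one defensive pattern is to exclude the entity-check columns when aggregating factuality. A minimal sketch, assuming (hypothetically) that those columns share an `entity_check_` prefix:

```python
# Hedged sketch: drop assumed entity-check columns before aggregating
# factuality. The "entity_check_" prefix, "model" column, and
# "human_factual_accuracy" column are illustrative assumptions.
import pandas as pd

df = pd.read_csv("mhre.csv")  # hypothetical filename

# Keep everything except columns flagged as entity checks.
reliable = df.drop(columns=[c for c in df.columns if c.startswith("entity_check_")])

# Aggregate the remaining human factuality scores per model.
print(reliable.groupby("model")["human_factual_accuracy"].mean().round(2))
```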
To view the full study, please visit: https://www.multilingualailab.com/