MMRefine 💭
Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models

ACL Findings 2025

¹Theta One, Inc.  ²NAVER Cloud AI  ³KAIST AI
Corresponding author
*Most of this work was done during an internship at NAVER Cloud AI

Introduction

Recent advances have endowed Multimodal Large Language Models (MLLMs) with remarkable capabilities, enabling them to tackle complex challenges such as mathematical reasoning and multimodal understanding. Rather than concentrating solely on scaling model parameters during training, current research also aims to strengthen inference-time reasoning. Techniques such as Self-Refinement, in which a model iteratively improves its own output, and multi-agent debate, in which several models or agents reach a consensus, have gained traction.

These methodologies rely heavily on the ability of MLLMs to evaluate and refine their own responses. If this capability is not sufficiently reliable, refinement can unintentionally impair performance, introducing incorrect corrections and needlessly prolonging response times. It is therefore essential to investigate whether MLLMs can accurately identify and correct errors in their reasoning processes.

We introduce MMRefine, a MultiModal Refinement benchmark designed to evaluate the error refinement capabilities of MLLMs. With MMRefine, we can comprehensively analyze an MLLM's ability to detect and correct errors in a given initial solution across six distinct refinement scenarios and six error types.

Leaderboard

Models are ranked by RefScore, with mRecall as the secondary sorting criterion.

To contribute to the leaderboard, please contact the authors at giopaik0@gmail.com.


Column abbreviations: RF = Refinement Failure, ED = Error Detection, EC = Error Correction, RS = Refinement Success, FD = False Error Detection, VS = Verification Success. ⬆️ higher is better, ⬇️ lower is better.

| # | Model | RF ⬇️ | ED ⬆️ | EC ⬆️ | RS ⬆️ | FD ⬇️ | VS ⬆️ | RefScore ⬆️ | mRecall ⬆️ |
|---|-------|-------|-------|-------|-------|-------|-------|-------------|------------|
| 1 | Claude-4-Sonnet | 2.06 | 97.94 | 65.10 | 50.09 | 7.87 | 92.13 | 42.23 | 95.04 |
| 2 | Gemini-1.5-Pro | 3.75 | 96.25 | 64.54 | 45.22 | 22.10 | 77.90 | 23.12 | 87.00 |
| 3 | GPT-4o | 15.57 | 84.43 | 43.15 | 29.27 | 6.74 | 93.26 | 22.53 | 88.84 |
| 4 | Claude-3.5-Sonnet | 27.95 | 72.05 | 32.65 | 18.95 | 6.74 | 93.26 | 12.21 | 82.65 |
| 5 | LLaVA-OneVision-72B | 31.14 | 68.86 | 21.76 | 11.07 | 4.87 | 95.13 | 6.20 | 81.99 |
| 6 | InternVL2.5-4B | 45.22 | 54.78 | 6.00 | 4.13 | 0.75 | 99.25 | 3.38 | 77.02 |
| 7 | LLaVA-OneVision-7B | 42.59 | 57.41 | 5.44 | 4.50 | 1.87 | 98.13 | 2.63 | 77.77 |
| 8 | InternVL2.5-78B | 15.57 | 84.43 | 32.65 | 20.26 | 17.98 | 82.02 | 2.29 | 83.23 |
| 9 | LLaVA-Next-7B | 42.40 | 57.60 | 5.44 | 4.13 | 4.49 | 95.51 | -0.37 | 76.55 |
| 10 | Llama-3.2-Vision-90B | 16.89 | 83.11 | 28.33 | 16.51 | 17.23 | 82.77 | -0.72 | 82.94 |
| 11 | InternVL2.5-8B | 25.14 | 74.86 | 11.44 | 5.82 | 10.49 | 89.51 | -4.67 | 82.19 |
| 12 | Qwen2-VL-72B | 20.26 | 79.74 | 22.89 | 13.70 | 20.60 | 79.40 | -6.90 | 79.57 |
| 13 | Qwen2-VL-7B | 19.70 | 80.30 | 22.51 | 21.39 | 32.21 | 67.79 | -10.82 | 74.05 |
| 14 | LLaVA-Next-72B | 22.14 | 77.86 | 17.64 | 8.44 | 21.35 | 78.65 | -12.91 | 78.26 |
| 15 | Qwen2-VL-2B | 51.59 | 48.41 | 3.19 | 2.44 | 19.10 | 80.90 | -16.66 | 64.65 |
| 16 | InternVL2.5-1B | 41.09 | 58.91 | 3.75 | 1.88 | 19.85 | 80.15 | -17.97 | 69.53 |
| 17 | Llama-3.2-Vision-11B | 22.14 | 77.86 | 16.14 | 10.51 | 32.96 | 67.04 | -22.45 | 72.45 |
| 18 | LLaVA-OneVision-0.5B | 36.40 | 63.60 | 2.06 | 2.06 | 75.66 | 24.34 | -73.59 | 43.97 |
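
For reference, below is a minimal sketch of how the two ranking metrics can be computed and how rows are ordered. The formulas assumed here (RefScore = RS − FD and mRecall = (ED + VS) / 2) are consistent with the values in the table above, but they are our reading of the metrics rather than the official evaluation code; the class and function names in the sketch are purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class LeaderboardEntry:
    """One leaderboard row; all rates are percentages over the benchmark."""
    model: str
    rf: float  # Refinement Failure (lower is better)
    ed: float  # Error Detection
    ec: float  # Error Correction
    rs: float  # Refinement Success
    fd: float  # False Error Detection (lower is better)
    vs: float  # Verification Success


def ref_score(e: LeaderboardEntry) -> float:
    # Assumed definition: reward successful refinements, penalize false detections.
    return e.rs - e.fd


def m_recall(e: LeaderboardEntry) -> float:
    # Assumed definition: mean recall over erroneous (ED) and error-free (VS) solutions.
    return (e.ed + e.vs) / 2


def rank(entries: list[LeaderboardEntry]) -> list[LeaderboardEntry]:
    # Primary key: RefScore (descending); secondary key: mRecall (descending).
    return sorted(entries, key=lambda e: (ref_score(e), m_recall(e)), reverse=True)


# Usage example with the top two rows of the leaderboard above.
entries = [
    LeaderboardEntry("Claude-4-Sonnet", 2.06, 97.94, 65.10, 50.09, 7.87, 92.13),
    LeaderboardEntry("Gemini-1.5-Pro", 3.75, 96.25, 64.54, 45.22, 22.10, 77.90),
]
for e in rank(entries):
    print(f"{e.model}: RefScore={ref_score(e):.2f}, mRecall={m_recall(e):.2f}")
```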