Tu, H. (CSE) – From Evaluation to Adaptation: Building Reliable Multimodal Intelligence

Multimodal large language models (MLLMs) are rapidly becoming general-purpose AI systems, yet their capabilities are advancing faster than our ability to evaluate, improve, and validate their reliability in realistic use. Standard benchmarks mainly measure in-distribution final-answer accuracy, leaving critical gaps in safety, robustness, fine-grained reasoning evaluation, and reliability in real-world agentic settings. My research proposes an evaluation-to-adaptation framework for building reliable multimodal intelligence: developing rigorous evaluations that expose failures beyond conventional benchmarks, learning feedback models that guide inference-time reasoning, and studying how multimodal systems can adapt through experience. This agenda is instantiated through two completed works and two proposed directions. Unicorn evaluates safety and robustness under out-of-distribution and adversarial conditions, revealing substantial vulnerabilities across 22 vision-language models. ViLBench studies vision-language process reward modeling as both an evaluation challenge and a mechanism for inference-time improvement, showing that process-guided reasoning selection can improve reliability. Building on these foundations, the two proposed directions study test-time experience accumulation and explore reliable multimodal agents for GUI and computer-use tasks. Together, this research aims to move beyond capability-driven progress alone, toward multimodal AI systems whose reliability can be evaluated, improved, and tested in realistic deployment settings.
Event Host: Haoqin Tu, Ph.D. Student, Computer Science & Engineering
Advisor: Cihang Xie
Zoom: 964 1355 0550
Passcode: zWxU8A