Toward Generalizable and Intelligent Visual Reasoning Models
- 👤 Speaker: Joy Hsu, Stanford University
- 📅 Date & Time: Monday 23 March 2026, 14:00 - 15:00
- 📍 Venue: LT6, Department of Engineering
Abstract
Visual reasoning models have made remarkable progress in recent years, yet they are still not widely deployed in critical real-world settings, where data is scarce, tasks are multi-step, and outputs must be inspectable and verifiable. To address this gap, I propose building multimodal reasoning models with structural priors that can robustly perceive, interpret, and interact with the physical world under human-specified instructions. In this talk, I will cover a spectrum of modeling paradigms and environments: (1) neuro-symbolic models, where hybrid explicit-implicit representations provide efficiency and generalization by design in structured settings; (2) foundation model-distilled frameworks, which externalize prior knowledge to structure vision-language models’ reasoning process in open-ended domains; and (3) structure-induction frameworks, which use interpretable representational bottlenecks to uncover patterns in complex, unlabeled visual data. I will conclude by outlining a path toward vision-language models that can generalize across diverse sensing modalities and conduct intelligent decision-making in the real world.
Bio: Joy Hsu is a PhD candidate in Computer Science at Stanford University, advised by Prof. Jiajun Wu. Her research focuses on making visual reasoning models reliable in real-world settings under sensing, data, and compute constraints. She develops multimodal reasoning models with structural priors that enable systems to perceive, interpret, and interact intelligently with the physical world across diverse, data-scarce domains. She is a recipient of the Knight-Hennessy Fellowship and the NSF Fellowship, and was awarded third place in the Amazon Robotics PhD competition and named a Rising Star in AI in 2025.
Zoom: https://cam-ac-uk.zoom.us/j/85290977324?pwd=C3KItZS8d2XaVyUsb88HKuV5wWFrYV.1
Series: This talk is part of the Computer Vision Seminars series.