
Toward Generalizable and Intelligent Visual Reasoning Models


If you have a question about this talk, please contact Elliott Wu.

Visual reasoning models have made remarkable progress in recent years, yet they are still not widely deployed in critical real-world settings, where data is scarce, tasks are multi-step, and outputs must be inspectable and verifiable. To address this gap, I propose building multimodal reasoning models with structural priors that can robustly perceive, interpret, and interact with the physical world under human-specified instructions. In this talk, I will cover a spectrum of modeling paradigms and environments: (1) neuro-symbolic models, where hybrid explicit-implicit representations provide efficiency and generalization by design in structured settings; (2) foundation model-distilled frameworks, which externalize prior knowledge to structure vision-language models’ reasoning process in open-ended domains; and (3) structure-induction frameworks, which use interpretable representational bottlenecks to uncover patterns in complex, unlabeled visual data. I will conclude by outlining a path toward vision-language models that can generalize across diverse sensing modalities and make intelligent decisions in the real world.

Bio: Joy Hsu is a PhD candidate in Computer Science at Stanford University, advised by Prof. Jiajun Wu. Her research focuses on making visual reasoning models reliable in real-world settings under sensing, data, and compute constraints. She develops multimodal reasoning models with structural priors that enable systems to perceive, interpret, and interact intelligently with the physical world across diverse, data-scarce domains. She is a recipient of the Knight-Hennessy Fellowship and the NSF Fellowship, and was awarded third place in the Amazon Robotics PhD competition and named a Rising Star in AI in 2025.

Zoom: https://cam-ac-uk.zoom.us/j/85290977324?pwd=C3KItZS8d2XaVyUsb88HKuV5wWFrYV.1

This talk is part of the Computer Vision Seminars series.


© 2006-2025 Talks.cam, University of Cambridge.