BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Beyond Surface Matching: Reasoning\, Grounding\, and Retrieval in 
 Vision-Language Models - Prof. Vicente Ordóñez-Román (Rice University)
DTSTART:20260319T110000Z
DTEND:20260319T120000Z
UID:TALK244540@talks.cam.ac.uk
CONTACT:Lucas Resck
DESCRIPTION:Abstract: Vision-language models have made remarkable progress
  on multimodal benchmarks\, yet much of this performance relies on shallow
  pattern matching — single-vector compression in retrieval\, brute-force
  training scaling in reasoning\, and surface-level lexical cues in groundi
 ng. In this talk\, I present recent work that addresses these limitations.
  I begin with MetaEmbed\, a flexible multi-vector retrieval framework that
  introduces learnable meta tokens processed by a vision-language backbone\
 , whose contextualized representations enable late interaction at variable
  granularity. Through a Matryoshka multi-vector training objective\, MetaE
 mbed learns coarse-to-fine embeddings that allow users to balance retrieval
  quality against efficiency at test time\, achieving state-of-the-art resul
 ts on the MMEB and ViDoRe benchmarks across model scales up to 32B paramet
 ers. I then present ProxyThinker\, an inference-time method that transfers
  visual reasoning capabilities from small reinforcement-fine-tuned models 
 to larger base models without any additional training. By steering the lar
 ge model's token distributions using the logit difference between a small 
 reasoning expert and its base counterpart\, ProxyThinker elicits slow-thin
 king behaviors such as self-verification and backtracking\, achieving perf
 ormance competitive with full-scale reinforcement fine-tuning at a fractio
 n of the cost. I conclude with a brief overview of two ongoing directions:
  Referring Scenario Comprehension\, a benchmark that challenges grounding 
 models with non-literal\, scenario-based queries requiring reasoning over 
 user intent and relational context\; and Retrieval-Augmented Reinforcement
  Fine-Tuning\, which trains language models to reason by analogy through r
 etrieved demonstrations selected for reasoning utility rather than surface
  similarity.\n\nBio: Vicente Ordóñez-Román is an Associate Professor in
  the Department of Computer Science at Rice University. His research inter
 ests lie at the intersection of computer vision\, natural language
  processing\, and machine learning. His focus is on building efficient
  visual recognition models for tasks that leverage both images and text. He
  received a Best Paper Award at the Conference on Empirical Methods in Nat
 ural Language Processing (EMNLP) 2017 and the Best Paper Award (Marr Priz
 e) at the International Conference on Computer Vision (ICCV) 2013. He h
 as also been the recipient of an NSF CAREER Award\, an IBM Faculty Award\,
  a Google Faculty Research Award\, and a Facebook Research Award. From
  2016 to 2021\, he was an Assistant Professor in the Department of Computer
  Science at the University of Virginia. Vicente obtained his PhD in
  Computer Scienc
 e at the University of North Carolina at Chapel Hill\, an MS at Stony Broo
 k University\, and an engineering degree at the Escuela Superior Politécn
 ica del Litoral in Ecuador. In the past\, he has also been a visiting rese
 archer at the Allen Institute for Artificial Intelligence\, Adobe Research
 \, Amazon Alexa AI\, and the Amazon AGI Foundations team.
LOCATION:SR14 (English Faculty Building\, 9 West Road\, Sidgwick Site) and
  online (https://cam-ac-uk.zoom.us/j/86890624365?pwd=oYGWpY7d5r3JOaUCaJXTD
 0sRECFxab.1)
END:VEVENT
END:VCALENDAR
