BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding
  in Video-Language Models - Speaker to be confirmed
DTSTART:20231026T100000Z
DTEND:20231026T110000Z
UID:TALK207655@talks.cam.ac.uk
CONTACT:Panagiotis Fytas
DESCRIPTION:Vision-and-Language (VL) models should be able to address shor
 tcomings of Large Language Models (LLMs) (e.g.\, lack of symbol grounding\
 , reporting bias affecting commonsense knowledge) by transferring informat
 ion encoded in the visual domain to the language domain\, and vice versa\,
  by successfully modeling a cross-modal space. Fueled by the recent succes
 ses of VL models integrating text with images\, the community has started 
 researching VL models integrating text with video sequences. Integrating l
 anguage with temporal video sequences should provide (i) models with bette
 r grounding capabilities as well as (ii) the ability to capitalize on an e
 ven larger body of tacit knowledge\, such as presuppositions\, consequen
 ces\, or temporal reasoning. Despite promising results on multimodal tasks
  (such as Image Captioning\, Visual Question Answering\, Image-Text Retrie
 val etc.)\, recent literature has shown that models integrating image and 
 text are highly susceptible to statistical bias present in large-scale tra
 ining data\, enabling them to solve multimodal tasks without actually lev
 eraging multimodal signals. Analogously\, we focus our analysis on Video-
 and-Language models (VidLMs) and construct VILMA (Video Language Model Ass
 essment)\, a task-agnostic benchmark that places the assessment of fine-gr
 ained capabilities of these models on a firm footing. Task-based evaluatio
 ns\, while valuable\, fail to capture the complexities and specific tempor
 al aspects of moving images that VidLMs need to process. Through carefully
  curated counterfactuals\, ViLMA offers a controlled evaluation suite that
  sheds light on the true potential of these models\, as well as their perf
 ormance gaps compared to human-level understanding. ViLMA also includes pr
 oficiency tests\, which assess basic capabilities deemed essential to solv
 ing the main counterfactual tests. We show that current VidLMs’ groundin
 g abilities are no better than those of vision-language models which use s
 tatic images. This is especially striking once the performance on proficie
 ncy tests is factored in. Our benchmark serves as a catalyst for future re
 search on VidLMs\, helping to highlight areas that still need to be explor
 ed.\n
LOCATION:GR05\, English Faculty Building\, 9 West Road\, Sidgwick Site
END:VEVENT
END:VCALENDAR
