BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:End-to-End Fine-grained Multi-modal Understanding - Aishwarya Kama
 th\, New York University
DTSTART:20221117T110000Z
DTEND:20221117T120000Z
UID:TALK192863@talks.cam.ac.uk
CONTACT:Panagiotis Fytas
DESCRIPTION:Previously\, multi-modal reasoning systems relied on a pre-tra
 ined object detector to extract regions of interest from the image. Howeve
 r\, this crucial module was typically used as a black box\, trained indepe
 ndently of the downstream task and on a fixed vocabulary of objects and at
 tributes. This made it challenging for such systems to capture the long ta
 il of visual concepts expressed in free form text. In this talk\, I will f
 irst discuss MDETR\, an end-to-end modulated detector that detects objects
  in an image\, conditioned on a raw text query like a caption or a questio
 n. The model is trained on 1.3M text-image pairs\, mined from pre-existing
  multi-modal datasets having explicit alignment between phrases in text an
 d objects in the image. Next\, we will explore further developments in arc
 hitecture design that employ fusion between the visual and textual modalit
 ies deeper in the model\, achieving state-of-the-art results when coup
 led with a coarse-to-fine pre-training strategy. Finally\, I will disc
 uss a no
 vel fine-grained visual understanding task and evaluation benchmark\, wh
 ich shows that existing benchmarks overestimate VL models' ability to un
 derstand and reason over complex visual scenes\, leaving substantial roo
 m for improvement.
LOCATION:GR04\, English Faculty Building\, 9 West Road\, Sidgwick Site
END:VEVENT
END:VCALENDAR
