BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Tra
 ining - Antonio Norelli\, Sapienza University of Rome
DTSTART:20230215T150000Z
DTEND:20230215T160000Z
UID:TALK197683@talks.cam.ac.uk
CONTACT:Pietro Barbiero
DESCRIPTION:CLIP proved that aligning visual and language spaces is key to
 solving many vision tasks without explicit training\, but required traini
 ng image and text encoders from scratch on a huge dataset. LiT improved t
 his by training only the text encoder and using a pre-trained vision netw
 ork. In this talk\, we will present the ASIF construction\, showing that 
 a common space can be created without any training at all\, using single-
 domain encoders (trained with or without supervision) and a much smaller 
 number of image-text pairs.\n\nThen\, we will discuss the unique properti
 es of ASIF. Most notably\, deploying a new version with updated training 
 samples can be done in a matter of seconds. Additionally\, the representa
 tions in the common space are easily interpretable\, as every dimension c
 orresponds to the similarity of the input to a unique entry in the multim
 odal dataset. We will look at experiments on standard zero-shot visual be
 nchmarks that demonstrate the typical transfer ability of image-text mode
 ls. Overall\, ASIF represents a simple yet surprisingly strong baseline f
 or foundation multimodal models\, raising important questions about their
  data efficiency and about the role of retrieval in machine learning.
LOCATION:Venue to be confirmed
END:VEVENT
END:VCALENDAR
