BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Learning generalizable models on large-scale multi-modal data - Yu
 tian Chen - DeepMind
DTSTART:20231018T130000Z
DTEND:20231018T140000Z
UID:TALK207205@talks.cam.ac.uk
CONTACT:James Fergusson
DESCRIPTION:The abundant spectrum of multi-modal data provides a significa
 nt opportunity for augmenting the training of foundation models beyond me
 re text. In this talk\, I will introduce two lines of work that leverage l
 arge-scale models trained on Internet-scale multi-modal datasets to achie
 ve strong generalization performance. The first trains an audio-visual mo
 del on YouTube video data and enables automatic video translation and dub
 bing. The model learns the correspondence between audio and visual featur
 es and uses this knowledge to translate videos from one language to anoth
 er. The second trains a multi-modal\, multi-task\, multi-embodiment gener
 alist policy on a massive collection of simulated control tasks and visio
 n\, language\, and robotics data. The model learns to perform a variety o
 f tasks\, including controlling a robot arm\, playing games\, and transla
 ting text. Both lines of work illustrate the potential future trajectory 
 of foundation models\, highlighting the transformative power of integrati
 ng multi-modal inputs and outputs.
LOCATION:Maxwell Centre
END:VEVENT
END:VCALENDAR
