BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:One-shot visual language understanding with cross-modal translatio
 n and LLMs - Fangyu LIU (University of Cambridge)
DTSTART:20230223T110000Z
DTEND:20230223T120000Z
UID:TALK197731@talks.cam.ac.uk
CONTACT:Panagiotis Fytas
DESCRIPTION:Visual language such as charts and plots is ubiquitous in the h
 uman world. Comprehending plots and charts requires strong reasoning skill
 s. Prior state-of-the-art (SOTA) models require at least tens of thousand
 s of training examples\, and their reasoning capabilities are still quit
 e limited\, especially on complex human-written queries. We present the f
 irst one-shot solution to visual language reasoning. We decompose the cha
 llenge of visual language reasoning into two steps: (1) plot-to-text tran
 slation\, and (2) reasoning over the translated text. The key to this met
 hod is a modality conversion module named DePlot\, which translates the i
 mage of a plot or chart into a linearized table. The output of DePlot ca
 n then be used directly to prompt a pretrained large language model (LLM
 )\, exploiting the few-shot reasoning capabilities of LLMs. To obtain DeP
 lot\, we standardize the plot-to-table task by establishing unified task f
 ormats and metrics\, and we train DePlot end-to-end on this task. DePlot c
 an then be used off-the-shelf together with LLMs in a plug-and-play fashi
 on. Compared with a SOTA model finetuned on thousands of data points\, D
 ePlot+LLM with just one-shot prompting achieves a 29.4% improvement on h
 uman-written queries from the chart QA task.
LOCATION:GR04\, English Faculty Building\, 9 West Road\, Sidgwick Site
END:VEVENT
END:VCALENDAR
