BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Research Progress in Mechanistic Interpretability - Arthur Conmy (
 Google DeepMind)
DTSTART:20250509T110000Z
DTEND:20250509T120000Z
UID:TALK229927@talks.cam.ac.uk
CONTACT:Suchir Salhan
DESCRIPTION:The goal of Mechanistic Interpretability research is to explai
 n how neural networks compute outputs in terms of their internal component
 s. But how much progress has been made towards this goal? While a large am
 ount of Mechanistic Interpretability research has been produced by academi
 a\, frontier AI companies such as Google DeepMind and independent research
 ers in recent years\, there are still large open problems in the field. In
  this talk\, I will begin by discussing some background hypotheses and tec
 hniques in Mechanistic Interpretability\, such as the Linear Representatio
 n Hypothesis and common causal interventions. Then\, I’ll discuss how th
 is connects to research we’ve done at Google DeepMind in the past year\,
  such as open sourcing Gemma Scope\, the most comprehensive set of Sparse 
 Autoencoders\, which took over 20% of the compute used to train GPT-3. Fin
 ally\, I’ll reflect on current priorities and disagreements in Mechanist
 ic Interpretability\, several of which are built from Gemma Scope. In shor
 t\, Mechanistic Interpretability is able to uncover factors influencing mo
 del behavior that cannot naively be inferred from prompts and outputs via 
 circuits research\, but Mechanistic Interpretability has thus far underper
 formed when benchmarked on well-defined real-world tasks (such as probing 
 for harmful intent in user prompts).\n\nArthur Conmy is a Senior Research 
 Engineer at Google DeepMind who works on the Mechanistic Interpretability 
 team.\n
LOCATION:Room FW26 with Hybrid Format. Here is the Zoom link for those who
  wish to join online: https://cam-ac-uk.zoom.us/j/4751389294?pwd=Z2ZOSDk0
 eG1wZldVWG1GVVhrTzFIZz09
END:VEVENT
END:VCALENDAR
