Research Progress in Mechanistic Interpretability

Arthur Conmy (Google DeepMind)
Friday 09 May 2025, 12:00-13:00
Room FW26, hybrid format. Zoom link for those who wish to join online: https://cam-ac-uk.zoom.us/j/4751389294?pwd=Z2ZOSDk0eG1wZldVWG1GVVhrTzFIZz09

If you have a question about this talk, please contact Suchir Salhan.

The goal of Mechanistic Interpretability research is to explain how neural networks compute outputs in terms of their internal components. But how much progress has been made towards this goal? While a large amount of Mechanistic Interpretability research has been produced in recent years by academia, frontier AI companies such as Google DeepMind, and independent researchers, there are still large open problems in the field. In this talk, I will begin by discussing some background hypotheses and techniques in Mechanistic Interpretability, such as the Linear Representation Hypothesis and common causal interventions. Then, I’ll discuss how this connects to research we’ve done at Google DeepMind in the past year, such as open sourcing Gemma Scope, the most comprehensive set of Sparse Autoencoders, which took over 20% of the compute used to train GPT-3. Finally, I’ll reflect on current priorities and disagreements in Mechanistic Interpretability, several of which are built from Gemma Scope. In short, Mechanistic Interpretability is able to uncover, via circuits research, factors influencing model behavior that cannot naively be inferred from prompts and outputs, but it has thus far underperformed when benchmarked on well-defined real-world tasks (such as probing for harmful intent in user prompts).

Arthur Conmy is a Senior Research Engineer at Google DeepMind who works on the Mechanistic Interpretability team.

This talk is part of the NLIP Seminar Series.

