BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:How does gradient descent work? - Jeremy Cohen (Flatiron)
DTSTART:20251212T143000Z
DTEND:20251212T153000Z
UID:TALK241861@talks.cam.ac.uk
CONTACT:Xianda Sun
DESCRIPTION:Optimization is the engine of deep learning\, yet the theory o
 f optimization has had little impact on the practice of deep learning. Why
 ? In this talk\, we will first show that traditional theories of optimizat
 ion cannot explain the convergence of the simplest optimization algorithm 
 — deterministic gradient descent — in deep learning. Whereas tradition
 al theories assert that gradient descent converges because the curvature o
 f the loss landscape is “a priori” small\, we will explain how\, in re
 ality\, gradient descent converges because it *dynamically avoids* high-
 curv
 ature regions of the loss landscape. Understanding this behavior requires 
 Taylor expanding to third order\, which is one order higher than normally 
 used in optimization theory. While the “fine-grained” dynamics of grad
 ient descent involve chaotic oscillations that are difficult to analyze\, 
 we will demonstrate that the “time-averaged” dynamics are\, fortunatel
 y\, much more tractable. We will present an analysis of these time-average
 d dynamics that yields highly accurate quantitative predictions in a varie
 ty of deep learning settings. Since gradient descent is the simplest optim
 ization algorithm\, we hope this analysis can help point the way towards a
  mathematical theory of optimization in deep learning.\n\nBio: Jeremy Cohe
 n is a research fellow at the Flatiron Institute. He has recently been wo
 rking on understanding optimization in deep learning. He obtained his Ph
 D in 2024 from Carnegie Mellon University\, advised by Zico Kolter and Am
 eet Talwalkar.
LOCATION:Cambridge University Engineering Department\, CBL Seminar room BE
 4-38.
END:VEVENT
END:VCALENDAR
