BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:A Bayesian Perspective on Generalization and SGD - Dr. Samuel L. S
 mith\, Google Brain
DTSTART:20180417T100000Z
DTEND:20180417T110000Z
UID:TALK104371@talks.cam.ac.uk
CONTACT:Alexander Matthews
DESCRIPTION:ABSTRACT:\n\nThis talk presents simple Bayesian insights on tw
 o fundamental questions:\n\n1. How can we predict whether a model optimize
 d on the training set will perform well on new test data?\n\n2. Why is Sto
 chastic Gradient Descent unreasonably effective at finding local minima th
 at perform well?\n\nI will begin with a brief refresher on Bayesian model 
 comparison\, demonstrating that we ought to seek "flat" local minima which
  minimize a weighted combination of the value of the cost function at the 
 minimum and an "Occam factor" which penalizes curvature. \n\nZhang et al. 
 [1] received the best paper award at ICLR 2017 for demonstrating deep conv
 olutional networks can easily memorize random relabelings of their trainin
 g sets. We show that the same phenomenon occurs in linear models. Bayesian
  model comparison successfully rejects models trained on random labels but
  accepts models trained on informative labels.\n\nKeskar et al. [2] found 
 that the performance of deep learning models often improves if one reduces
  the SGD batch size used to estimate the gradient. We argue that this can 
 be understood directly from the principles above. Reducing the batch size 
 introduces noise to the parameter updates\, and this noise drives SGD tow
 ards flat minima which are likely to generalize well. Treating SGD as a st
 ochastic differential equation\, we predict scaling rules which describe h
 ow the optimum batch size is controlled by the learning rate\, training s
 et size and momentum coefficient. Finally\, we demonstrate that decaying t
 he learning rate and increasing the batch size during training are equival
 ent\; obtaining the same test accuracy after the same number of training e
 pochs\, and we use this insight to train ResNet-50 on TPU in under 30 minu
 tes.\n\n[1] Understanding deep learning requires rethinking generalization
 \, Zhang et al.\, ICLR 2017\n\n[2] On Large-Batch Training for Deep Learni
 ng: Generalization Gap and Sharp Minima\, Keskar et al.\, ICLR 2017\n\nBIO
 :\n\nFollowing a PhD in theoretical Physics at the University of Cambridge
 \, Sam joined the machine learning team at Babylon Health\, developing a m
 edical chatbot for primary care. In July 2017\, he moved to California fo
 r the Google Brain Residency. His research is focused on optimization and 
 natural language processing.
LOCATION:Cambridge University Engineering Department\, Lecture Room 11
END:VEVENT
END:VCALENDAR
