A Bayesian Perspective on Generalization and SGD
- Speaker: Dr. Samuel L. Smith, Google Brain
- Date & Time: Tuesday 17 April 2018, 11:00–12:00
- Venue: Cambridge University Engineering Department, Lecture Room 11
Abstract
This talk presents simple Bayesian insights on two fundamental questions:
1. How can we predict whether a model optimized on the training set will perform well on new test data?
2. Why is Stochastic Gradient Descent unreasonably effective at finding local minima that perform well?
I will begin with a brief refresher on Bayesian model comparison, demonstrating that we ought to seek “flat” local minima which minimize a weighted combination of the value of the cost function at the minimum and an “Occam factor” which penalizes curvature.
Zhang et al. [1] received the best paper award at ICLR 2017 for demonstrating deep convolutional networks can easily memorize random relabelings of their training sets. We show that the same phenomenon occurs in linear models. Bayesian model comparison successfully rejects models trained on random labels but accepts models trained on informative labels.
Keskar et al. [2] found that the performance of deep learning models often improves if one reduces the SGD batch size used to estimate the gradient. We argue that this can be understood directly from the principles above. Reducing the batch size introduces noise to the parameter updates, and this noise drives SGD towards flat minima which are likely to generalize well. Treating SGD as a stochastic differential equation, we predict scaling rules which describe how the optimum batch size is controlled by the learning rate, training set size and momentum coefficient. Finally, we demonstrate that decaying the learning rate and increasing the batch size during training are equivalent, yielding the same test accuracy after the same number of training epochs. We use this insight to train ResNet-50 on TPU in under 30 minutes.
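The equivalence described above can be sketched with a short, illustrative calculation. Under the stochastic-differential-equation view, the SGD gradient-noise scale is approximately g ≈ εN/(B(1−m)), where ε is the learning rate, N the training set size, B the batch size and m the momentum coefficient (valid when B ≪ N). The function name below is hypothetical, chosen only for this sketch:

```python
def sgd_noise_scale(lr, train_size, batch_size, momentum=0.0):
    """Approximate SGD noise scale g ~ lr * N / (B * (1 - m)),
    valid when the batch size is much smaller than the training set.
    This is a sketch of the scaling rule discussed in the talk."""
    return lr * train_size / (batch_size * (1.0 - momentum))

# Decaying the learning rate by a factor k, or increasing the batch
# size by the same factor k, leaves the noise scale unchanged --
# so the two training schedules should behave equivalently.
g_decayed_lr = sgd_noise_scale(lr=0.1 / 4, train_size=50_000, batch_size=128)
g_bigger_batch = sgd_noise_scale(lr=0.1, train_size=50_000, batch_size=128 * 4)
assert abs(g_decayed_lr - g_bigger_batch) < 1e-9
```

Holding the noise scale constant in this way is what lets large-batch training with a proportionally larger learning rate match the test accuracy of small-batch training.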
[1] Understanding deep learning requires rethinking generalization, Zhang et al., ICLR 2017
[2] On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, Keskar et al., ICLR 2017
Bio
Following a PhD in theoretical physics at the University of Cambridge, Sam joined the machine learning team at Babylon Health, where he developed a medical chatbot for primary care. In July 2017 he moved to California for the Google Brain Residency. His research focuses on optimization and natural language processing.
This talk is part of the Machine Learning @ CUED series.