A Bayesian Perspective on Generalization and SGD
- Speaker: Dr. Samuel L. Smith, Google Brain
- Date & Time: Tuesday 17 April 2018, 11:00–12:00
- Venue: Cambridge University Engineering Department, Lecture Room 11
Abstract
This talk presents simple Bayesian insights on two fundamental questions:
1. How can we predict whether a model optimized on the training set will perform well on new test data?
2. Why is Stochastic Gradient Descent unreasonably effective at finding local minima that perform well?
I will begin with a brief refresher on Bayesian model comparison, demonstrating that we ought to seek “flat” local minima which minimize a weighted combination of the value of the cost function at the minimum and an “Occam factor” which penalizes curvature.
Zhang et al. [1] received the best paper award at ICLR 2017 for demonstrating deep convolutional networks can easily memorize random relabelings of their training sets. We show that the same phenomenon occurs in linear models. Bayesian model comparison successfully rejects models trained on random labels but accepts models trained on informative labels.
Keskar et al. [2] found that the performance of deep learning models often improves if one reduces the SGD batch size used to estimate the gradient. We argue that this can be understood directly from the principles above. Reducing the batch size introduces noise to the parameter updates, and this noise drives SGD towards flat minima which are likely to generalize well. Treating SGD as a stochastic differential equation, we predict scaling rules which describe how the optimum batch size is controlled by the learning rate, training set size and momentum coefficient. Finally, we demonstrate that decaying the learning rate and increasing the batch size during training are equivalent, yielding the same test accuracy after the same number of training epochs. We use this insight to train ResNet-50 on TPU in under 30 minutes.
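The equivalence described above can be sketched with a short, illustrative calculation. Under the stochastic-differential-equation view, the SGD gradient-noise scale is approximately g ≈ εN/(B(1−m)), where ε is the learning rate, N the training set size, B the batch size and m the momentum coefficient (valid when B ≪ N). The function name below is hypothetical, chosen only for this sketch:

```python
def sgd_noise_scale(lr, train_size, batch_size, momentum=0.0):
    """Approximate SGD noise scale g ~ lr * N / (B * (1 - m)),
    valid when the batch size is much smaller than the training set.
    This is a sketch of the scaling rule discussed in the talk."""
    return lr * train_size / (batch_size * (1.0 - momentum))

# Decaying the learning rate by a factor k, or increasing the batch
# size by the same factor k, leaves the noise scale unchanged --
# so the two training schedules should behave equivalently.
g_decayed_lr = sgd_noise_scale(lr=0.1 / 4, train_size=50_000, batch_size=128)
g_bigger_batch = sgd_noise_scale(lr=0.1, train_size=50_000, batch_size=128 * 4)
assert abs(g_decayed_lr - g_bigger_batch) < 1e-9
```

Holding the noise scale constant in this way is what lets large-batch training with a proportionally larger learning rate match the test accuracy of small-batch training.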
[1] Understanding deep learning requires rethinking generalization, Zhang et al., ICLR 2017
[2] On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, Keskar et al., ICLR 2017
Bio
Following a PhD in theoretical physics at the University of Cambridge, Sam joined the machine learning team at Babylon Health, where he developed a medical chatbot for primary care. In July 2017 he moved to California for the Google Brain Residency. His research focuses on optimization and natural language processing.
This talk is part of the Machine Learning @ CUED series.