BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Tutorial: Gradient Methods in RL and Their Convergence - David Sis
 ka (University of Edinburgh)
DTSTART:20251107T100000Z
DTEND:20251107T130000Z
UID:TALK235627@talks.cam.ac.uk
DESCRIPTION:Our aim is to learn about policy gradient methods for solving r
 einforcement learning (RL) problems modelled using the Markov decision pro
 cess (MDP) framework with general (possibly continuous\, possibly infinite
 -dimensional) state and action spaces. We will focus mainly on the theoret
 ical convergence of mirror descent with direct parametrisation and of natu
 ral-gradient descent when employing log-linear parametrisation. For our pu
 rposes\, solving an RL problem means finding a (nearly) optimal policy in
  a situation where the transition dynamics and costs are unknown but we ca
 n repeatedly interact with some system (or environment simulator).\nThere
  are two main approaches to solving RL problems: action-value methods\, wh
 ich learn the state-action value function (the Q-function) and then select
  actions based on it. Their convergence is well understood\, see Watkins a
 nd Dayan [1992] and [Sutton and Barto\, 2018\, Ch. 6]\, and will not be di
 scussed here. Policy gradient methods directly update the policy by steppi
 ng in the direction of the gradient of the value function and have a long
  history\, for which the reader is referred to [Sutton and Barto\, 2018\,
  Ch. 13]. Their convergence is only understood in specific settings\, as w
 e will see below. The focus here is to cover generic (Polish) state and ac
 tion spaces. We will touch upon the popular PPO algorithm of Schulman et a
 l. [2017] and explain the difficulties that arise when trying to prove con
 vergence of PPO.\nMany related and interesting questions will not be touch
 ed upon: convergence of actor-critic methods\, convergence in the presence
  of Monte-Carlo errors\, regret\, off-policy gradient methods\, and near-c
 ontinuous-time RL.\nLarge parts of what is presented here\, in particular
  on mirror descent and natural-gradient descent\, are from Kerimkulov et a
 l. [2025]. This work was itself inspired by the recent results of Mei et a
 l. [2021]\, Lan [2023] and Cayci et al. [2021].
LOCATION:Enigma Room\, The Alan Turing Institute
END:VEVENT
END:VCALENDAR
