BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Optimizing for the Long-Term Without Delay - Kelly Zhang (Imperial
  College London)
DTSTART:20251110T163000Z
DTEND:20251110T171000Z
UID:TALK238546@talks.cam.ac.uk
DESCRIPTION:Increasingly\, recommender systems are tasked with improving u
 sers’ long-term satisfaction. In this context\, we study a content 
 exploration task\, which we formalize as a bandit problem with delayed rew
 ards. There is an apparent trade-off in choosing the learning signal: wait
 ing for the full reward to become available might take several weeks\, slo
 wing the rate of learning\, whereas using short-term proxy rewards reflect
 s the actual long-term goal only imperfectly. First\, we develop a predict
 ive model of delayed rewards that incorporates all information obtained to
  date. Rewards as well as shorter-term surrogate outcomes are combined thr
 ough a Bayesian filter to obtain a probabilistic belief. Second\, we devis
 e a bandit algorithm that quickly learns to identify content aligned with 
 long-term success using this new predictive model. We prove a regret bound
  for our algorithm that depends on the Value of Progressive Feedback\, an 
 information theoretic metric that captures the quality of short-term leadi
 ng indicators that are observed prior to the long-term reward. We apply ou
 r approach to a podcast recommendation problem\, where we seek to recommen
 d shows that users engage with repeatedly over two months. We empirically 
 validate that our approach significantly outperforms methods that optimize
  for short-term proxies or rely solely on delayed rewards\, as demonstrate
 d by an A/B test in a recommendation system that serves hundreds of millio
 ns of users.
LOCATION:Seminar Room 1\, Newton Institute
END:VEVENT
END:VCALENDAR
