BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Subtleties about Pre-Training Data: Imbalance and Staleness - Dani
 el Khashabi\, Johns Hopkins University
DTSTART:20241107T160000Z
DTEND:20241107T170000Z
UID:TALK222277@talks.cam.ac.uk
CONTACT:Tiancheng Hu
DESCRIPTION:Abstract: The success of pre-trained large language models (LL
 Ms) is largely attributed to the extensive and diverse data used during th
 eir pre-training phase. Leveraging this pre-training data effectively can
  lead to notable improvements in model quality\, robustness\, and
  cost-efficiency. Firstly\, I will address the challenges of
  [pre-]training on imbalanced
  datasets\, such as those found in multilingual settings where data availa
 bility varies greatly between high- and low-resource languages. Common app
 roaches to mitigate this issue include upsampling low-resource languages o
 r upweighting their loss. Although these methods are often seen as eq
 uivalent\, I will demonstrate through theoretical and empirical evidence t
 hat they are distinct. Based on these insights\, we propose a strategy for
  efficient and balanced training on imbalanced datasets. Secondly\, I will
  investigate the issue of temporal degradation in LLMs\, which arises afte
 r the cutoff dates for training data collection. Our empirical evidence in
 dicates that this degradation often begins well before the stated cutoff\,
  a point we call the "effective cutoff" date. I will discuss our analysis 
 of open pre-training datasets\, which uncovers the main causes of these o
 bservations. These findings imply that knowledge cutoffs are more intricat
 e than previously thought\, necessitating careful consideration from both 
 LLM dataset curators and users.\n\nBased on the following works: \n\n1. Up
 sample or Upweight? Balanced Training on Heavily Imbalanced Datasets: http
 s://arxiv.org/abs/2410.04579\n\n2. Dated Data: Tracing Knowledge Cutoffs i
 n Large Language Models: https://arxiv.org/abs/2403.12958\n\nBio: Daniel K
 hashabi is an assistant professor of computer science at Johns Hopkins Uni
 versity and is affiliated with the Center for Language and Speech Processi
 ng (CLSP) and the Data Science and AI Institute. He is interested in build
 ing reasoning-driven modular NLP systems that are robust\, transparent\, a
 nd communicative\, particularly those that use natural language as the com
 munication medium. Khashabi has published over 50 papers on natural langua
 ge processing and AI in top-tier venues. His research has won best paper
  awards at COLM (2024)\, ACL (2023)\, and NAACL (2022)\, an Amazon Research
  Award (2022)\, and AI2's Lasting Impact Award (2024). Before joining
  Hopkins
 \, he was a postdoctoral fellow at the Allen Institute for AI (2019-2022) 
 and obtained a Ph.D. from the University of Pennsylvania in 2019.
LOCATION:https://cam-ac-uk.zoom.us/j/97599459216?pwd=QTRsOWZCOXRTREVnbTJBd
 XVpOXFvdz09
END:VEVENT
END:VCALENDAR
