Subtleties about Pre-Training Data: Imbalance and Staleness
- Speaker: Daniel Khashabi, Johns Hopkins University
- Date & Time: Thursday 07 November 2024, 16:00 - 17:00
- Venue: https://cam-ac-uk.zoom.us/j/97599459216?pwd=QTRsOWZCOXRTREVnbTJBdXVpOXFvdz09
Abstract
The success of pre-trained large language models (LLMs) is largely attributed to the extensive and diverse data used during their pre-training phase. Leveraging this pre-training effectively can lead to notable improvements in model quality, robustness, and cost-efficiency. First, I will address the challenges of [pre-]training on imbalanced datasets, such as those found in multilingual settings where data availability varies greatly between high- and low-resource languages. Common approaches to mitigate this issue include upsampling low-resource languages or increasing their loss weight. Although these methods are often seen as equivalent, I will demonstrate through theoretical and empirical evidence that they are distinct. Based on these insights, we propose a strategy for efficient and balanced training on imbalanced datasets.

Second, I will investigate the issue of temporal degradation in LLMs, which arises after the cutoff dates for training data collection. Our empirical evidence indicates that this degradation often begins well before the stated cutoff, a point we call the "effective cutoff" date. I will discuss our analysis of open pre-training datasets, which uncovers the main causes for these observations. These findings imply that knowledge cutoffs are more intricate than previously thought, necessitating careful consideration from both LLM dataset curators and users.
Based on the following works:
1. Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets: https://arxiv.org/abs/2410.04579
2. Dated Data: Tracing Knowledge Cutoffs in Large Language Models: https://arxiv.org/abs/2403.12958
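The upsample-versus-upweight distinction in the abstract can be illustrated with a toy sketch. This is not the paper's method, just a minimal illustration with a hypothetical two-"language" corpus: both schemes give the low-resource data the same expected total weight per epoch, yet under stochastic mini-batching they behave differently, which is why the two are not interchangeable.

```python
# Hypothetical toy corpus: a high-resource and a low-resource "language".
high = [("hi", i) for i in range(100)]
low = [("lo", i) for i in range(10)]

def upsample(high, low, factor):
    """Upsampling: duplicate low-resource examples so they appear
    `factor` times as often in the training stream."""
    return high + low * factor

def upweight(high, low, factor):
    """Upweighting: keep the data as-is, but attach a loss weight so
    each low-resource example contributes `factor` times the loss."""
    return [(x, 1.0) for x in high] + [(x, float(factor)) for x in low]

# Both schemes assign low-resource data the same *expected* total
# loss mass per epoch...
up = upsample(high, low, 10)
uw = upweight(high, low, 10)
mass_up = sum(1 for lang, _ in up if lang == "lo")
mass_uw = sum(w for (lang, _), w in uw if lang == "lo")
assert mass_up == mass_uw == 100.0

# ...but under stochastic mini-batching the training dynamics differ:
# upsampling changes how often low-resource examples are *sampled*
# (more optimizer steps touch them, with ordinary-magnitude gradients),
# while upweighting keeps the sampling frequency fixed and instead
# scales up the magnitude of the rarer low-resource gradients.
```

The talk's point, per the abstract, is that these two regimes are often treated as equivalent but are provably and empirically distinct.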
Bio: Daniel Khashabi is an assistant professor of computer science at Johns Hopkins University, affiliated with the Center for Language and Speech Processing (CLSP) and the Data Science and AI Institute. He is interested in building reasoning-driven modular NLP systems that are robust, transparent, and communicative, particularly those that use natural language as the communication medium. Khashabi has published over 50 papers on natural language processing and AI in top-tier venues. His research has won best paper awards at COLM (2024), ACL (2023), and NAACL (2022), an Amazon Research Award (2022), and AI2's Last Impact Award (2024). Before joining Hopkins, he was a postdoctoral fellow at the Allen Institute for AI (2019-2022) and obtained his Ph.D. from the University of Pennsylvania in 2019.
Series: This talk is part of the Language Technology Lab Seminars series.
Included in Lists
- bld31
- Cambridge Centre for Data-Driven Discovery (C2D3)
- Cambridge Forum of Science and Humanities
- Cambridge Language Sciences
- Cambridge talks
- Chris Davis' list
- Guy Emerson's list
- https://cam-ac-uk.zoom.us/j/97599459216?pwd=QTRsOWZCOXRTREVnbTJBdXVpOXFvdz09
- Interested Talks
- Language Sciences for Graduate Students
- Language Technology Lab Seminars
- ndk22's list
- ob366-ai4er
- rp587
- Simon Baker's List
- Trust & Technology Initiative - interesting events
- yk449
Note: Ex-directory lists are not shown.