BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:The Past\, Present and Future of Tokenization - Benjamin Minixhofe
 r (Language Technology Lab\, University of Cambridge)
DTSTART:20241129T120000Z
DTEND:20241129T130000Z
UID:TALK222628@talks.cam.ac.uk
CONTACT:Suchir Salhan
DESCRIPTION:Abstract:\n\nCurrent large language models (LLMs) predominant
 ly use subword tokenization. They see text as chunks (called "tokens") mad
 e up of individual words\, or parts of words. This has a number of consequ
 ences. For example\, LLMs often struggle with seemingly simple tasks invol
 ving character-level knowledge\, like counting the number of letters in a 
 word or comparing two numbers. Subword tokenization can also lead to discr
 epancies across languages: processing English text with an LLM is often ch
 eaper than processing text in other languages. We will talk about how thes
 e issues came to be\, as well as how to potentially improve tokenization b
 y moving away from subwords (e.g.\, to models directly ingesting bytes) an
 d/or towards more adaptive\, modular tokenization. Finally\, we will concl
 ude by discussing the far reach of tokenization into seemingly unrelated f
 ields (model merging and multimodality).\n\nSpeaker Biography: Benjami
 n Minixhofer is a PhD student in the Language Technology Lab\, interested 
 in multilinguality\, tokenization and language emergence.
LOCATION:Zoom link: https://cam-ac-uk.zoom.us/j/4751389294?pwd=Z2ZOSDk0eG1
 wZldVWG1GVVhrTzFIZz09
END:VEVENT
END:VCALENDAR
