The Past, Present and Future of Tokenization
- Speaker: Benjamin Minixhofer (Language Technology Lab, University of Cambridge)
- Date & Time: Friday 29 November 2024, 12:00 - 13:00
- Venue: Zoom link: https://cam-ac-uk.zoom.us/j/4751389294?pwd=Z2ZOSDk0eG1wZldVWG1GVVhrTzFIZz09
Abstract
Current large language models (LLMs) predominantly use subword tokenization: they see text as chunks (called “tokens”) made up of individual words or parts of words. This has a number of consequences. For example, LLMs often struggle with seemingly simple tasks involving character-level knowledge, like counting the number of letters in a word or comparing two numbers. Subword tokenization can also lead to discrepancies across languages: processing English text with an LLM is often cheaper than processing text in other languages. We will talk about how these issues came to be, as well as how to potentially improve tokenization by moving away from subwords (e.g., to models directly ingesting bytes) and/or towards more adaptive, modular tokenization. Finally, we will conclude by discussing the far reach of tokenization into seemingly unrelated fields (model merging and multimodality).
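As a rough illustration of what "directly ingesting bytes" means for input length, the sketch below converts text to its UTF-8 byte values and compares character and byte counts. It does not reproduce any tokenizer or model discussed in the talk; the function names `byte_tokens` and `length_summary` and the sample strings are hypothetical, chosen only for this example.

```python
def byte_tokens(text: str) -> list[int]:
    """Return the UTF-8 byte values a byte-level model would see as input."""
    return list(text.encode("utf-8"))


def length_summary(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes) for the text."""
    return len(text), len(byte_tokens(text))


if __name__ == "__main__":
    # Illustrative sample strings only (assumption: not taken from the talk).
    samples = {
        "English": "tokenization",
        "German": "Größe",
        "Greek": "λέξη",
        "Japanese": "日本語",
    }
    for language, text in samples.items():
        chars, n_bytes = length_summary(text)
        print(f"{language}: {chars} characters, {n_bytes} bytes")
```

Characters outside the ASCII range take more than one byte in UTF-8, so a byte-level input for the same number of characters can be longer in some scripts than in others. This is only a proxy for sequence length, not a measurement of any actual model's cost.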
Speaker Biography: Benjamin Minixhofer is a PhD student in the Language Technology Lab, interested in multilinguality, tokenization and language emergence.
Series: This talk is part of the NLIP Seminar Series.