BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Vibe checks and red teaming: why ML researchers are increasingly r
 everting to manual evaluation - Arduin Findeis (University of Cambridge)
DTSTART:20240116T130000Z
DTEND:20240116T140000Z
UID:TALK206302@talks.cam.ac.uk
CONTACT:Mateja Jamnik
DESCRIPTION:There is a curious trend in machine learning (ML): researchers
  developing the most capable large language models (LLMs) increasingly eva
 luate them using manual methods such as red teaming. In red teaming\, rese
 archers hire workers to manually try to break the LLM in some form by inte
 racting with it. Similarly\, some users pick their preferred LLM assistant
  by manually trying out various models – checking each LLM's "vibe". Con
 sidering that LLM researchers and users both actively seek to automate all
  sorts of other tasks\, red teaming and vibe checks are surprisingly manua
 l evaluation processes. This trend towards manual evaluation hints at fund
 amental problems that prevent more automatic evaluation methods\, such as
  benchmarks\, from being used effectively for LLMs. In this talk\, I aim
  to give
  an overview of the problems preventing LLM benchmarks from being a fully 
 satisfactory alternative to more manual approaches.\n\n"You can also join 
 us on Zoom":https://cam-ac-uk.zoom.us/j/92041617729
LOCATION:Lecture Theatre 2\, Computer Laboratory\, William Gates Building
END:VEVENT
END:VCALENDAR
