BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Asymmetry in Supposedly Equivalent Facts: Pre-training Bias in Lar
 ge Language Models - Zifeng Ding (University of Cambridge)
DTSTART:20250502T110000Z
DTEND:20250502T120000Z
UID:TALK227419@talks.cam.ac.uk
CONTACT:Suchir Salhan
DESCRIPTION:Understanding and mitigating hallucinations in Large Language 
 Models (LLMs) is crucial for ensuring reliable content generation. While p
 revious research has primarily focused on “when” LLMs hallucinate\, ou
 r work explains “why” and directly links model behaviour to the pre-tr
 aining data that forms their prior knowledge. Specifically\, we demonstrat
 e that an asymmetry exists in the recognition of logically equivalent fact
 s\, which can be attributed to frequency discrepancies of entities appeari
 ng as subjects versus objects. Given that most pre-training datasets are i
 naccessible\, we leverage the fully open-source OLMo series by indexing it
 s Dolma dataset to estimate entity frequencies. Using relational facts (re
 presented as triples) from Wikidata5M\, we construct probing datasets to i
 solate this effect. Our experiments reveal that facts with a high-frequenc
 y subject and a low-frequency object are better recognised than their inve
 rse\, despite their logical equivalence. The pattern reverses in low-to-hi
 gh frequency settings\, and no statistically significant asymmetry emerges
  when both entities are high-frequency. These findings underscore the infl
 uential role of pre-training data in shaping model predictions and provide
  insights for inferring the characteristics of pre-training data in closed
  or partially closed LLMs.
LOCATION:Room FW26 (hybrid format). Here is the Zoom link for those who
  wish to join online: https://cam-ac-uk.zoom.us/j/4751389294?pwd=Z2ZOSDk0
 eG1wZldVWG1GVVhrTzFIZz09
END:VEVENT
END:VCALENDAR
