BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Data Repurposing: Improving LLM Capabilities with Synthetic Data G
 eneration - Abdullatif Köksal\, LMU Munich
DTSTART:20241128T110000Z
DTEND:20241128T120000Z
UID:TALK222283@talks.cam.ac.uk
CONTACT:Tiancheng Hu
DESCRIPTION:Abstract: Creating high-quality\, diverse\, large-scale datase
 ts remains a critical and time-consuming challenge in improving LLM capabi
 lities. Motivated by prior work that manually identifies implicit signals 
 in raw corpora\, we aim to address this challenge by investigating data re
 purposing strategies\, a methodology for automatically transforming existi
 ng data resources into new formats and purposes. First\, we propose revers
 e instructions to build an English instruction-following dataset by synthe
 tically generating instructions for a given human-written corpus document.
  The model trained with our synthetic dataset performs significantly bette
 r than other instruction-following models\, especially in long-form genera
 tion. Next\, in MURI\, we extend this approach to 200 languages to create 
 a culturally inclusive\, native dataset and multilingual instruction-follo
 wing models for very low-resource languages. Finally\, we customize this a
 pproach to generate any downstream dataset in CRAFT\, targeting unannotate
 d corpora to synthesize custom downstream task examples by retrieving and 
 rewriting corpus documents using few-shot examples. Our experiments demons
 trate that this approach can generate large-scale datasets for any given t
 ask\, showing up to a 25% improvement in tasks such as biology QA and summ
 arization compared to few-shot settings.\n\nBio: Abdullatif Köksal is a f
 inal-year ELLIS PhD student at CIS\, LMU Munich and LTL\, University of Ca
 mbridge\, supervised by Prof. Hinrich Schütze and Prof. Anna Korhonen. Hi
 s research focuses on improving LLM capabilities through effective data ut
 ilization and synthetic data generation. He has proposed several works aro
 und data repurposing by restructuring and augmenting existing data resourc
 es\, including reverse instructions for long-form instruction-tuning and a
  culturally respectful multilingual instruction-following dataset for 200
  languages. He expanded these approaches to dataset generation for downst
 re
 am tasks through better corpus mining with LLMs in CRAFT. He has also work
 ed in other areas such as counterfactuality\, robustness\, and multilingua
 lity\, and has published multiple papers in top-tier NLP venues. He has in
 terned at Google and Amazon\, where he worked on counterfactuality and fai
 thfulness.\n
LOCATION:GR04\, English Faculty Building\, 9 West Road\, Sidgwick Site and
  online https://cam-ac-uk.zoom.us/j/97599459216?pwd=QTRsOWZCOXRTREVnbTJBdX
 VpOXFvdz09
END:VEVENT
END:VCALENDAR