Data Repurposing: Improving LLM Capabilities with Synthetic Data Generation
- 👤 Speaker: Abdullatif Köksal, LMU Munich
- 📅 Date & Time: Thursday 28 November 2024, 11:00 - 12:00
- 📍 Venue: GR04, English Faculty Building, 9 West Road, Sidgwick Site and online https://cam-ac-uk.zoom.us/j/97599459216?pwd=QTRsOWZCOXRTREVnbTJBdXVpOXFvdz09
Abstract
Creating high-quality, diverse, large-scale datasets remains a critical and time-consuming challenge in improving LLM capabilities. Motivated by prior work that manually identifies implicit signals in raw corpora, we aim to address this challenge by investigating data repurposing strategies, a methodology for automatically transforming existing data resources into new formats and purposes. First, we propose reverse instructions to build an English instruction-following dataset by synthetically generating instructions for a given human-written corpus document. The model trained with our synthetic dataset performs significantly better than other instruction-following models, especially in long-form generation. Next, in MURI, we extend this approach to 200 languages to create a culturally inclusive, native dataset and multilingual instruction-following models for very low-resource languages. Finally, we customize this approach to generate any downstream dataset in CRAFT, targeting unannotated corpora to synthesize custom downstream task examples by retrieving and rewriting corpus documents using few-shot examples. Our experiments demonstrate that this approach can generate large-scale datasets for any given task, showing up to a 25% improvement in tasks such as biology QA and summarization compared to few-shot settings.
Bio: Abdullatif Köksal is a final-year ELLIS PhD student at CIS, LMU Munich, and LTL, University of Cambridge, supervised by Prof. Hinrich Schütze and Prof. Anna Korhonen. His research focuses on improving LLM capabilities through effective data utilization and synthetic data generation. He has proposed several works on data repurposing by restructuring and augmenting existing data resources, including reverse instructions for long-form instruction tuning and a culturally respectful multilingual instruction-following dataset covering 200 languages. He expanded these approaches to dataset generation for downstream tasks through better corpus mining with LLMs in CRAFT. He has also worked in other areas such as counterfactuality, robustness, and multilinguality, and has published multiple papers in top-tier NLP venues. He interned at Google and Amazon, where he worked on counterfactuality and faithfulness.
Series
This talk is part of the Language Technology Lab Seminars series.
Included in Lists
- bld31
- Cambridge Centre for Data-Driven Discovery (C2D3)
- Cambridge Forum of Science and Humanities
- Cambridge Language Sciences
- Cambridge talks
- Chris Davis' list
- Guy Emerson's list
- Interested Talks
- Language Sciences for Graduate Students
- Language Technology Lab Seminars
- ndk22's list
- ob366-ai4er
- rp587
- Simon Baker's List
- Trust & Technology Initiative - interesting events
- yk449