Tomek Korbak

Karma: 729

Senior Research Scientist at UK AISI working on AI control

https://tomekkorbak.com/

A sketch of an AI control safety case

Tomek Korbak, joshc, Benjamin Hilton, Buck and Geoffrey Irving

Jan 30, 2025, 5:28 PM

57 points

0 comments, 5 min read

Eliciting bad contexts

Geoffrey Irving, Joseph Bloom and Tomek Korbak

Jan 24, 2025, 10:39 AM

31 points

8 comments, 3 min read

Automation collapse

Geoffrey Irving, Tomek Korbak and Benjamin Hilton

Oct 21, 2024, 2:50 PM

72 points

9 comments, 7 min read

Compositional preference models for aligning LMs

Tomek Korbak

Oct 25, 2023, 12:17 PM

18 points

2 comments, 5 min read

Towards Understanding Sycophancy in Language Models

Ethan Perez, mrinank_sharma, Meg and Tomek Korbak

Oct 24, 2023, 12:30 AM

66 points

0 comments, 2 min read

(arxiv.org)

Paper: LLMs trained on “A is B” fail to learn “B is A”

lberglund, Owain_Evans, Meg, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland and Tomek Korbak

Sep 23, 2023, 7:55 PM

121 points

74 comments, 4 min read

(arxiv.org)

Paper: On measuring situational awareness in LLMs

Owain_Evans, Daniel Kokotajlo, Mikita Balesni, Tomek Korbak, Asa Cooper Stickland, Meg and Maximilian Kaufmann

Sep 4, 2023, 12:54 PM

109 points

16 comments, 5 min read

(arxiv.org)

Imitation Learning from Language Feedback

Jérémy Scheurer, Tomek Korbak and Ethan Perez

Mar 30, 2023, 2:11 PM

71 points

3 comments, 10 min read

Pretraining Language Models with Human Preferences

Tomek Korbak, Sam Bowman and Ethan Perez

Feb 21, 2023, 5:57 PM

135 points

20 comments, 11 min read, 2 reviews

RL with KL penalties is better seen as Bayesian inference

Tomek Korbak and Ethan Perez

May 25, 2022, 9:23 AM

114 points

17 comments, 12 min read