Tomek Korbak

Karma: 729

Senior Research Scientist at UK AISI working on AI control

https://tomekkorbak.com/

A sketch of an AI control safety case

Tomek Korbak, joshc, Benjamin Hilton, Buck and Geoffrey Irving

Jan 30, 2025, 5:28 PM

57 points

0 comments · 5 min read · LW link

Eliciting bad contexts

Geoffrey Irving, Joseph Bloom and Tomek Korbak

Jan 24, 2025, 10:39 AM

31 points

8 comments · 3 min read · LW link

Automation collapse

Geoffrey Irving, Tomek Korbak and Benjamin Hilton

Oct 21, 2024, 2:50 PM

72 points

9 comments · 7 min read · LW link

Tomek Korbak Oct 25, 2023, 8:24 PM
3 points
0
in reply to: johnswentworth’s comment on: Compositional preference models for aligning LMs
Fair point, I’m using “compositional” in an informal sense different from the one in formal semantics, closer to what I called “trivial compositionality” in this paper. But I’d argue it’s not totally crazy to call such preference models compositional, and that compositionality here still bears some resemblance to Montague’s account of compositionality as homomorphism: basically, you have get_total_score(response) == sum([get_score(attribute) for attribute in decompose(response)])
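To make the identity above concrete, here is a minimal Python sketch. The function names come from the comment itself; the two toy attributes, their extractors, and the weights are hypothetical stand-ins and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# An attribute is a (name, value) pair; names and weights below are hypothetical.
Attribute = Tuple[str, float]

WEIGHTS: Dict[str, float] = {"is_polite": 1.0, "is_concise": 0.5}


def decompose(response: str) -> List[Attribute]:
    """Toy extractors standing in for per-attribute scorers (assumption)."""
    lowered = response.lower()
    is_polite = 1.0 if ("please" in lowered or "thanks" in lowered) else 0.0
    is_concise = 1.0 if len(response.split()) <= 10 else 0.0
    return [("is_polite", is_polite), ("is_concise", is_concise)]


def get_score(attribute: Attribute) -> float:
    """Score one attribute independently of all the others."""
    name, value = attribute
    return WEIGHTS[name] * value


def get_total_score(response: str) -> float:
    """The total score is just the sum of the per-attribute scores."""
    return sum(get_score(attribute) for attribute in decompose(response))
```

The point of the sketch is only that the score of the whole is a fixed function (here, a sum) of the scores of the parts, which is the informal sense of "compositional" used in the comment.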

Compositional preference models for aligning LMs

Tomek Korbak

Oct 25, 2023, 12:17 PM

18 points

2 comments · 5 min read · LW link

Towards Understanding Sycophancy in Language Models

Ethan Perez, mrinank_sharma, Meg and Tomek Korbak

Oct 24, 2023, 12:30 AM

66 points

0 comments · 2 min read · LW link

(arxiv.org)

Tomek Korbak Oct 17, 2023, 11:45 AM
2 points
0
on: [Paper] All’s Fair In Love And Love: Copy Suppression in GPT-2 Small
Cool work! Reminds me a bit of my submission to the inverse scaling prize: https://tomekkorbak.com/2023/03/21/repetition-supression/

Paper: LLMs trained on “A is B” fail to learn “B is A”

lberglund, Owain_Evans, Meg, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland and Tomek Korbak

Sep 23, 2023, 7:55 PM

121 points

74 comments · 4 min read · LW link

(arxiv.org)

Paper: On measuring situational awareness in LLMs

Owain_Evans, Daniel Kokotajlo, Mikita Balesni, Tomek Korbak, Asa Cooper Stickland, Meg and Maximilian Kaufmann

Sep 4, 2023, 12:54 PM

109 points

16 comments · 5 min read · LW link

(arxiv.org)

Imitation Learning from Language Feedback

Jérémy Scheurer, Tomek Korbak and Ethan Perez

Mar 30, 2023, 2:11 PM

71 points

3 comments · 10 min read · LW link

Tomek Korbak Mar 27, 2023, 5:13 PM
1 point
0
in reply to: platers’s comment on: Pretraining Language Models with Human Preferences
In practice, I think using a trained reward model (as in RLHF), not fixed labels, is the way forward. Then the cost of acquiring the reward model is the same as in RLHF; the difference is primarily that PHF typically needs many more calls to the reward model than RLHF.

Tomek Korbak Mar 21, 2023, 11:47 AM
7 points
2
on: Remarks 1–18 on GPT (compressed)
Thanks, I found the post quite stimulating. Some questions and thoughts:
1. Are LLM dynamics ergodic? I.e., is the time average $P^{\infty}$ equal to $\lim_{N \to \infty} \frac{1}{N} \sum_{n}^{N} \pi_{n}(0)$, the average page vector?
2. One potential issue with this formalisation is that you always assume a prompt of size $k$ (so you need to introduce artificial “null tokens” if the prompt is shorter) and you don’t give special treatment to the token <|endoftext|>. For me, it would be more intuitive to consider LLM dynamics in terms of finite, variable-length, token-level Markov chains (running until <|endoftext|>). While a fixed block size is actually used during training, the LLM is incentivised to disregard anything before <|endoftext|>. So these two prompts should induce the same distribution: “Document about cats.<|endoftext|>My name is” and “Document about dogs.<|endoftext|>My name is”. Your formalisation doesn’t account for this symmetry.
3. Dennett is spelled with “tt”.
4. Note that a softmax-based LLM will always put non-zero probability on every token. So there are no strictly absorbing states. You’re careful enough to define absorbing states as “once you enter, you are unlikely to ever leave”, but then your toy Waluigi model is implausible. A Waluigi can always switch back to a Luigi.
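Points 2 and 4 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: `effective_context` is a hypothetical helper encoding the idea that the state of the chain is whatever follows the last <|endoftext|>, and the word-level "tokens" and logit values are made up.

```python
import math
from typing import List

EOT = "<|endoftext|>"


def effective_context(tokens: List[str]) -> List[str]:
    """Keep only the tokens after the last <|endoftext|> (hypothetical helper)."""
    if EOT not in tokens:
        return list(tokens)
    last = len(tokens) - 1 - tokens[::-1].index(EOT)
    return tokens[last + 1:]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of finite logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Point 2: the two prompts share the same effective context.
cats = ["Document", "about", "cats", ".", EOT, "My", "name", "is"]
dogs = ["Document", "about", "dogs", ".", EOT, "My", "name", "is"]
same_state = effective_context(cats) == effective_context(dogs)

# Point 4: every token keeps strictly positive probability
# (mathematically always; in floating point, for moderately sized logits).
probs = softmax([10.0, 0.0, -10.0])
```

Under this view, a state can be made arbitrarily unlikely to leave but never strictly absorbing, since each `probs` entry stays above zero.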

Tomek Korbak Mar 4, 2023, 1:04 PM
1 point
0
in reply to: Evan R. Murphy’s comment on: Pretraining Language Models with Human Preferences
I don’t remember where I saw that, but something as dumb as subtracting the embedding of <|bad|> might even work sometimes.

Tomek Korbak Mar 4, 2023, 1:01 PM
2 points
−1
in reply to: Evan R. Murphy’s comment on: Pretraining Language Models with Human Preferences
That’s a good point. But if you’re using a distilled, inference-bandwidth-optimised RM, annotating your training data might take only a fraction of the compute needed for pretraining.

Also, the cost of annotation is constant and can be amortized over many training runs. PHF shares an important advantage of offline RL over online RL approaches (such as RLHF): being able to reuse feedback annotations across experiments. If you already have an annotated dataset, running a hyperparameter sweep on it is as cheap as standard pretraining and, in contrast with RLHF, you don’t need to recompute rewards.

Tomek Korbak Feb 28, 2023, 10:53 AM
8 points
0
in reply to: Logan Riggs’s comment on: Pretraining Language Models with Human Preferences
For filtering, we kept the 25% of documents with the best scores, so we effectively trained for 4 epochs.

(We had different thresholds for filtering and conditional training; note that we filter at the document level but condition at the sentence level.)
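A rough Python sketch of the two granularities and of the epoch arithmetic follows. It is not the paper's implementation: the function names, the "higher score is better" convention, and the example threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

GOOD, BAD = "<|good|>", "<|bad|>"


def filter_documents(docs: List[Tuple[str, float]], keep_fraction: float) -> List[str]:
    """Document-level filtering: keep the best-scoring fraction of documents.

    Assumes a higher score means a more preferred document.
    """
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return [text for text, _ in ranked[:n_keep]]


def tag_sentences(sentences: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Sentence-level conditioning: prepend a control token to each sentence."""
    return [(GOOD if score >= threshold else BAD) + text for text, score in sentences]


def epochs_for_fixed_budget(keep_fraction: float) -> float:
    """With a fixed token budget, keeping a fraction f means 1/f passes over the data."""
    return 1.0 / keep_fraction


docs = [(f"doc{i}", float(i)) for i in range(8)]
kept = filter_documents(docs, keep_fraction=0.25)
tagged = tag_sentences([("Nice day.", 0.9), ("Rude remark.", 0.1)], threshold=0.5)
epochs = epochs_for_fixed_budget(0.25)
```

Keeping a quarter of the data while holding the token budget fixed is what yields the 4 effective epochs mentioned above.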

Tomek Korbak Feb 24, 2023, 2:52 PM
2 points
0
in reply to: cherrvak’s comment on: Pretraining Language Models with Human Preferences
Good question! We’re not sure. The fact that PHF scales well with dataset size might provide weak evidence that it would scale well with model size too.

Tomek Korbak Feb 23, 2023, 6:17 PM
1 point
0
in reply to: Insub’s comment on: Pretraining Language Models with Human Preferences

I’m guessing that poison-pilling the <|bad|> sentences would have a negative effect on the <|good|> capabilities as well?

That would be my guess too.

Tomek Korbak Feb 23, 2023, 6:05 PM
LW: 4 AF: 2
0
AF
in reply to: Robert_AIZI’s comment on: Pretraining Language Models with Human Preferences

Have you tested the AI’s outputs when run in <|bad|> mode instead of <|good|> mode?

We did; LMs tend to generate toxic text when conditioned on <|bad|>. Though we tended to use risk-averse thresholds, i.e. we used <|good|> for only about the 5% safest sentences and <|bad|> for the remaining 95%. So <|bad|> is not bad all the time.

Here it would be helpful to know what the AI produces when prompted by <|bad|>.

That’s a good point. We haven’t systematically investigated differences in capabilities between <|good|> and <|bad|> modes; I’d love to see that.

Just before public release, one could delete the <|bad|> token from the tokenizer and the model parameters, so switching to evil mode would require rediscovering that token embedding.

Yeah, you could even block the entire direction in activation space corresponding to the embedding of the <|bad|> token.
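One possible reading of "blocking a direction" is to project it out of every activation vector. The Python sketch below shows that projection on plain lists; it is an assumption about what such blocking could look like, not a method from the paper, and `remove_direction` is a hypothetical helper.

```python
from typing import List


def remove_direction(activation: List[float], direction: List[float]) -> List[float]:
    """Subtract from `activation` its component along `direction`.

    The result is orthogonal to `direction`, so no information along that
    direction (e.g. a <|bad|> embedding, by assumption) survives.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")
    coeff = sum(a * d for a, d in zip(activation, direction)) / norm_sq
    return [a - coeff * d for a, d in zip(activation, direction)]


bad_direction = [0.0, 2.0, 0.0]   # made-up stand-in for the <|bad|> embedding
activation = [3.0, 4.0, 1.0]      # made-up activation vector
blocked = remove_direction(activation, bad_direction)
```

Here `blocked` has zero dot product with `bad_direction`, while the components along the other axes are unchanged.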

Pretraining Language Models with Human Preferences

Tomek Korbak, Sam Bowman and Ethan Perez

Feb 21, 2023, 5:57 PM

135 points

20 comments · 11 min read · LW link · 2 reviews

Tomek Korbak Nov 22, 2022, 11:25 AM
3 points
in reply to: evhub’s comment on: RL with KL penalties is better seen as Bayesian inference
fixed, thanks!