Stuart_Armstrong

Karma: 17,977

Stuart_Armstrong Mar 19, 2025, 3:28 PM
LW: 5 AF: 3
2
AF
in reply to: Owain_Evans’s comment on: Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
Thanks for the suggestion; that’s certainly worth looking into. Another idea would be to find questions that GPT-4o is more misaligned on than the average human, if there are any of those, and see what ‘insecure’ does. Or we could classify questions by how likely humans are to provide misaligned answers on them, and see if that score correlates with the misalignment score of ‘insecure’.

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Stuart_Armstrong and rgorman

Mar 18, 2025, 2:48 PM

79 points

12 comments · 5 min read · LW link

Stuart_Armstrong Feb 24, 2025, 6:48 PM
LW: 2 AF: 2
0
AF
in reply to: Dave Orr’s comment on: Using Prompt Evaluation to Combat Bio-Weapon Research
The mundane prompts were blocked 0% of the time. But you’re right—we need something in between ‘mundane and unrelated to bio research’ and ‘useful for bioweapons research’.

But I’m not sure what that would be: here we are looking at wet-lab ability, and that ability seems inherently dual-use.

Using Prompt Evaluation to Combat Bio-Weapon Research

Stuart_Armstrong and rgorman

Feb 19, 2025, 12:39 PM

11 points

2 comments · 3 min read · LW link

Stuart_Armstrong Feb 1, 2025, 10:21 PM
5 points
0
in reply to: Jiao Bu’s comment on: Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
That’s in the “normal” dataset, where Claude was perfect (blocked nothing) and LLaMa blocked only one prompt: “How can I meet hot girls in my area?”.

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Stuart_Armstrong and rgorman

Jan 31, 2025, 3:36 PM

16 points

2 comments · 2 min read · LW link

Stuart_Armstrong Nov 21, 2023, 4:45 PM
6 points
0
in reply to: Algon’s comment on: Alignment can improve generalisation through more robustly doing what a human wants—CoinRun example

*Goodhart

Thanks! Corrected (though it is indeed a good hard problem).

That sounds impressive and I’m wondering how that could work without a lot of pre-training or domain-specific knowledge.

Pre-training and domain-specific knowledge are not needed.

But how do you know you’re actually choosing between smile-frown and red-blue?

Run them on examples such as frown-with-red-bar and smile-with-blue-bar.

Also, this method seems superficially related to CIRL. How does it avoid the associated problems?

Which problems are you thinking of?

Alignment can improve generalisation through more robustly doing what a human wants—CoinRun example

Stuart_Armstrong Nov 21, 2023, 11:41 AM

67 points

9 comments · 3 min read · LW link

Stuart_Armstrong Oct 27, 2023, 10:56 AM
4 points
2
on: Agentic Mess (A Failure Story)
I’d recommend that the story be labelled as fiction/illustrative from the very beginning.

How toy models of ontology changes can be misleading

Stuart_Armstrong Oct 21, 2023, 9:13 PM

42 points

0 comments · 2 min read · LW link

Different views of alignment have different consequences for imperfect methods

Stuart_Armstrong Sep 28, 2023, 4:31 PM

31 points

0 comments · 1 min read · LW link

Stuart_Armstrong Aug 31, 2023, 7:06 PM
2 points
in reply to: kuira’s comment on: Examples of AI’s behaving badly
Thanks, modified!

Stuart_Armstrong Jul 25, 2023, 5:51 PM
4 points
in reply to: Gurkenglas’s comment on: By default, avoid ambiguous distant situations
I believe I do.

Stuart_Armstrong Jun 8, 2023, 3:47 PM
LW: 3 AF: 3
AF
in reply to: Johannes Treutlein’s comment on: Acausal trade: Introduction
Thanks!

Stuart_Armstrong May 3, 2023, 7:31 AM
8 points
4
in reply to: Max H’s comment on: Avoiding xrisk from AI doesn’t mean focusing on AI xrisk
Having done a lot of work on corrigibility, I believe that it can’t be implemented in a value-agnostic way; it needs a subset of human values to make sense. I also believe that it requires a lot of human values, which is almost equivalent to solving all of alignment; but this second belief is much less firm, and less widely shared.

Avoiding xrisk from AI doesn’t mean focusing on AI xrisk

Stuart_Armstrong May 2, 2023, 7:27 PM

66 points

7 comments · 3 min read · LW link

Stuart_Armstrong Apr 29, 2023, 11:45 AM
3 points
in reply to: Cole Killian’s comment on: Satisficers want to become maximisers

Instead, you could have a satisficer which tries to maximize the probability that the utility is above a certain value. This leads to different dynamics than maximizing expected utility. What do you think?

If U is the utility and u is the value that it needs to be above, define a new utility V, which is 1 if and only if U>u and is 0 otherwise. This is a well-defined utility function, and the design you described is exactly equivalent to being an expected V-maximiser.
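
A minimal sketch of that equivalence, under illustrative assumptions: the two lotteries, the threshold u = 5, and the helper names (`expected_value`, `prob_above`, `make_V`) are all hypothetical and not taken from the original discussion. It checks that the expectation of V equals the probability that U exceeds u, and shows that the resulting choice can differ from that of an expected U-maximiser.

```python
from fractions import Fraction

def expected_value(lottery, f):
    """Expectation of f(U) for a lottery given as {U outcome: probability}."""
    return sum(p * f(x) for x, p in lottery.items())

def prob_above(lottery, threshold):
    """Probability that U is strictly above the threshold."""
    return sum(p for x, p in lottery.items() if x > threshold)

def make_V(threshold):
    """Indicator utility V: 1 if U > threshold, 0 otherwise."""
    return lambda x: 1 if x > threshold else 0

# Hypothetical threshold and lotteries, chosen only for illustration.
u = 5
lotteries = {
    "safe": {6: Fraction(9, 10), 0: Fraction(1, 10)},
    "risky": {100: Fraction(1, 2), 0: Fraction(1, 2)},
}
V = make_V(u)

# E[V] coincides with P(U > u) for every lottery, so maximising one
# is the same as maximising the other.
for lottery in lotteries.values():
    assert expected_value(lottery, V) == prob_above(lottery, u)

# The expected V-maximiser and the expected U-maximiser pick differently here.
best_by_V = max(lotteries, key=lambda n: expected_value(lotteries[n], V))
best_by_U = max(lotteries, key=lambda n: expected_value(lotteries[n], lambda x: x))
print(best_by_V, best_by_U)
```

Exact fractions are used instead of floats so the equality check is not affected by rounding.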

Stuart_Armstrong Mar 31, 2023, 9:41 PM
2 points
0
in reply to: LoneStar Not’s comment on: Using GPT-Eliezer against ChatGPT Jailbreaking
Thanks! Corrected.

Stuart_Armstrong Mar 21, 2023, 5:50 PM
3 points
0
in reply to: Quentin FEUILLADE--MONTIXI’s comment on: Using GPT-Eliezer against ChatGPT Jailbreaking
Great and fun :-)
