
Stuart_Armstrong

Karma: 17,958

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Mar 18, 2025, 2:48 PM
77 points
12 comments · 5 min read · LW link

Using Prompt Evaluation to Combat Bio-Weapon Research

Feb 19, 2025, 12:39 PM
11 points
2 comments · 3 min read · LW link

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Jan 31, 2025, 3:36 PM
16 points
2 comments · 2 min read · LW link

Alignment can improve generalisation through more robustly doing what a human wants—CoinRun example

Stuart_Armstrong · Nov 21, 2023, 11:41 AM
67 points
9 comments · 3 min read · LW link

How toy models of ontology changes can be misleading

Stuart_Armstrong · Oct 21, 2023, 9:13 PM
42 points
0 comments · 2 min read · LW link

Different views of alignment have different consequences for imperfect methods

Stuart_Armstrong · Sep 28, 2023, 4:31 PM
31 points
0 comments · 1 min read · LW link

Avoiding xrisk from AI doesn’t mean focusing on AI xrisk

Stuart_Armstrong · May 2, 2023, 7:27 PM
66 points
7 comments · 3 min read · LW link

What is a definition, how can it be extrapolated?

Stuart_Armstrong · Mar 14, 2023, 6:08 PM
34 points
5 comments · 7 min read · LW link

You’re not a simulation, ’cause you’re hallucinating

Stuart_Armstrong · Feb 21, 2023, 12:12 PM
25 points
6 comments · 1 min read · LW link

Large language models can provide “normative assumptions” for learning human preferences

Stuart_Armstrong · Jan 2, 2023, 7:39 PM
29 points
12 comments · 3 min read · LW link

Concept extrapolation for hypothesis generation

Dec 12, 2022, 10:09 PM
20 points
2 comments · 3 min read · LW link

Using GPT-Eliezer against ChatGPT Jailbreaking

Dec 6, 2022, 7:54 PM
170 points
85 comments · 9 min read · LW link

Benchmark for successful concept extrapolation/avoiding goal misgeneralization

Stuart_Armstrong · Jul 4, 2022, 8:48 PM
82 points
12 comments · 4 min read · LW link

Value extrapolation vs Wireheading

Stuart_Armstrong · Jun 17, 2022, 3:02 PM
16 points
1 comment · 1 min read · LW link

Georgism, in theory

Stuart_Armstrong · Jun 15, 2022, 3:20 PM
40 points
22 comments · 4 min read · LW link

How to get into AI safety research

Stuart_Armstrong · May 18, 2022, 6:05 PM
44 points
7 comments · 1 min read · LW link

GPT-3 and concept extrapolation

Stuart_Armstrong · Apr 20, 2022, 10:39 AM
19 points
27 comments · 1 min read · LW link

Concept extrapolation: key posts

Stuart_Armstrong · Apr 19, 2022, 10:01 AM
13 points
2 comments · 1 min read · LW link

AIs should learn human preferences, not biases

Stuart_Armstrong · Apr 8, 2022, 1:45 PM
10 points
0 comments · 1 min read · LW link

Different perspectives on concept extrapolation

Stuart_Armstrong · Apr 8, 2022, 10:42 AM
48 points
8 comments · 5 min read · LW link · 1 review