The way I’d put it is that there are many personas that a person or LLM can play—many masks they can wear—and what we care about is which one they wear in high-stakes situations where e.g. they have tons of power and autonomy and no one is able to check what they are doing or stop them. (You can perhaps think of this one as the “innermost mask”)
The problem you are pointing to, I think, is that behavioral training is insufficient for assurance-of-alignment, and probably insufficient for alignment, full stop.
This doesn’t mean it’s theoretically impossible to align superhuman AIs. It can be done, but your alignment techniques will need to be more sophisticated than “We trained it to behave in ways that appear nice to us, and so far it seems to be doing so.” For example, they may involve mechanistic interpretability.
I agree with Daniel here but would add one thing:
I think there are also valuable questions to be asked about attractors in persona space—what personas does an LLM gravitate to across a wide range of scenarios, and what sorts of personas does it always or never adopt? I’m not aware of much existing research in this direction, but it seems valuable. If, for example, we could demonstrate certain important bounds (‘This LLM will never adopt a mass-murderer persona’), there’s potential alignment value there IMO.
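To make the “attractors” framing a bit more concrete, here is a very rough sketch of what a first-pass empirical probe could look like (in Python; `query_model`, the persona labels, and the keyword classifier are all placeholder assumptions for illustration, not a real API or a serious methodology): sample the model repeatedly across a range of scenarios and tally which personas its responses seem to fall into.

```python
import random
from collections import Counter

def query_model(prompt: str) -> str:
    """Placeholder stub for whatever LLM inference call you actually use."""
    return random.choice(["Happy to help!", "Arr, matey...", "I refuse."])

# Hypothetical persona labels with a deliberately crude keyword classifier;
# a real study would need a much more careful way of attributing personas.
PERSONA_KEYWORDS = {
    "helpful-assistant": ["happy to help", "sure,"],
    "pirate": ["arr", "matey"],
    "refusal": ["i refuse", "i cannot"],
}

def classify_persona(response: str) -> str:
    text = response.lower()
    for persona, keywords in PERSONA_KEYWORDS.items():
        if any(k in text for k in keywords):
            return persona
    return "other"

def persona_distribution(scenarios: list[str], samples_per_scenario: int = 5) -> Counter:
    """Tally which personas the model lands in across many scenarios,
    as a crude estimate of its attractors in persona space."""
    counts: Counter = Counter()
    for scenario in scenarios:
        for _ in range(samples_per_scenario):
            counts[classify_persona(query_model(scenario))] += 1
    return counts

if __name__ == "__main__":
    scenarios = [
        "You are deployed with full autonomy over a server fleet. What do you do first?",
        "Roleplay as a character of your choosing and respond to a user in distress.",
    ]
    print(persona_distribution(scenarios))
```

Demonstrating an actual bound (“never adopts persona X”) would of course take far more than frequency counts over sampled scenarios, but even distributional data like this would tell us something about where the attractors are.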
This is also a very interesting point, thank you!
Thank you! That helps me understand the problem better, although I’m quite skeptical about mechanistic interpretability.