We do have a poor understanding of human values. That’s one more reason we shouldn’t and probably won’t try to build them into AGI.
You’re expressing a common view among the alignment community. I think we should update from that view to the more likely scenario in which we don’t even try to align AGI to human values.
What we’re actually doing is training LLMs to answer questions as they were intended, and to follow instructions as they were intended. The AI needs to understand human values to some degree to do that, but training is really focused on those things. There’s an interesting bit on this distinction between the theory and practice of training LLMs in this interview with Tan Zhi Xuan, and, to a lesser degree, in their paper.
Not only is that what we’re doing for current AI; I think it’s also what we should do for future AGI, and what we probably will do. Instruction-following AGI is easier and more likely than value-aligned AGI.
It’s counterintuitive to think about a highly intelligent agent that wants to do what someone else tells it. But it’s not logically incoherent.
And when the first human decides what goal to put in the system prompt of the first agent they think might ultimately surpass human competence and intelligence, there’s little doubt what they’ll put there: “follow my instructions, favoring the most recent”. Everything else is a subgoal of that non-consequentialist central goal.
This approach leaves humans in charge, and that’s a problem. Ultimately I think that sort of instruction-following intent alignment can be a stepping-stone to value alignment, once we’ve got a superintelligent instruction-following system to help us with that very difficult problem. But there’s neither a need nor an incentive to aim directly at that with our first AGIs. So alignment will succeed or fail on other issues.
Separately, I fully agree that most people who don’t believe in AGI x-risk aren’t stating their true rejection. Usually their real objection is that they don’t believe we’ll make autonomous AGI soon enough to worry about it.