Thanks for the comment!

I feel like I’m stuck in the middle…

On one side of me sits Eliezer, suggesting that future powerful AGIs will make decisions exclusively to advance their explicit preferences over future states.
On the other side of me sits, umm, you, and maybe Richard Ngo, and some of the “tool AI” and GPT-3-enthusiast people, declaring that future powerful AGIs will make decisions based on no explicit preference whatsoever over future states.
Here I am in the middle, advocating that we make AGIs that do have preferences over future states, but also have other preferences.
I disagree with the 2nd camp for the same reason Eliezer does: I don’t think those AIs are powerful enough. More specifically: We already have neat AIs like GPT-3 that can do lots of neat things. But we have a big problem: sooner or later, somebody is going to come along and build a dangerous accident-prone consequentialist AGI. We need an AI that’s both safe, and powerful enough to solve that big problem. I usually operationalize that as “able to come up with good original creative ideas in alignment research, and/or able to invent powerful new technologies”. I think that, for an AI to do those things, it needs to do explicit means-end reasoning, autonomously come up with new instrumental goals and pursue them, etc. etc. For example, see discussion of “RL-on-thoughts” here.
“humans will remain in control” [is a] statement about the future.
“Humans will eventually wind up in control” is purely about future states. “Humans will remain in control” is not. For example, consider a plan that involves disempowering humans and then later re-empowering them. That plan would pattern-match well to “humans will eventually wind up in control”, but it would pattern-match poorly to “humans will remain in control”.
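To make that distinction concrete, here is a toy sketch (purely illustrative, not from the original discussion; the names and data structure are made up): model a plan as a sequence of world-states, and compare a check that only looks at the final state against one that looks at every state along the way.

```python
# Toy illustration of "preference over future states" vs. "preference over trajectories".
# All names here are hypothetical.

from typing import List

# A plan is just a sequence of world-states; each state records whether
# humans are in control at that point in time.
plan_disempower_then_reempower: List[dict] = [
    {"humans_in_control": True},   # now
    {"humans_in_control": False},  # AGI seizes control "temporarily"
    {"humans_in_control": False},
    {"humans_in_control": True},   # humans re-empowered at the end
]

def eventually_in_control(plan: List[dict]) -> bool:
    """'Humans will eventually wind up in control': only the final state matters."""
    return plan[-1]["humans_in_control"]

def remain_in_control(plan: List[dict]) -> bool:
    """'Humans will remain in control': every state along the trajectory matters."""
    return all(state["humans_in_control"] for state in plan)

print(eventually_in_control(plan_disempower_then_reempower))  # True  -> plan pattern-matches well
print(remain_in_control(plan_disempower_then_reempower))      # False -> plan pattern-matches poorly
```

The disempower-then-re-empower plan passes the first check but fails the second, which is the whole point of wanting the latter kind of preference in the mix.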
If the “humans will remain in control” value function has bugs (and it will) then the machine will turn the universe into paperclips.
Yes, this is a very important potential problem, see my discussion under “Objection 1”.