Looking at the A substitution, why doesn’t this argument work?
I think by “win a chess game against a grandmaster” you are specifically asking about the game itself. In real life we also have to arrange the game, stay alive until the game, etc. Let’s take all of that out of scope; it’s obviously unsafe.
If there were a list of all the possible plans that win a chess game against a grandmaster, ranked by “likely to work”, most of the plans that might actually work would route through “consequentialism” and “acquire resources”.
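As an aside, here is a toy sketch of that selection effect; the candidate plans and the numbers are invented purely for illustration. The only point is that if the oracle returns whichever plan scores highest on “likely to work”, plans that seize control of more of the situation win the ranking without anyone asking for that property.

```python
# Toy sketch of ranking candidate plans purely by estimated "likely to work".
# The plans and probabilities below are invented; the point is only that the
# argmax favors plans that control more of the situation, because those are
# the ones most robust to things going wrong.

candidate_plans = [
    ("play solid openings and hope the grandmaster blunders", 0.02),
    ("prepare a deep opening novelty in his favorite line",   0.05),
    ("pressure the grandmaster into losing on purpose",       0.60),
    ("acquire a stronger chess engine and relay its moves",   0.90),
]

def most_likely_to_work(plans):
    """Return the plan with the highest estimated probability of working."""
    return max(plans, key=lambda plan: plan[1])

print(most_likely_to_work(candidate_plans))
# -> ('acquire a stronger chess engine and relay its moves', 0.9)
```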
Now, say you build an oracle AI. You’ve done all the things to try and make it interpretable and honest and such. If you ask it for a plan to win a chess game against a grandmaster, what happens?
Well, it definitely doesn’t give you a plan like “If the grandmaster plays e4, you play e5, and then if they play d4, you play f5, …” because that plan would be far too large to spell out. I think the desired outcome is a plan like “open with pawn to d4, observe the board position, then ask for another plan”. Are Oracle AIs allowed to provide self-referential plans?
Regardless, if I’m an Oracle AI looking for the plan most likely to work, I’m now very concerned that you’ll have a heart attack, or an attack of arrogance, or otherwise mess up my perfect plan. Unlikely, sure, but I’m searching for the most “likely to work” plan here. So the actual plan I give you is “ask the grandmaster how his trip to Madrid went, then ask me for another plan”. Then the grandmaster realizes that I know about (for example) his affair and will reveal it if he wins, and he attempts to lose as gracefully as possible. Now the outcome is much more robust to stray events.
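A minimal sketch of the “short plan that ends in ‘ask me for another plan’” structure, with the oracle and the game both replaced by trivial stand-ins of my own so that the code runs on its own. On this reading the self-reference is just the caller re-querying after each observation, not one enormous contingency tree.

```python
# Toy sketch of the "plan a step, observe, ask again" loop described above.
# The oracle and the game are stand-ins (counting up to 10 instead of chess);
# only the shape of the interaction is meant to carry over.

def ask_oracle(position):
    """Stand-in oracle: returns a one-step plan whose implicit last step is
    'observe the result, then ask me for another plan'."""
    return position + 1

def game_over(position):
    return position >= 10

def play(position=0):
    while not game_over(position):
        step = ask_oracle(position)   # short plan for the current position
        position = step               # execute it and observe the new position
        # "then ask for another plan": the loop re-queries the oracle here.
    return position

print(play())  # -> 10
```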
I agree that highly agentic versions of the system will complete the task better. My claim is just that they’re not necessary to complete the task very well, and so we shouldn’t be confident that selection for completing that task very well will end up producing the highly agentic versions.
That helps, thanks. Raemon says:
“The part where alignment is hard is precisely when the thing I’m trying to accomplish is hard. Because then I need a powerful plan, and it’s hard to specify a search for powerful plans that don’t kill everyone.”
I now read you as pointing to chess as:
It is “hard to accomplish” from the perspective of human cognition.
It does not require a “powerful”/”agentic” plan.
It’s “easy” to specify a search for a good plan; we already did it (a toy sketch of such a search is below).
So maybe alignment is like that.
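To unpack “we already did it”: the win condition for a board game is fully given by its rules, so the search for a good plan can be written down directly. The sketch below is my own toy version using Nim rather than chess (chess needs depth limits and an evaluation function, which would obscure the point), but the shape is the same: the objective is nothing more than “reach a won terminal position”.

```python
# Toy version of "specifying a search for a good plan" when the goal is fully
# given by the rules of a game. The game here is Nim (take 1-3 stones; whoever
# takes the last stone wins) so the search fits in a few lines, but a basic
# chess search has the same shape: the objective is just the rules' win
# condition, with nothing further to specify.

from functools import lru_cache

@lru_cache(maxsize=None)
def value(stones):
    """+1 if the player to move can force a win, -1 otherwise."""
    if stones == 0:
        return -1   # the opponent took the last stone, so the player to move lost
    return max(-value(stones - take) for take in (1, 2, 3) if take <= stones)

def best_move(stones):
    """Choose the move that leaves the opponent in the worst position."""
    return max((take for take in (1, 2, 3) if take <= stones),
               key=lambda take: -value(stones - take))

print(best_move(21))  # -> 1, leaving 20 stones, a lost position for the opponent
```

The contrast with the quoted worry is that here the terminal condition is the entire specification; nothing outside the rules of the game needs to be pinned down.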
Yepp. And clearly alignment is much harder than chess, but it seems like an open question whether it’s harder than “kill everyone” (and even if it is, there’s an open question of how much of an advantage we get from doing our best to point the system at the former rather than the latter).
“Kill everyone” seems like it should be “easy”, because there are so many ways to do it: humans only survive in environments with a specific range of temperatures, pressures, atmospheric contents, availability of human-digestible food, &c.