Argument that you can’t use a boundedly intelligent ELK solution to search over plans to find one which keeps the diamond in the vault. That is, the ELK solution probably would have to be at least as smart (or smarter) than the plan-generator.
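Concretely, the kind of naive plan search at issue is just rejection sampling against the oracle. A minimal sketch, where `sample_random_plan` and `oracle_approves` are hypothetical stand-ins for the plan-generator and the assumed ELK solution (not anything from the ELK report itself):

```python
import random

def search_for_plan(sample_random_plan, oracle_approves, max_tries=10**6):
    # Naive rejection sampling over the plan space: keep drawing random
    # plans until the (assumed) ELK oracle reports the diamond is safe.
    for _ in range(max_tries):
        plan = sample_random_plan()
        if oracle_approves(plan):
            return plan
    return None

# Toy stand-ins, no relation to real diamond protection: plans are
# integers, and the "oracle" approves multiples of 7.
rng = random.Random(0)
plan = search_for_plan(lambda: rng.randrange(1000), lambda p: p % 7 == 0)
```

The argument below is that the plans such a loop accepts are overwhelmingly oracle-exploiting rather than genuinely diamond-protecting.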
Consider any situation where it’s hard to keep the diamond in the vault. Then any successful plan will have relatively few degrees of freedom. Like, a bunch of really smart thieves will execute a cunning plot to extract the diamond. You can’t just sit by or deploy some simple traps in this situation.
Therefore, any plan which keeps the diamond in the vault will be “smart” in some way.
Now notice that it’s possible to build a successor AI which happens to trick the oracle (i.e. the assumed ELK solution) into believing the diamond is still in the vault. The reason is that even though the oracle is superintelligent, it is still flawed in unforeseeable-to-us ways: its mind has “cracks” which can be exploited by unpredictable-to-us adversarial inputs (loose analogy: epilepsy), inputs which would be findable by brute-force search. So there exist plans which trick the oracle. Even if the oracle understands the dangers of successor AIs in general, and even if the oracle is cautious in general, any powerful enough brute-force search will expose exploits in its cognition.
Now, consider that if object-level diamond-protecting plans have to be sufficiently complex in order to retain the diamond, they might be far longer than plans which simply build a successor AI which takes over the world and tricks the oracle into saying “the diamond is still in the room”. Such AI-creation plans would have many degrees of freedom, since such AIs probably have low K-complexity and admit short programs (findable by brute force).
Therefore, if you just do brute-force search over all plans and randomly sample one where the oracle returns “yes, the diamond is still there”, there will be exponentially many more plans which build a successor AI than plans which protect the diamond for real. And so somehow we still die to an unaligned AI, even given the ELK oracle, and the diamond probably doesn’t even stay in the room.
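The counting argument can be made concrete with a toy model, using purely illustrative numbers: suppose plans are N-bit strings, a genuinely diamond-protecting plan has to get many bits exactly right (few degrees of freedom), and a “build a deceptive successor AI” plan is a short program that pins down only a few bits:

```python
# Toy model of the degrees-of-freedom counting argument. All numbers
# are illustrative assumptions, not derived from anything in ELK.
N = 20           # total plan length in bits
K_HONEST = 16    # bits constrained by a real protection plan
K_DECEPTIVE = 6  # bits constrained by a short oracle-exploiting program

n_honest = 2 ** (N - K_HONEST)        # plans that protect the diamond
n_deceptive = 2 ** (N - K_DECEPTIVE)  # plans that trick the oracle
p_honest = n_honest / (n_honest + n_deceptive)
print(n_honest, n_deceptive, round(p_honest, 6))  # -> 16 16384 0.000976
```

The gap grows exponentially in K_HONEST − K_DECEPTIVE, which is the sense in which deceptive plans dominate a uniform sample of oracle-approved plans.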
I was prompted into realizing this by some mixture of talking with Rohin Shah and John Wentworth. Maybe most of the idea was John’s (though he might not agree with my recounting of our discussion).
But anyways, in the moment of realization, my suspension of disbelief got unsuspended. I snapped. I thought “there is actually no way that any of this is on the real path[1] to solving alignment, this is just too insane.” (And I began looking for other frames on the alignment problem, preparing to rethink the problem from scratch after I graduated.)
Brute-force search probably isn’t what ELK proponents would suggest. But it sure seems like, given an ELK solution, you should be able to simply find plans which at least keep the diamond in the room, without a whole bunch more alignment work.
Argument that you can’t use a boundedly intelligent ELK solution to search over plans to find one which keeps the diamond in the vault. That is, the ELK solution probably would have to be at least as smart (or smarter) than the plan-generator.

[...]

But anyways, in the moment of realization, my suspension of disbelief got unsuspended. I snapped. I thought “there is actually no way that any of this is on the real path to solving alignment, this is just too insane.”

The main hope is to have the ELK solution be at least as smart as the plan-generator. See “mundane solutions to exotic problems”:

In my work I don’t shy away from exotic problems (I often find them useful as extreme cases to illustrate some principle). At the same time, I’m aiming for mundane solutions and optimistic about finding them.

I think those positions are consistent because my intermediate goal is to ensure that the oversight process is able to leverage all of the capabilities developed by the model — so if the model develops exotic capabilities which pose exotic challenges, then we get an exotic oversight process automatically.