Plans obviously need some robustness to things going wrong, so I agree with John Wentworth, if weakly, that robustness is a necessary feature of a plan and that some verification is actually necessary.
But I also have to agree that moridinamael and Quintin Pope identify a real failure mode: perfectionism, in the sense of discarding ideas too quickly as not useful. This constraint is the essence of perfectionism:
I have an exercise where I give people the instruction to play a puzzle game (“Baba Is You”): where you would normally move around and interact with the world to experiment and learn things, you instead need to make a complete plan for solving the level, aiming to get it right on your first try.
It asks both for a complete plan to solve the whole level and for that plan to work on the first try, a combination of demands which, outside of this context, usually implies either that the problem is likely unsolvable or that you are being too perfectionist.
In particular, I think that Quintin Pope’s comment here genuinely applies to a lot of science and problem solving, and that it’s actually quite difficult to reason well about the world in general without many experiments.
I think the concern is that, if plans need some verification, it may be impossible to align smarter-than-human AGI: in order to verify an alignment plan, we’d have to actually build such an AGI, and if the plan doesn’t work (isn’t verified), that may be the end of us, with no retries possible.
There are complex arguments on both sides, so I’m not arguing this is strictly true. I just wanted to clarify that that’s the concern, and the point of asking people to solve the level on the first try. I think ultimately this is partly but not 100% true of ASI alignment, and clarifying exactly how and to what degree we can verify plans empirically is critical to the project. A plan verified on weak systems that are not autonomous, self-aware, or agentic may or may not generalize to smarter systems that have those properties. Some of the ways verification will or won’t generalize can probably be identified through careful analysis of how such systems will be functionally different.