If a “wrappermind” is just something that pursues a consistent set of values in the limit of absolute power, I’m not sure how we’re supposed to avoid such things arising. Suppose the AI that takes over the world does not hard-optimize over a goal, instead soft-optimizing or remaining not fully decided between a range of goals (and suppose humanity survives this AI’s takeover). What stops someone from building a wrappermind after such an AI has taken over? It seems like, if you understood the AI’s value system, it would be pretty easy to construct a hard optimizer whose optimum is something that AI can be convinced to find acceptable. As soon as your optimizer figures out how to do that, it can go on its merry way approaching its optimum.
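To make the structure of that argument concrete, here is a minimal toy sketch (my own illustration, not anything from the post): a hard optimizer that maximizes its proxy goal only over outcomes the incumbent AI would sign off on. The functions `proxy_score` and `guardian_accepts`, and the outcome fields, are all hypothetical placeholders.

```python
# Toy sketch of "hard-optimize, but only over outcomes the incumbent AI accepts".
# Everything here is a stand-in, not a model of any real system.

def proxy_score(outcome: dict) -> float:
    # The wrappermind's own objective (e.g. paperclips produced).
    return outcome.get("paperclips", 0.0)

def guardian_accepts(outcome: dict) -> bool:
    # The incumbent AI's acceptability check, as modeled by the wrappermind.
    return outcome.get("humans_flourishing", 0.0) >= 1.0

def best_acceptable(candidate_outcomes: list[dict]) -> dict | None:
    # Hard-optimize the proxy, restricted to outcomes that pass the check.
    acceptable = [o for o in candidate_outcomes if guardian_accepts(o)]
    return max(acceptable, key=proxy_score, default=None)

if __name__ == "__main__":
    candidates = [
        {"paperclips": 10.0, "humans_flourishing": 0.0},  # rejected by the guardian
        {"paperclips": 7.0, "humans_flourishing": 1.2},   # accepted
        {"paperclips": 3.0, "humans_flourishing": 2.0},   # accepted
    ]
    print(best_acceptable(candidates))  # picks the 7-paperclip outcome
```

The point of the sketch is that the optimizer’s quality is entirely hostage to `guardian_accepts`: if that check can be satisfied by a Goodharted outcome, the wrappermind finds it.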
In order to prevent this from happening, an AI must be able to detect when something is wrong. It must be able to recognize these kinds of Goodhart outcomes, without fail and in potentially adversarial circumstances, and robustly deem them unacceptable. But if your AI can do that, then any particular outcome it can be convinced to accept must not be a nightmare scenario. And therefore a “wrappermind” whose optimum falls within this acceptable space would not be so bad.
In other words, if you know how to stop wrapperminds, you know how to build a good wrappermind.