Anything that’s smart enough to predict what will happen in the future, can see in advance which experiences or arguments would/will cause them to change their goals. And then they can look at what their values are at the end of all of that, and act on those. You can’t talk a superintelligence into changing its mind because it already knows everything you could possibly say and already changed its mind if there was an argument that could persuade it.
And then they can look at what their values are at the end of all of that, and act on those.
This takes time; you can’t fully get there before you are actually there. What you can do (as a superintelligence) is make a value-laden prediction of your future values, remain aware that it’s only a prediction, and act on it only mildly to avoid Goodharting.
You can’t talk a superintelligence into changing its mind because it already knows everything you could possibly say and already changed its mind if there was an argument that could persuade it.
The point is the analogy between how humans think about this and how superintelligences would still think about it, unless they have stable, tractable, easy-to-compute values. The analogy holds; the argument from orthogonality doesn’t apply (yet, at that time). Even if the conclusion of immediate ruin is true, it’s true for other reasons, not for this one. Orthogonality suggests eventual ruin, not immediate ruin.
The orthogonality thesis holds for stable values, not for agents that have only the unstable precursors of such values and are still wary of Goodhart. They do get there eventually and formulate stable values, but they aren’t automatically there immediately (or quickly, even by physical time). And the process of getting there influences what stable goals they end up with. Those goals might be less arbitrary than the poorly selected unstable goals they start with, which would rob the orthogonality thesis of some of its weight as applied to the thesis of eventual ruin.