If we don’t define “optimal” properly, it should be able to find a suitable definition on its own by imagining what we might have meant.
But it wouldn’t want to. If we mistakenly define ‘optimal’ to mean ‘really good at calculating pi’, then it won’t want to change itself to aim for our real values. It would realise that we made a mistake, but it wouldn’t want to rectify it, because the only thing it cares about is calculating pi, and helping humans isn’t going to do that.
You’re broadly on the right track; the idea of CEV (coherent extrapolated volition) is that we just tell the AI to look at humans and do what they would have wanted it to do. However, we have to actually be able to code that; it’s not going to converge on that by itself.
It would want to, because its goal is defined as “tell the truth”.
You have to differentiate between the goal we are trying to find (the optimal one) and the goal that is actually controlling what the AI does (“tell the truth”) while we are still looking for what that optimal goal could be. The optimal goal is only implemented later, once we are sure that there are no bugs.
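To make the distinction concrete, here is a minimal toy sketch in Python (my own illustration, not anything proposed in this thread): the `Agent` class, the `install_goal` method, and the goal strings are all hypothetical names. The only point is that an interim goal drives the agent’s behaviour while the candidate “optimal” goal stays outside the agent until it has been checked.

```python
# Toy sketch: the goal currently driving behaviour is kept separate from the
# candidate "optimal" goal we are still drafting. The candidate only replaces
# the interim goal after it passes whatever verification we can manage.

class Agent:
    """Hypothetical agent driven by whichever goal is currently installed."""

    def __init__(self, interim_goal):
        self.goal = interim_goal          # e.g. "tell the truth"

    def act(self, question, knowledge):
        # Behaviour is determined entirely by the installed goal.
        if self.goal == "tell the truth":
            return knowledge.get(question, "I don't know")
        raise NotImplementedError(f"No policy for goal: {self.goal!r}")

    def install_goal(self, candidate_goal, verified):
        # The candidate (the 'optimal' goal we are searching for) only
        # replaces the interim goal once we are sure there are no bugs.
        if verified:
            self.goal = candidate_goal


agent = Agent(interim_goal="tell the truth")
print(agent.act("capital of France", {"capital of France": "Paris"}))

# Until verification succeeds, the interim goal keeps controlling the agent.
agent.install_goal("maximise extrapolated human values", verified=False)
print(agent.goal)  # still "tell the truth"
```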