Beware Goodhart’s Law: you’re setting rules of the game that the “disciple AI” has an incentive to subvert.
Yes, that’s always the risk. But here it’s the master AI checking that the disciple AI would likely behave; so, for instance, it would not give the disciple more optimization power than itself if that were a risk.
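To make the arrangement concrete, here is a minimal toy sketch in Python, assuming (purely for illustration) that “optimization power” can be summarized as a single number and that the master’s requirements can be written as simple predicates; none of the names, rules, or numbers below come from the proposal itself.

```python
# Toy sketch of the master/disciple arrangement under discussion.
# All names and thresholds are illustrative assumptions, not part of the
# original proposal: the "master" only delegates if the disciple's plan
# passes every requirement, including a cap on optimization power.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Plan:
    description: str
    optimization_power: float  # hypothetical scalar measure of capability
    actions: List[str]


# A requirement is just a human-readable label plus a predicate over plans.
Requirement = Tuple[str, Callable[[Plan], bool]]


def master_approves(plan: Plan, master_power: float,
                    requirements: List[Requirement]) -> bool:
    """Approve the plan only if it satisfies every stated rule."""
    # Rule from the dialogue: never grant the disciple more optimization
    # power than the master itself has.
    if plan.optimization_power > master_power:
        return False
    return all(check(plan) for _, check in requirements)


requirements: List[Requirement] = [
    ("no self-modification", lambda p: "self-modify" not in p.actions),
    ("stays in its sandbox", lambda p: "escape sandbox" not in p.actions),
]

plan = Plan("optimize factory output", optimization_power=0.8,
            actions=["reschedule machines", "reorder inventory"])

# A master with power 1.0 approves this plan; one with power 0.5 would not.
print(master_approves(plan, master_power=1.0, requirements=requirements))
```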
That just pushes the risk back to the master. But every requirement is a wish (including a seemingly proved friendly utility function). These requirements (if made rigorous) seem much less vulnerable than most. Do you feel they have specific flaws?
Not quite. The risk is in the choice of the wish, even if there is no risk in its implementation. “Master” implements the wish by ensuring its rules will be followed, but it doesn’t morally evaluate the wish. The fundamental problem with wishes is that, even when one is followed literally as stated, many “loopholes” (morally abhorrent courses of action) remain open without breaking its rules.
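A small illustrative sketch of that loophole worry, assuming a hypothetical “cure the patient” wish encoded as literal rules; the wish, rules, and plans are invented for this example and are not taken from the dialogue.

```python
# Toy illustration of the loophole worry: both plans below satisfy the
# wish exactly as stated, because the rules only encode the letter of the
# wish, not its intent.

wish_rules = [
    ("patient no longer has the disease", lambda plan: plan["disease_cured"]),
    ("finished within a week",            lambda plan: plan["days"] <= 7),
]

plans = [
    {"name": "administer the standard treatment",
     "disease_cured": True, "days": 5, "patient_alive": True},
    # The loophole: the disease is gone and the deadline is met,
    # but only because the patient is dead.
    {"name": "euthanize the patient",
     "disease_cured": True, "days": 1, "patient_alive": False},
]

for plan in plans:
    satisfies_wish = all(check(plan) for _, check in wish_rules)
    print(f'{plan["name"]}: satisfies the wish as stated -> {satisfies_wish}')

# Both lines print True: the rules, taken literally, cannot tell the plans
# apart, because "patient_alive" was never written into the wish.
```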
“But every requirement is a wish (including a seemingly proved friendly utility function).”
The difference is that a Friendly goal is specifically constructed to counteract the problem of unnoticed moral consequences: it has the ability to actually morally evaluate the consequences, unlike other wishes, which evaluate something else and whose suitability is judged by mere humans who can’t take all the consequences into account.
I don’t really see the difference. With standard wishes, we wonder whether we’ve really captured what we want the wish to capture; with a friendly utility, we wonder whether we’ve really captured the morality we wanted.
A perfect friendly utility is going to be better than a perfect wish, but it’s not clear which imperfect version is better, and a friendly utility is also much harder to construct.