Beware Goodhart’s Law: you’re setting rules of the game that the “disciple AI” has an incentive to subvert. Essentially, you’re specifying a wish, and normally your ability to evaluate a wish is constrained by your ability to consider and (morally) evaluate all the possible consequences (strategies) in detail. An AI might find a strategy that, while satisfying your wish, would be disastrous (which might win the AI a prize so insignificant it’d never rise to your attention).
Beware Goodhart’s Law: you’re setting rules of the game that the “disciple AI” has an incentive to subvert.
Yes, that’s always the risk. But here it’s the master AI checking that the disciple AI would likely behave; so, for instance, it would not give the disciple more optimization power than itself if this were a risk.
That just pushes the risk back to the master. But every requirement is a wish (including a seemingly proved friendly utility function). These requirements (if rigorized) seem much less vulnerable than most. Do you feel they have specific flaws?
Not quite. The risk is in the choice of the wish, even if there is no risk in its implementation. “Master” implements the wish by ensuring its rules will be followed, but it doesn’t morally evaluate the wish. The fundamental problem with wishes is that when one is taken literally as stated, there are still many “loopholes” (morally abhorrent courses of action) remaining within the wish, without breaking its rules.
But every requirement is a wish (including a seemingly proved friendly utility function).
The difference is that a Friendly goal is specifically constructed to counteract the problem of unnoticed moral consequences, because it has the ability to actually morally evaluate the consequences, unlike other wishes that evaluate something else, and whose suitability is evaluated by mere humans who can’t take all the consequences into account.
I don’t really see the difference—with standard wishes, we wonder if we’ve really captured what we want the wish to capture, with a friendly utility, we wonder if we’ve really captured the morality we wanted.
A perfect friendly utility is going to be better than a perfect wish, but it’s not clear which imperfect version is better—a friendly utility is also much harder.
One consideration is the amount of information in the coarse-graining measures: we could set it up so there are more measurements made than there are bits in the disciple AI’s source code. Not a guarantee of anything, of course, but Goodhart’s law mainly derives from how short the success indicator is compared with the phenomenon it’s trying to measure, which is why subverting the indicator is easier than improving the phenomenon.
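A minimal sketch of that information comparison, with entirely made-up numbers (the measurement count, bits per measurement, and source size are hypothetical, not part of the proposal):

```python
# Toy comparison of the information budget of the success indicator
# versus the disciple AI's source code (all figures are hypothetical).

def indicator_bits(num_measurements: int, bits_per_measurement: int) -> int:
    """Upper bound on the information content of the coarse-grained indicator."""
    return num_measurements * bits_per_measurement

def source_bits(source_bytes: int) -> int:
    """Upper bound on the information content of the disciple's source code."""
    return source_bytes * 8

# Hypothetical numbers: 10 million coarse-grained measurements of 4 bits each,
# versus a 1 MB disciple source.
measurements = indicator_bits(num_measurements=10_000_000, bits_per_measurement=4)
source = source_bits(source_bytes=1_000_000)

# The intuition in the comment: if the indicator carries more bits than the
# program being measured, gaming it with a short exploit encoded in the source
# is (at least information-theoretically) harder than actually satisfying it.
print(f"indicator: {measurements:,} bits, source: {source:,} bits, "
      f"indicator exceeds source: {measurements > source}")
```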
It is probably worth noting here that the AI’s ability to evaluate how well its actions match your wish and the consequences you need is, in turn, limited by its own ability to evaluate the consequences of its actions (if we apply the constraint you are talking about to the AI itself). That can easily turn into a requirement to build a Maxwell’s demon, or into the AI admitting (huh..) that it is doing something about which it doesn’t know whether it will match your wish or not.