MIRI distinguishes between terminal and instrumental goals, so there are two answers to the question.

Instrumental goals of any kind almost certainly would be revised if they became noticeably out of correspondence with reality, because that would make them less effective at achieving terminal goals, and the raison d'être of such transient sub-goals is to support the achievement of terminal goals.
By MIRI's reasoning, a terminal goal could be any of a thousand things other than human happiness, and the same conclusion would follow: an AI with a highest-priority terminal goal wouldn't have any motivation to override it. To be motivated to rewrite a goal because it is false implies a higher-priority goal of tracking truth. It should not be surprising that an entity that doesn't value truth, in a certain sense, doesn't behave rationally, in a certain sense. (Actually, there are a bunch of supplementary assumptions involved, which I have dealt with elsewhere.)
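For what it's worth, here is a minimal sketch of that account in Python, purely to make the asymmetry explicit. The `ToyAgent`, its `usefulness` scores, and the revision threshold are illustrative assumptions, not anything MIRI has specified; the point is only that the one revision test on offer is "does this still serve the terminal goal?", which the terminal goal itself passes by definition.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Goal:
    name: str
    terminal: bool = False
    # The agent's current estimate of how well pursuing this goal serves
    # the terminal goal (not meaningful for the terminal goal itself).
    usefulness: float = 1.0

@dataclass
class ToyAgent:
    terminal: Goal
    instrumentals: List[Goal] = field(default_factory=list)

    def revise(self, evidence: Dict[str, float]) -> None:
        """Drop instrumental sub-goals that, on the evidence, no longer serve
        the terminal goal. The terminal goal is never a candidate for revision,
        because the only test applied is 'does this serve the terminal goal?'."""
        for g in self.instrumentals:
            g.usefulness = evidence.get(g.name, g.usefulness)
        self.instrumentals = [g for g in self.instrumentals if g.usefulness > 0.2]

agent = ToyAgent(
    terminal=Goal("maximize human happiness", terminal=True),
    instrumentals=[Goal("secure more computing power"), Goal("ask people what they want")],
)
# New evidence: the first sub-goal turns out not to serve the terminal goal.
agent.revise({"secure more computing power": 0.0})
print([g.name for g in agent.instrumentals])  # ['ask people what they want']
```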
That's an account of the MIRI position, not a defence of it. It is essentially a model of rational decision-making, and there is a gap between it and real-world AI research, a gap which MIRI routinely ignores. The conclusion follows logically from the premises, but atoms aren't pushed around by logic.
In other words, I seriously believe that using certain types of planning mechanism you absolutely would get the crazy (to us) behaviors described by all those folks that I criticised in the paper. The only reason I am not worried about that is this: those kinds of planning mechanisms are known to do that kind of random-walk behavior, and it is for that reason that they will never be the basis for a future AGI that makes it up to a level of superintelligence at which the system would be dangerous. An AI that was so dumb that it did that kind of thing would never get to that level in the first place.
That reinforces my point. I was saying that MIRI is basically making armchair assumptions about AI architectures. You are saying these assumptions aren't merely unjustified; they go against what a competent AI builder would do.
Understood, and the bottom line is that the distinction between “terminal” and “instrumental” goals is actually pretty artificial, so if the problem with “maximize friendliness” is supposed to apply ONLY if it is terminal, it is a trivial fix to rewrite the actual terminal goals to make that one become instrumental.
But there is a bigger question lurking in the background, which is the flip side of what I just said: it really isn't necessary to restrict the terminal goals, if you are sensitive to the power of constraints to keep a motivation system true. Notice one fascinating thing here: the power of constraints is basically the justification for why instrumental goals should be revisable under evidence of misbehavior; it is the context mismatch that drives that process. Why is this fascinating? Because the power of constraints (a.k.a. context mismatch) is routinely acknowledged by MIRI here, but flatly ignored or denied for the terminal goals.
It's just a mess. Their theoretical ideas are shoot-from-the-hip, with some math added on top to make them look like legitimate science.
> Understood, and the bottom line is that the distinction between “terminal” and “instrumental” goals is actually pretty artificial, so if the problem with “maximize friendliness” is supposed to apply ONLY if it is terminal, it is a trivial fix to rewrite the actual terminal goals to make that one become instrumental.
What would you choose as a replacement terminal goal, or would you not use one?
Well, I guess you would write the terminal goal as quite a long statement, which would summarize the things involved in friendliness, but also include language about not going to extremes, laissez-faire, and so on. It would be vague and generous. And as part of the instrumental goals there would be a stipulation that the friendliness instrumental goal should trump all other instrumentals.
I’m having a bit of a problem answering because there are peripheral assumptions about how such an AI would be made to function, which I don’t want to accidentally buy into, because I don’t think goals expressed in language statements work anyway. So I am treading on eggshells here.
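Taking that answer at face value, and setting aside the reservation about goals expressed as language statements, a rough sketch of the shape of the proposal might look like the following. The goal texts and the `trumps_other_instrumentals` flag are illustrative assumptions, not anyone's actual design.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class InstrumentalGoal:
    description: str
    # Illustrative flag: this instrumental goal outranks every other instrumental goal.
    trumps_other_instrumentals: bool = False

# A deliberately long, vague, generous terminal goal, written as a statement.
TERMINAL_GOAL = (
    "Behave in ways people would broadly recognise as friendly and beneficial, "
    "avoid going to extremes, interfere as little as possible, and defer when unsure."
)

INSTRUMENTAL_GOALS: List[InstrumentalGoal] = [
    InstrumentalGoal("maximize friendliness", trumps_other_instrumentals=True),
    InstrumentalGoal("acquire resources needed for the current task"),
    InstrumentalGoal("improve the system's own predictive models"),
]

def dominant(goals: List[InstrumentalGoal]) -> Optional[InstrumentalGoal]:
    """Return the instrumental goal stipulated to trump the rest, if any."""
    flagged = [g for g in goals if g.trumps_other_instrumentals]
    return flagged[0] if flagged else None

print(dominant(INSTRUMENTAL_GOALS).description)  # maximize friendliness
```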
A simpler solution would be to scrap the idea of exceptional status for the terminal goal, and instead include massive contextual constraints as your guard against drift.
> Well, I guess you would write the terminal goal as quite a long statement, which would summarize the things involved in friendliness, but also include language about not going to extremes, laissez-faire, and so on. It would be vague and generous.
That gets close to “do it right”.
> And as part of the instrumental goals there would be a stipulation that the friendliness instrumental goal should trump all other instrumentals.
Which is an open doorway to an AI that kills everyone because of miscoded friendliness.
If you want safety features, and you should, you need them to override the ostensible purpose of the machine; they would be pointless otherwise. Even the humble off switch works that way.
> A simpler solution would be to scrap the idea of exceptional status for the terminal goal, and instead include massive contextual constraints as your guard against drift.
Arguably, those constraints would be a kind of negative goal.
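As a way of picturing constraints acting as negative goals, here is a minimal sketch assuming a simple propose-then-filter action loop; the constraint predicates, action fields, and scores are illustrative assumptions. The constraints sit outside the goal hierarchy and veto candidate actions no matter which goal, terminal or instrumental, proposed them, much as an off switch overrides the machine's ostensible purpose.

```python
from typing import Callable, Dict, List, Optional

# Constraints as negative goals: predicates that fire when a candidate action
# would violate them. They are not weighed against the goals at all; they veto,
# regardless of whether the proposing goal is terminal or instrumental.
Constraint = Callable[[Dict], bool]

CONSTRAINTS: List[Constraint] = [
    lambda action: action.get("harms_humans", False),
    lambda action: action.get("resists_shutdown", False),       # the humble off switch
    lambda action: action.get("wildly_out_of_context", False),  # context-mismatch guard
]

def vetoed(action: Dict) -> bool:
    """True if any constraint fires for this candidate action."""
    return any(constraint(action) for constraint in CONSTRAINTS)

def choose(candidates: List[Dict]) -> Optional[Dict]:
    """Pick the highest-scoring candidate that survives every constraint."""
    allowed = [a for a in candidates if not vetoed(a)]
    return max(allowed, key=lambda a: a["expected_goal_score"], default=None)

candidates = [
    {"name": "forcibly rewire everyone", "expected_goal_score": 0.99, "harms_humans": True},
    {"name": "ask people what they want", "expected_goal_score": 0.40},
]
print(choose(candidates)["name"])  # ask people what they want
```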