Maybe this has been said before, but here is a simple idea:
Directly specify a utility function U which you are not sure about, but also discount the AI’s own power as part of it. So the new utility function is U − power(AI), where power is a fast-growing function of a mix of the AI’s source-code complexity, intelligence, hardware, and electricity costs. One needs to be careful about how to define “self” in this case, since a careful redefinition by the AI would remove the controls.
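The proposal above can be sketched in a few lines. This is only a toy illustration: the function names and the quadratic “fast-growing” mix are my own placeholders, not a serious measure of power.

```python
# Toy sketch of the proposed penalized objective U - power(AI).
# Every function here is an illustrative placeholder, not a real power measure.

def power(source_complexity: float, intelligence: float,
          hardware: float, electricity_cost: float) -> float:
    """A fast-growing (here, quadratic) function of the mixed resource terms."""
    mix = source_complexity + intelligence + hardware + electricity_cost
    return mix ** 2

def penalized_utility(u: float, source_complexity: float, intelligence: float,
                      hardware: float, electricity_cost: float) -> float:
    """The modified objective: raw utility minus the power penalty."""
    return u - power(source_complexity, intelligence, hardware, electricity_cost)

# An action that doubles raw utility but also doubles every resource term can
# score worse than a modest action, because the penalty grows faster than U.
modest = penalized_utility(10.0, 1.0, 1.0, 0.5, 0.5)  # 10 - 3**2 = 1.0
greedy = penalized_utility(20.0, 2.0, 2.0, 1.0, 1.0)  # 20 - 6**2 = -16.0
```

The point of the fast-growing penalty is visible in the last two lines: scaling everything up makes the penalized score drop, so the agent has no built-in incentive to accumulate resources.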
One also needs to ensure that any subagents are created with properly penalized utilities as well, since in a naive implementation sub-agents would simply optimize U without restrictions.
This is likely not enough on its own, but it has the advantage that the AI has no a priori drive to become stronger, which is better than boxing an AI that does.
I’m not convinced that sufficiently intelligent agents would create subagents whose utility functions lack terms from the original’s UF, at least with suitable precautions. The example you used (an AI wanting to stay in the box that lets out an agent to convert all box-hazards into raw material) suggests that the Boxed AI would want to ensure the Unboxed Agent was Boxed-AI-Friendly. What would happen, then, if the Boxed AI held an unalterable belief that its utility function was likely to change in the future, in ways it couldn’t predict?
Some formalized distinction between intentional probability manipulation and unintentional but causally related happenings would be nice. Minimized intentional impact would then mean an AI that does not wish to take actions on issues of great impact and instead defers to humans. I’m not sure how it would behave when a human then deferred back to the AI. It seems like it would yield a sub-CEV result, because the human would be biased, scope-insensitive, prejudiced, and so on. The natural improvement would then be to have the AI implement DWIM CEV.
Has much thought gone into defining utility functions piecewise, or multiplicatively with respect to some epistemic probabilities? I’m not sure whether I’m just reiterating corrigibility here, but say an agent has a utility function equal to U·(1 − P(“H”)) + H·P(“H”), where P(“H”) is the probability that the Gatekeeper thinks the AI should be halted and H is a utility function rewarding halting and penalizing continuation. That is an attempt at a probabilistic piecewise UF approximating “if P(“H”) is high, then H, else U.”
Apologies for any incoherence; this is a straight-up brain dump.
Presumably anything caused to exist by the AI (including copies, sub-agents, and other AIs) would have to count toward the power(AI) term? That would stop the AI from spawning monsters which simply maximise U.
One problem is that any really valuable outcomes (under U) are also likely to require high power. This could lead to an AI that knows how to cure cancer but won’t tell anyone, because doing so would have a very high impact and hence a large power(AI) term. That situation is not going to be stable: the creators will find it irresistible to hack U to get the AI to speak up.
That’s an idea that a) will certainly not work as stated, b) could point the way to something very interesting.
I’m looking at ways round that kind of obstacle. I’ll be posting them someday if they work.