Maybe this has been said before, but here is a simple idea:
Directly specify a utility function U which you are not sure about, but also discount the AI’s own power as part of it. So the new utility function is U − power(AI), where power is a fast-growing function of a mix of the AI’s source-code complexity, intelligence, hardware, and electricity costs. One needs to be careful about how to define “self” in this case, since a careful redefinition by the AI would remove the controls.
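The proposal above can be sketched in a few lines. This is only a toy illustration: the function names and the quadratic “fast-growing” mix are my own placeholders, not a serious measure of power.

```python
# Toy sketch of the proposed penalized objective U - power(AI).
# Every function here is an illustrative placeholder, not a real power measure.

def power(source_complexity: float, intelligence: float,
          hardware: float, electricity_cost: float) -> float:
    """A fast-growing (here, quadratic) function of the mixed resource terms."""
    mix = source_complexity + intelligence + hardware + electricity_cost
    return mix ** 2

def penalized_utility(u: float, source_complexity: float, intelligence: float,
                      hardware: float, electricity_cost: float) -> float:
    """The modified objective: raw utility minus the power penalty."""
    return u - power(source_complexity, intelligence, hardware, electricity_cost)

# An action that doubles raw utility but also doubles every resource term can
# score worse than a modest action, because the penalty grows faster than U.
modest = penalized_utility(10.0, 1.0, 1.0, 0.5, 0.5)  # 10 - 3**2 = 1.0
greedy = penalized_utility(20.0, 2.0, 2.0, 1.0, 1.0)  # 20 - 6**2 = -16.0
```

The point of the fast-growing penalty is visible in the last two lines: scaling everything up makes the penalized score drop, so the agent has no built-in incentive to accumulate resources.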
One also needs to ensure that any subagents are created with properly penalized utilities as well, since in a naive implementation sub-agents would simply optimize U without restrictions.
This is likely not enough on its own, but it has the advantage that the AI has no a priori drive to become stronger, which is better than boxing an AI that does.
I’m not convinced that sufficiently intelligent agents would create subagents whose utility functions lack terms from the original’s UF, at least with suitable precautions. The example you used (an AI wanting to stay in the box that lets out an agent to convert all box-hazards into raw material) suggests that the Boxed AI would want to ensure the Unboxed Agent was Boxed-AI-Friendly. What would happen, then, if the Boxed AI held an unalterable belief that its utility function was likely to change in the future, in ways it couldn’t predict?
Some formalized distinction between intentional probability manipulation and unintentional but causally related happenings would be nice. Minimized intentional impact would then mean an AI that does not wish to take actions on issues of great impact and instead defers to humans. I’m not sure how it would behave when a human then deferred back to the AI. It seems like it would yield a sub-CEV result, because the human would be biased, scope-insensitive, prejudiced, and so on. The natural improvement would then be to have the AI implement DWIM CEV.
Has much thought gone into defining utility functions piecewise, or multiplicatively with respect to some epistemic probabilities? I’m not sure whether I’m just reiterating corrigibility here, but say an agent has a utility function equal to U·(1 − P(“H”)) + H·P(“H”), where P(“H”) is the probability that the Gatekeeper thinks the AI should be halted and H is a utility function rewarding halting and penalizing continuation. That is an attempt at a probabilistic piecewise UF approximating “if P(“H”) is high, then H, else U.”
Apologies for any incoherence; this is a straight-up brain dump.
Presumably anything caused to exist by the AI (including copies, sub-agents, and other AIs) would have to count toward the power(AI) term? That would stop the AI from spawning monsters which simply maximise U.
One problem is that any really valuable outcomes (under U) are also likely to require high power. This could lead to an AI that knows how to cure cancer but won’t tell anyone, because doing so would have a very high impact and hence a large power(AI) term. That situation is not going to be stable: the creators will find it irresistible to hack U to get the AI to speak up.
That’s an idea that a) will certainly not work as stated, b) could point the way to something very interesting.
I’m looking at ways round that kind of obstacle. I’ll be posting them someday if they work.