SRStarin comments on What do superintelligences really want? [Link]

SRStarin 25 Jan 2011 2:04 UTC
1 point
That only works out for your children because you, as a father, are unable to edit your fundamental reward function. I’m not clear on whether your comment is meant to be a concise restatement of the OP or some kind of counterexample: an example showing that even self-modifying intelligences must have a fundamental reward function that is not modifiable.

Just looking for clarity.
- TheOtherDave 25 Jan 2011 2:28 UTC
  2 points
  The linked-to article seems to be concluding that, because a self-modifying AI can modify its own utility function, its utility function is necessarily unstable.
  
  My point is that a system’s ability to modify its utility function doesn’t actually make it likely that its utility function will change, any more than my ability to consume hemlock makes it likely that I will do so.
  
  Even given the ability to edit my utility function, whether and how I choose to use that ability depends on whether I expect doing so to get me what I want, which is constrained by (among other things) my unmodified utility function.
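A minimal Python sketch of the point above, under deliberately toy assumptions. Every name here (`Agent`, `predict_outcome`, `consider_modification`, the outcome keys) is hypothetical and does not come from the thread. The agent is able to rewrite its utility function, but it scores each candidate rewrite using the utility function it has now.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy assumption: an "outcome" is just a dict of named quantities.
Outcome = Dict[str, float]
Utility = Callable[[Outcome], float]


@dataclass
class Agent:
    utility: Utility

    def predict_outcome(self, future_utility: Utility, options: List[Outcome]) -> Outcome:
        # Assumption: an agent running `future_utility` picks the option it scores highest.
        return max(options, key=future_utility)

    def consider_modification(self, candidate: Utility, options: List[Outcome]) -> bool:
        # Both futures are judged by the *current* utility function, not the candidate.
        keep = self.utility(self.predict_outcome(self.utility, options))
        switch = self.utility(self.predict_outcome(candidate, options))
        return switch > keep

    def maybe_modify(self, candidate: Utility, options: List[Outcome]) -> None:
        # The ability to self-modify is always present; it is only used if it pays off.
        if self.consider_modification(candidate, options):
            self.utility = candidate


def cares_about_children(o: Outcome) -> float:
    return o["children_wellbeing"]


def hedonist(o: Outcome) -> float:
    return o["personal_pleasure"]


options = [
    {"children_wellbeing": 10.0, "personal_pleasure": 2.0},
    {"children_wellbeing": 1.0, "personal_pleasure": 9.0},
]
agent = Agent(utility=cares_about_children)
agent.maybe_modify(hedonist, options)
print(agent.utility is cares_about_children)  # True: the rewrite is declined
```

Because the options in this sketch are fixed and don't react to the agent's values, keeping the current function always scores at least as well as any rewrite, so the agent never switches. In richer settings (for instance, where other agents respond to what you value) a rewrite could pay off by the current function's own lights; this sketch doesn't model that.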
  - SRStarin 25 Jan 2011 2:52 UTC
    0 points
    I don’t have data or studies to back this up, but I feel that humans have a strong tendency to return to their base state. Self-modifying AI would not do that. So, doesn’t it make sense that no AI should be made that doesn’t have a demonstrably strong tendency to return to its base state?
    
    That is, should it be a required, unmodifiable AI value that the base state has inherent value? This has the potential to counteract some of the worst UFAI (unfriendly AI) nightmares out there.
    - TheOtherDave 25 Jan 2011 3:02 UTC
      2 points
      What are you including in your notion of an AI’s “state”? It sounds rather like you’re saying it’s safer to build non-self-modifying AIs.
      
      Which is true, of course, but there are opportunity costs associated with that.
      - SRStarin 25 Jan 2011 4:24 UTC
        0 points
        Yes, it does seem safer to build non-self-modifying AIs. But I’m not quite saying that should be the limit. I’m saying that any AI that can self-modify ought to have a hard barrier where there is code that can’t be modified.
        
        I know there has been excitement here about a transhuman AI being able to bypass pretty much any control humans could devise (that excitement is the topic that first brought me here, in fact). But going for a century or so with AIs that can’t self-modify seems like a pretty good precaution, no?
        Perplexed 25 Jan 2011 12:03 UTC
        2 points
        But what counts as “self-modification”?
        
        Simply making a promise could be considered self-modification, since you presumably behave differently after making the promise than you would have counterfactually.
        
        Learning some fact about the world could be considered self-modification, for the same reason.
        
        Can we come up with a useful classification scheme, distinguishing safe forms of self-modification from unsafe forms? Or, what may amount to the same thing, can we give criteria for rationally self-modifying, for each class of self-modification? That is, for example, when is it rational to make promises? When is it rational to update our beliefs about the world?
        timtyler 25 Jan 2011 22:47 UTC
        0 points
        > But what counts as “self-modification”?
        
        Perhaps in this context: structural changes to yourself that are not changes to beliefs or memories, and that are not merely confined to repositioning your actuators or to day-to-day metabolism.
        
        > Can we come up with a useful classification scheme, distinguishing safe forms of self-modification from unsafe forms?
        
        You could whitelist safe kinds. That might be useful—under some circumstances.
        SRStarin 25 Jan 2011 15:31 UTC
        0 points
        Clearly, there are some internal values that an AI would need to be able to modify, or else it couldn’t learn. But I think there is good reason to disallow an AI from modifying its own rules for reward, at least to start out. An analogy in humans is that we can do some amazingly wonderful things, but some people go awry when they begin abusing drugs, thereby modifying their own reward circuitry. Severe addicts find they can’t manage a productive life, instead turning to crime to get just enough cash to feed their habits. I’d say there is inherent danger for human intelligences in short-circuiting or otherwise directly modifying our reward pathways (i.e. chemically), so there would likely be danger in allowing an AI to directly modify its reward pathways.
        topynate 25 Jan 2011 11:29 UTC
        0 points
        And how do you propose to stop them? Put a negative term in their reward functions?