Compare this with a″ = “build one paperclip”. Then your setup should, if I’m understanding it correctly, penalise a″ more than a’. Because a″ entails a divergence in future observations, while a’ does not.
This is not the case. I think there’s a confusion here—Q_u is not with respect to what we’ve seen so far. We aren’t taking the utility difference between u(h_action) and u(h_inaction). We’re instead comparing our ability to maximize each utility. How does a’ not drastically improve our ability to maximize arbitrary utilities, or at least use up a ton of energy in the process compared to not doing it?
That is, it’s not what we’ve seen so far that matters, per se. You’ll notice that the Q_u equation doesn’t even consider what happened before the new utility “takes over”.
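Roughly (eliding the post’s normalization and other details), the quantity being compared is:

Penalty(h, a) ≈ Σ_{u ∈ U} | Q_u(h a) − Q_u(h ∅) |,

where Q_u(h) is the u-value the agent could attain by acting optimally for u from h onward. The right-hand side never evaluates u on the part of the history before the new utility takes over, which is why what we’ve seen so far drops out.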
Consider the end of the plan – the agent is now in a position to spread and expand and complete arbitrary goals to a much greater degree than if it had just done nothing for an hour. The fact that human behavior seems the same isn’t actually very relevant.
The U^- do, however, all have fewer resources available now, even if we assume they don’t mind what just happened.
There is a slight flaw in the proof: you have to consider utilities that differ on histories where all the observations are the same, but the actions are not.
Can you expand? I thought that this was covered by the first part.
The subagent could, for example, hide caches of extra energy around the place (energy that would otherwise be wasted). Or maybe the human-robots are just more efficient, in ways that make up for any lost resources. Or, more worryingly, maybe the AI is destined to be run for a short while and then turned off, and a’ is the only action it could ever take to “escape”. Thus its expected utility is pretty much the same (and pretty low) for all the U’s.
(and if you include arbitrary utilities, these include impossible-to-not-affect ones, like ones dependent on the waste heat rate of the AI; I don’t see the complexity penalty corresponding well to an “unimportant to humans” measure).
The subagent could, for example, hide caches of extra energy around the place (energy that would otherwise be wasted).
Ok, supposing that can work: “build world dominating device and then spread into space” still drastically increases power.
Or maybe the human-robots are just more efficient, in ways that make up for any lost resources.
What are your intuitions for why building the sub-agent to begin with isn’t extremely instrumentally convergent?
Or, more worryingly, maybe the AI is destined to be run for a short while and then turned off, and a’ is the only action it could ever take to “escape”.
Lemma 3 proves it does nothing here. One of the main points of AUP is that it seems to cripple the survival incentive, so I’m confused why this came to mind.
its expected utility is pretty much the same (and pretty low) for all the U’s.
Utility != attainable utility.
“how many dollars I have” != “how many additional dollars I could have within a year if I acted optimally”
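A toy sketch of the distinction (made-up two-action world and hypothetical numbers, nothing to do with the actual penalty calculation):

```python
# Toy sketch: "utility now" vs. "attainable utility within a horizon, acting optimally".
# The world and the numbers are made up purely for illustration.

def utility(state):
    """u = dollars currently held."""
    return state["dollars"]

def step(state, action):
    """Tiny deterministic world: 'work' earns $100 per step, 'wait' earns nothing."""
    earned = 100 if action == "work" else 0
    return {"dollars": state["dollars"] + earned}

def attainable_utility(state, horizon):
    """Best utility reachable within `horizon` steps, found by brute-force search."""
    if horizon == 0:
        return utility(state)
    return max(attainable_utility(step(state, a), horizon - 1)
               for a in ("work", "wait"))

s = {"dollars": 5}
print(utility(s))                 # 5    -- "how many dollars I have"
print(attainable_utility(s, 12))  # 1205 -- "how many dollars I could have in 12 steps"
```

The two numbers come apart completely: an action can leave the first unchanged while drastically moving the second.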
like ones dependent on the waste heat rate of the AI; I don’t see the complexity penalty corresponding well to an “unimportant to humans” measure
Again, it’s not about utility change, but attainable utility change. The distinction is crucial.
I assert that we don’t need such a measure, for the reasons that I outlined in Utility Selection. That is, the content of the utilities seems to not actually be that important.
How can an agent become really powerful without also becoming really powerful, therefore getting penalized?
Ok, supposing that can work: “build world dominating device and then spread into space” still drastically increases power.
Not if the subagent is designed to not allow the increase in power. As in, the subagent is designed to maximise uA, but also to ensure the penalty term is as small as possible. The subagent is deliberately beyond the control of the initial agent.
(is this the issue we are disagreeing about, or have I misunderstood?)
I think that you think that the method uses expected utility (which it doesn’t), so we’re secretly talking about really different things. The method doesn’t actually break in the ways you’ve said so far (as far as I can tell), but an expected utility method would. Compare how the post’s examples break down under your model, and contrast with the math I provided for attainable utilities.
For example, the “EU difference over plans” model applies a .316 penalty to disabling the off-switch (due to coincidentally capturing the change in the agent’s actual utility for that time step, since it’s in the left square), yet my calculations show a .95 penalty. Along these lines, AUP provably never tries to escape certain shutdown (that is, when it expects to be shut down no matter which action it chooses at the present step), while an EU method would have no compunctions about doing so in many situations (as you pointed out earlier).
Anyways, in general, supposing that the agent instantaneously builds successors that do things it can’t control either impedes its ability to maximize other objectives (if the agent can’t hijack this new powerful agent), or improves it (if the agent can hijack this new powerful agent).
Suppose the AI is addressing a letter containing $1,000,000. It can address this to Jane Brown, or to John Smith. Once addressed, the AI will be turned off, and the letter will be posted.
A utility uB that values Jane Brown would like the letter addressed to her, and vice versa for a utility uS that values John Smith. These two utilities differ only on the action the AI takes, not on subsequent observations. Therefore “This implies that by choosing a, the agent expects to observe some uA-high scoring oA with greater probability than if it had selected ∅” is false: it need not expect its observations to differ at all.
However, the theorem is still true, because we just need to consider utilities that differ on actions, such as uB and uS.
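One way to formalize the pair, treating a history h as an action-observation sequence: let uB(h) = 1 if h contains the action “address the letter to Jane Brown” and 0 otherwise, and uS(h) = 1 if h contains the action “address the letter to John Smith” and 0 otherwise. Both are constant in the observations, so the quoted step about expecting different observations says nothing about them; yet they do distinguish the two addressing actions, which is what the theorem needs.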
Here is a writeup of the problem I believe your method has: https://www.lesswrong.com/posts/6EMdmeosYPdn74wuG/wireheading-as-potential-problem-with-the-new-impact-measure