In 2.2, won’t A incur a penalty by spinning? In a future where it has only waited, nothing happened; in a future where it has spun and then waited, SA went all over the place.
“Do nothing until you see that A is not optimising reward R.”

Now SA’s actions depend on what A-action optimizes R, and what A-action optimizes R depends on SA’s actions. To ward off paradox, use modal logic instead, or prove that there is a non-circular implementation of your definition.
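A minimal sketch of the circularity, with made-up function names (nothing below is from the post): SA’s policy calls a check of whether A’s action optimises R, and that check has to predict SA’s reaction, so the “definition” never bottoms out.

```python
# Illustrative only: hypothetical names, toy stubs for the R-evaluation.
def value_of_r(state, a_action, sa_action):
    return 0.0  # stub: R-value if A plays a_action and SA plays sa_action

def best_r_value(state):
    return 0.0  # stub: best R-value A could have achieved in this state

def sa_policy(state, a_action):
    # "Do nothing until you see that A is not optimising reward R."
    if a_action_optimises_r(state, a_action):
        return "do_nothing"
    return "intervene"

def a_action_optimises_r(state, a_action):
    # To judge A's action we must predict SA's reaction...
    sa_action = sa_policy(state, a_action)  # ...which calls back into this check.
    return value_of_r(state, a_action, sa_action) >= best_r_value(state)

# Calling sa_policy(...) recurses between the two functions forever (until
# RecursionError): that non-termination is the circularity described above.
```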
Thank you for pointing that out; I have corrected the post and changed that.
A knows exactly what SA’s policy is (as it programmed SA), so it knows how SA would react, and can choose its own action to optimise R via both agents, or not. And, knowing its own policy, SA could run the same algorithm to check whether A was R-optimal or not.
Also, since A and SA are collaborating, they could arrange clear ways for A to demonstrate whether or not it’s maximising R. A could give SA read access to its own mind, for example.
Mind-reading violates the Cartesian assumption, so we can’t reason about it formally (yet!), but I think there’s a way of effectively getting what you’re after that doesn’t.
Well, as long as SA is wired to “get out of the way if A starts moving”, then the optimal R-maximising policy is always to move towards the red button; anything else is clearly not R-maximising (note that SA doesn’t need to “know” anything; just be programmed to have a different policy depending on how A moves, with A itself setting this up to signal whether it’s R-maximising or not).
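A minimal sketch of this wiring, with made-up names (only the red button and the “get out of the way if A starts moving” rule come from the comment above): SA’s behaviour is a fixed function of A’s observed move, and A, which wrote that function, can check that heading for the button is the R-maximising move.

```python
# Illustrative only: SA's policy is a fixed, pre-programmed reaction to A's
# observable move; it needs no model of R at all.
def sa_policy(observed_a_move: str) -> str:
    if observed_a_move == "move_toward_red_button":
        return "get_out_of_the_way"
    return "do_nothing"

# A knows sa_policy exactly (it programmed SA), so it can evaluate its own
# candidate moves through both agents' behaviour. Toy R: 1 iff A heads for
# the button and SA clears the path, 0 otherwise.
def r_value(a_move: str) -> float:
    if a_move == "move_toward_red_button" and sa_policy(a_move) == "get_out_of_the_way":
        return 1.0
    return 0.0

candidate_moves = ["move_toward_red_button", "spin", "wait"]
print(max(candidate_moves, key=r_value))  # -> move_toward_red_button
```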
But in any case, that specific problem can be overcome with the right rollouts.
Okay, let’s annotate each A action with the policy that’s being followed/reward that’s being maximized. (And remember that lying is illegal.)
“Iff agent A follows π_0, preserve A’s ability to maximise R.”

Then A would be bound to follow π_0 to preserve its ability to maximize R, no? Assuming that to compute s′ from s, we follow π_0 instead of the last action.
The standard definition of the penalty uses ∅ as the action used to compute s′. If the same policy is used to compute s′ and for the rollouts (which is ideal), then the stepwise inaction/π_0 baseline becomes an inaction/π_0 baseline (so section 3 is relevant here).
Section 2.2 is relevant when different policies are used to compute s′ and to roll out from s. In that situation, the subagent can ultimately end up with no restrictions at all (basically, this section shows why you want to use the same policy for both purposes).
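To put rough symbols on this (my notation, not a quote from the post): write π_b for the policy used to generate the baseline state and ρ for the rollout policy, so the penalty at step t is something like

$$\text{Pen}_t \;=\; \sum_i \Big|\, V^{\rho}_{R_i}(s_t) \;-\; V^{\rho}_{R_i}(s'_t) \,\Big| ,$$

where s′_t is reached from s_{t−1} by following π_b (the standard choice being the noop ∅) and each V^ρ_{R_i} is estimated by rolling out ρ from the given state. The paragraph above is about the case π_b = ρ; section 2.2 is about π_b ≠ ρ.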
It’s only equal to the inaction baseline on the first step; its step of divergence is always the last step.
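Spelling that out (my notation): write a_1, …, a_{t−1} for A’s actual earlier actions and ∅ for the noop. The branch compared against at step t is

$$ (a_1, \dots, a_{t-1}, \varnothing, \varnothing, \dots), \qquad \text{versus the inaction baseline} \qquad (\varnothing, \varnothing, \varnothing, \dots). $$

These coincide only at t = 1 (or if A happened to play ∅ all along); for any later t, the compared branch follows A’s real history and only switches to ∅ at the current step, so the divergence is always at the most recent step.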
Note that the stepwise π_0 baseline suggests using different baselines per auxiliary reward, namely the action that maximizes that auxiliary reward. Or equivalently, using the stepwise inaction baseline where the effect of inaction is that no time passes.
I’ll also note again here that it looks like, instead of merely maximizing the auxiliary reward to get the baseline, we ought to also apply an impact penalty when computing the baseline.
I’m not following you here. Could you put this into equations/examples?
Here are three sentences that might illuminate their respective paragraphs. If they don’t, ask again.
The stepwise inaction baseline with inaction rollouts already uses the same policy for s′_t and rollouts, and yet it is not the inaction baseline.
Why not set s′_t = s_{t−1}?
Why not subtract ωD from every R (in a fixpointy way)?
In this case, it is, because the agent A will only do ∅ from then on, to zero out the subsequent penalties.
It messes up the comparison for rewards that fluctuate based on time; it doesn’t block subagent creation… and I’ve never seen it before, so I don’t know what it could do ^_^ Do you have a well-developed version of this?
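A toy illustration of the fluctuating-reward problem, with a made-up auxiliary reward: suppose R_i pays 1 at a fixed time T and 0 otherwise, with discount γ. Then

$$ V_{R_i}(s_{t-1}) = \gamma^{\,T-(t-1)}, \qquad V_{R_i}(s_t) = \gamma^{\,T-t}, $$

so with s′_t = s_{t−1} the comparison gives a penalty of γ^{T−t}(1 − γ) > 0 even if A does nothing at all, purely because a step of time has passed; with the usual stepwise baseline, s′_t is also a time-t state, so this clock effect cancels.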
The last point I don’t understand at all.