Stuart_Armstrong comments on Subagents and impact measures, full and fully illustrated

Stuart_Armstrong 25 Feb 2020 20:56 UTC
2 points
I’m not following you here. Could you put this into equations/examples?
- Gurkenglas 26 Feb 2020 0:00 UTC
  1 point
  Here are three sentences, each meant to illuminate its respective paragraph. If they don’t, ask again.
  The stepwise inaction baseline with inaction rollouts already uses the same policy for $s_{t}^{'}$ and rollouts, and yet it is not the inaction baseline.
  Why not set $s_{t}^{'} = s_{t - 1}$ ?
  Why not subtract $\omega D$ from every $R$ (in a fixpointy way)?
  - Stuart_Armstrong 26 Feb 2020 11:32 UTC
    2 points
    
    The stepwise inaction baseline with inaction rollouts already uses the same policy for $s_{t}^{'}$ and rollouts, and yet it is not the inaction baseline.
    
    In this case, it is, because the agent $A$ will only do $\emptyset$ from then on, to zero out the subsequent penalties.
    
    Why not set $s_{t}^{'} = s_{t - 1}$ ?
    
    It messes up the comparison for rewards that fluctuate over time, and it doesn’t block subagent creation… and I’ve never seen it before, so I don’t know what it could do ^_^ Do you have a well-developed version of this?
    
    The last point I don’t understand at all.
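
One possible reading of the "rewards that fluctuate based on time" objection, as a toy sketch rather than the actual impact measure from the post: the environment, the `reward` function, and the use of an immediate reward difference in place of a value difference are all assumptions made for illustration. It compares the stepwise inaction baseline (the state that $\emptyset$ would have produced from $s_{t-1}$) with the proposed $s_{t}^{'} = s_{t - 1}$.

```python
# Toy sketch (all names hypothetical): compare two choices of baseline state s'_t
# when the reward depends on the time step.

NOOP, MOVE = "noop", "move"

def step(state, action):
    """Advance time by one step; MOVE also increments the position x."""
    t, x = state
    return (t + 1, x + (1 if action == MOVE else 0))

def reward(state):
    """A reward that fluctuates with time: position plus a 0/1 term alternating each step."""
    t, x = state
    return x + (t % 2)

def penalty(state, baseline):
    """Simplified penalty: absolute reward difference (an assumed stand-in for a value difference)."""
    return abs(reward(state) - reward(baseline))

def penalties(actions, start=(0, 0)):
    """Return (stepwise_inaction_penalty, previous_state_penalty) for each step."""
    out = []
    prev = start
    for a in actions:
        cur = step(prev, a)
        stepwise_baseline = step(prev, NOOP)  # s'_t: what the noop would have produced
        previous_baseline = prev              # s'_t = s_{t-1}
        out.append((penalty(cur, stepwise_baseline), penalty(cur, previous_baseline)))
        prev = cur
    return out

if __name__ == "__main__":
    print(penalties([NOOP, NOOP, NOOP]))
```

In this toy, an agent that only ever does the noop gets zero penalty under the stepwise inaction baseline at every step, but a nonzero penalty under the previous-state baseline, purely because the reward changed with the clock.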