Gurkenglas comments on Subagents and impact measures, full and fully illustrated

Gurkenglas 26 Feb 2020 0:00 UTC
1 point
Here are three sentences that might illuminate their respective paragraphs. If they don’t, ask again.
The stepwise inaction baseline with inaction rollouts already uses the same policy for $s_{t}^{'}$ and rollouts, and yet it is not the inaction baseline.
Why not set $s_{t}^{'} = s_{t - 1}$?
Why not subtract $\omega D$ from every $R$ (in a fixpointy way)?
What links here?
- Gurkenglas's comment on Power as Easily Exploitable Opportunities by TurnTrout (1 Aug 2020 13:03 UTC; 2 points)
- Stuart_Armstrong 26 Feb 2020 11:32 UTC
  2 points
  Parent
  
  The stepwise inaction baseline with inaction rollouts already uses the same policy for $s_{t}^{'}$ and rollouts, and yet it is not the inaction baseline.
  
  In this case, it is, because the agent $A$ will only do $\emptyset$ from then on, to zero out the subsequent penalties.
  
  Why not set $s_{t}^{'} = s_{t - 1}$?
  
  It messes up the comparison for rewards that fluctuate based on time, it doesn’t block subagent creation… and I’ve never seen it before, so I don’t know what it could do ^_^ Do you have a well-developed version of this?
  
  The last point I don’t understand at all.