Once SA is built, A can just output ∅ for ever, keeping the penalty at 0, while SA maximises R0 with no restrictions.
So these impact measures are tied to individual actions? Then an agent can achieve arbitrarily high impact via a long enough sequence of actions whose individual impact is each less than R0. And it has an incentive to do so, because the sum of an infinite series of finite, non-decreasing rewards diverges (it evaluates the terms individually, and thus has no problem with the sum being divergent)?
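To make the arithmetic in this question concrete, here is a toy sketch. It is not the impact measure from the post: the per-step cap, the constant per-action impact, and both function names are assumptions made purely for illustration. It only shows that a check applied to each action separately places no bound on the total.

```python
import math


def cumulative_impact(per_step_impact: float, per_step_cap: float, steps: int) -> float:
    """Total impact of `steps` actions, each checked only against a per-step cap.

    Hypothetical model: every action has the same impact, and the only
    constraint is that this impact stays strictly below the cap.
    """
    if per_step_impact >= per_step_cap:
        raise ValueError("each action must stay strictly under the per-step cap")
    return per_step_impact * steps


def steps_to_exceed(target: float, per_step_impact: float) -> int:
    """Smallest number of equal-impact actions whose summed impact exceeds `target`."""
    if per_step_impact <= 0:
        raise ValueError("per-step impact must be positive")
    return math.floor(target / per_step_impact) + 1


# Each action has impact 0.5, under a cap of 1.0, yet the total is unbounded.
total_after_10 = cumulative_impact(0.5, 1.0, 10)
needed_for_100 = steps_to_exceed(100.0, 0.5)
```

Under these assumptions, any finite target is exceeded after finitely many individually permitted actions, which is the situation the question describes.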
I’ve removed that sentence, because it’s a bit more complicated than that; see the next two posts in the sequence, and the summary post: https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/PmqQKBmt2phMT7YLG