Here is a quick explanation. Getting counterfactually mugged is a good thing because any agent that would not get counterfactually mugged would self-modify into one that would, before it sees the value of the coin: evaluated before the coin flip, the paying policy has higher expected value.
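A minimal sketch of that ex-ante calculation, using the standard illustrative payoffs (a fair coin, pay $100 on tails, receive $10,000 on heads if and only if you are the kind of agent that would have paid); the specific numbers are assumptions for illustration, not part of the argument:

```python
PROB_HEADS = 0.5
REWARD_IF_PAYER = 10_000  # paid on heads, but only to agents that would pay on tails
COST_OF_PAYING = 100      # paid out of pocket on tails

def expected_value(pays_when_mugged: bool) -> float:
    """Ex-ante expected value of a policy, evaluated before the coin flip."""
    value_on_heads = REWARD_IF_PAYER if pays_when_mugged else 0
    value_on_tails = -COST_OF_PAYING if pays_when_mugged else 0
    return PROB_HEADS * value_on_heads + (1 - PROB_HEADS) * value_on_tails

# Before seeing the coin, the paying policy dominates (4950 vs. 0),
# so a self-modifying expected-utility maximizer rewrites itself to pay.
assert expected_value(True) > expected_value(False)
```

After the coin lands tails, paying looks like a pure loss, which is exactly why the self-modification has to happen before the agent sees the outcome.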
There are some subagent alignment approaches that require some understanding of the agent's internals, for example transparency and informed oversight (in informed oversight, we even assume that the agent we are aligning is less powerful, just not exponentially less powerful), and other approaches that treat the agent you are trying to align as a black box.