It’s worth noting that you don’t necessarily need to train models to actually do dangerous actions like literally executing on a takeover attempt, you can just train models which do something which is a proxy to coups (or a proxy to some part of coups).
The extent to which this proxy itself dangerous or generalizing in a dangerous way will vary.
Like suppose I want to see if my oversight process discovers when AIs try to insert malign backdoor for the purposes of takeover. I can just train an AI to insert bugs which cause some specific observable effect later (e.g., the code should delete some specific file in some circumstance) and this observable effect doesn’t itself need to be dangerous.
It’s worth noting that you don’t necessarily need to train models to actually do dangerous actions like literally executing on a takeover attempt, you can just train models which do something which is a proxy to coups (or a proxy to some part of coups).
The extent to which this proxy itself dangerous or generalizing in a dangerous way will vary.
Like suppose I want to see if my oversight process discovers when AIs try to insert malign backdoor for the purposes of takeover. I can just train an AI to insert bugs which cause some specific observable effect later (e.g., the code should delete some specific file in some circumstance) and this observable effect doesn’t itself need to be dangerous.