I buy the argument that scheming won’t happen, conditional on us not allowing much slack between different optimisation steps. As Quentin mentions in his AXRP podcast episode, SGD has nowhere near the level of slack that, for example, cultural evolution allowed. (See the earlier free energy of optimisation debate on here; I can’t remember the post names ;/)
Iff that holds, then I don’t see why the inner behaviour should diverge from what the outer alignment loop specifies.
I do, however, believe it is important to make sure this actually holds, by specifying the right outer alignment loop as well as the right deployment environment, so that slack is minimised at every point along the chain and misalignment is avoided everywhere.
If we catch deception in training, we will be ok. If we catch actors that might create deceptive agents in training, then we will be ok. If we catch states developing agents to do this, or if defense > offense, then we will be ok. I do not believe that any of this happens by default.