We’re doing meta-learning. During training, the network is not learning about the real world, it’s learning how to be a safe predictor. It’s interacting with a synthetic environment, so a misprediction doesn’t have any catastrophic effects: it only teaches the algorithm that this version of the predictor is unsafe. In other words, the malign subagents have no way to attack during training because they can access little information about what the real universe is like. The training process is designed to select predictors that only make predictions when they can be confident, and the training performance allows us to verify this goal has truly been achieved.
We’re doing meta-learning. During training, the network is not learning about the real world, it’s learning how to be a safe predictor. It’s interacting with a synthetic environment, so a misprediction doesn’t have any catastrophic effects: it only teaches the algorithm that this version of the predictor is unsafe. In other words, the malign subagents have no way to attack during training because they can access little information about what the real universe is like. The training process is designed to select predictors that only make predictions when they can be confident, and the training performance allows us to verify this goal has truly been achieved.