Vanessa Kosoy comments on Formal Solution to the Inner Alignment Problem

Vanessa Kosoy 28 Feb 2021 12:00 UTC
LW: 3 AF: 2
0
AF
We’re doing meta-learning. During training, the network is not learning about the real world, it’s learning how to be a safe predictor. It’s interacting with a synthetic environment, so a misprediction doesn’t have any catastrophic effects: it only teaches the algorithm that this version of the predictor is unsafe. In other words, the malign subagents have no way to attack during training because they can access little information about what the real universe is like. The training process is designed to select predictors that only make predictions when they can be confident, and the training performance allows us to verify this goal has truly been achieved.