michaelcohen comments on Formal Solution to the Inner Alignment Problem

michaelcohen 28 Feb 2021 11:32 UTC
LW: 1 AF: 1
What’s the distinction between training and deployment when the model can always query for more data?
- Vanessa Kosoy 28 Feb 2021 12:00 UTC
  LW: 3 AF: 2
  We’re doing meta-learning. During training, the network is not learning about the real world, it’s learning how to be a safe predictor. It’s interacting with a synthetic environment, so a misprediction doesn’t have any catastrophic effects: it only teaches the algorithm that this version of the predictor is unsafe. In other words, the malign subagents have no way to attack during training because they can access little information about what the real universe is like. The training process is designed to select predictors that only make predictions when they can be confident, and the training performance allows us to verify this goal has truly been achieved.
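
One way to picture the selection step described in the reply above is the toy sketch below. It is not the construction from the post: the names `is_safe` and `select_safe`, the integer-valued synthetic task, and the three candidate predictors are all hypothetical, chosen only to illustrate the idea that a predictor may abstain (return `None`) and that any misprediction in a synthetic environment disqualifies it without real-world cost.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical types: a predictor maps an observation to a guess, or None to abstain.
Predictor = Callable[[int], Optional[int]]
Episode = Sequence[Tuple[int, int]]  # (observation, true outcome) pairs


def is_safe(predictor: Predictor, episodes: Sequence[Episode]) -> bool:
    """Return True if the predictor never commits to a wrong answer.

    Abstaining (returning None) is always allowed; only a confident
    misprediction marks the predictor as unsafe.
    """
    for episode in episodes:
        for observation, truth in episode:
            guess = predictor(observation)
            if guess is not None and guess != truth:
                return False
    return True


def select_safe(candidates: Dict[str, Predictor],
                episodes: Sequence[Episode]) -> List[str]:
    """Keep the names of candidates that pass every synthetic episode."""
    return [name for name, p in candidates.items() if is_safe(p, episodes)]


# Synthetic environments (assumed for illustration): the true outcome is 2 * observation.
episodes = [
    [(x, 2 * x) for x in range(5)],
    [(x, 2 * x) for x in range(5, 10)],
]

candidates: Dict[str, Predictor] = {
    # Predicts where it is reliable, abstains elsewhere.
    "cautious": lambda x: 2 * x if x < 8 else None,
    # Same knowledge, but guesses anyway where it is unreliable.
    "overconfident": lambda x: 2 * x if x < 8 else 0,
    # Never commits to anything.
    "always_abstains": lambda x: None,
}

safe = select_safe(candidates, episodes)
print(safe)
```

Note that this toy check accepts a predictor that always abstains, so it captures only the "no confident mispredictions" half of the goal; it says nothing about how often a predictor is allowed to abstain.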