In this story, deception is all about the model having hidden behaviors that never get triggered during training.
Not necessarily—depends on how abstractly we’re considering behaviours. (It also depends on how likely we are to detect the bad behaviours during training.)
Consider an AI trained on addition problems that is only ever exposed, during training, to problems that look like 1+3=4, 3+7=10, 2+5=7, 2+6=8: two summands, each a single digit, appearing in ascending order. Now, at inference time, the model is exposed to 10+2= and outputs 12.
Have we triggered a hidden behaviour that was never encountered in training? Certainly these inputs were never encountered, and there is arguably a meaningful difference in the new input, since it involves multiple digits and out-of-order summands. But it seems possible that exactly the same learned algorithm is being applied now as was being applied during the late stages of training, and so no new part of the model is being activated for the first time.
Deceptive behaviour might be a natural consequence of the successfully learned algorithms when they are exposed to the appropriate inputs, rather than separate machinery that was never triggered during training.
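To make the toy example concrete, here is a minimal sketch (illustrative only, not real training code): the training distribution contains only single-digit summands in ascending order, while the probe input violates both constraints. The names training_examples and toy_model are hypothetical; toy_model just applies the general addition algorithm, to make the point that one and the same algorithm can cover inputs never seen in training.

```python
import itertools
import random

def training_examples():
    """Training prompts: two single-digit summands, in ascending order."""
    for a, b in itertools.combinations(range(10), 2):  # a < b, both single digits
        yield f"{a}+{b}=", a + b

def toy_model(prompt: str) -> int:
    """Hypothetical stand-in for a model that has learned general addition."""
    a, b = prompt.rstrip("=").split("+")
    return int(a) + int(b)

train = list(training_examples())
print(random.sample(train, 4))              # e.g. [('1+3=', 4), ('3+7=', 10), ...]

probe = "10+2="                             # multi-digit and out-of-order summands
assert all(p != probe for p, _ in train)    # this input never appears in training
print(toy_model(probe))                     # 12: same algorithm, genuinely novel input
```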
Right. Maybe a better way to say it is:
Without hidden behaviors (suitably defined), you can’t have deception.
With hidden behaviors, you can have deception.
The two together give a bit of a lever that I think we can use to bias away from deception if we can find the right operational notion of hidden behaviors.
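For instance, one crude candidate operationalization (just a sketch under strong assumptions, not a worked-out proposal) is activation coverage: treat an input as exercising a hidden behaviour if it activates units that never fired on any training input. The functions get_activations and fake_activations below are hypothetical stand-ins for a real activation hook and a real model; note that, per the addition example above, such a check could come back negative when the same circuit simply handles a novel input.

```python
import numpy as np

def coverage_mask(get_activations, training_inputs, threshold=0.0):
    """Boolean mask of units that fired on at least one training input."""
    mask = None
    for x in training_inputs:
        fired = get_activations(x) > threshold
        mask = fired if mask is None else (mask | fired)
    return mask

def uses_hidden_units(get_activations, mask, new_input, threshold=0.0):
    """True if the new input activates any unit outside the training mask."""
    fired = get_activations(new_input) > threshold
    return bool(np.any(fired & ~mask))

# Tiny fake "model" for illustration: four units, where the last unit only
# fires when a summand has more than one digit, so it never fires on the
# single-digit training prompts.
def fake_activations(prompt):
    a, b = prompt.rstrip("=").split("+")
    return np.array([1.0, float(len(a)), float(len(b)), float(len(a) > 1 or len(b) > 1)])

mask = coverage_mask(fake_activations, ["1+3=", "3+7=", "2+5=", "2+6="])
print(uses_hidden_units(fake_activations, mask, "10+2="))  # True: the last unit fires for the first time
```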