Donald Hobson comments on Confucianism in AI Alignment

Donald Hobson 3 Nov 2020 11:42 UTC
LW: 6 AF: 4
AF
If an inner optimizer could exploit some distribution shift between the training and deployment environments, then performance-in-training is a bad proxy for performance-in-deployment.
Suppose you are making a self driving car. The training environment is a videogame like environment. The rendering is pretty good. A human looking at the footage would not easily be able to say it was obviously fake. An expert going over the footage in detail could spot subtle artefacts. The diffuse translucency on leaves in the background isn’t quite right. When another car drives through a puddle, all the water drops are perfectly spherical, and travel on parabolic paths. Falling snow doesn’t experience aerodynamic turbulence. Etc.
The point is that the behaviour you want is avoiding other cars and lamp posts. The simulation is close enough to reality that it is easy to match virtual lamp posts to real ones. However the training and testing environments have a different distribution.
Making the simulated environment absolutely pixel perfect would be very hard, and doesn’t seem like it should be necessary.
However, given even a slight variation between training and the real world, there exists an agent that will behave well in training, but cause problems in the real world. And also an agent that behaves fine in training and the real world. The set of possible behaviours is vast. You can’t consider all of them. You can’t even store a single arbitrary behaviour. Because you cant train on all possible situations, there will be behaviours that behave the same on all the training situations, but behave differently in other situations. You need some part of your design that favours some policies over others without training data. For example, you might want a policy that can be described as parameters in a particular neural net. You have to look at how this effects off distribution actions.
The analogous situation with managers would be that the person being tested knows they are being tested. If you get them to display benevolent leadership, then you can’t distinguish benevolent leaders from sociopaths who can act nice to pass the test.