A few thoughts:
- Probably you want the User's utility function to be a black box (like a large NN) that is computationally hard to model thoroughly, so the Predictor is forced to learn conservatism (a minimal sketch of this setup is at the end of this comment).
- It seems likely the Predictor would fail in dumb ways, and those failures would serve as teachable moments for alignment researchers.
- This doesn't have to be the "only" type of environment in a training curriculum, but it should certainly be one class of environments.
- Plausibly this is the kind of thing we'd want to use proto-AGIs for (as part of a crunch-time sprint), so working out the kinks early seems very valuable.
This is a good idea to try. I hereby request you to either make it happen, or get other people to make it happen. (Trying to avoid a bystander effect :) )
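To make the black-box-utility point concrete, here is a minimal sketch (my own illustration, not something from the post): the User's utility is a frozen, randomly initialized MLP that the Predictor can only query on states it actually reaches, never inspect directly. The names (`BlackBoxUtility`, `predictor_policy`), the network shape, and the toy dynamics are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

class BlackBoxUtility:
    """Frozen, randomly initialized MLP standing in for the User's utility.

    The Predictor only ever sees scalar outputs on states it queries, so it
    cannot cheaply build an exact model of the utility and has to hedge
    (act conservatively) wherever its queries leave the utility uncertain.
    """
    def __init__(self, obs_dim: int, hidden: int = 256, depth: int = 3):
        dims = [obs_dim] + [hidden] * depth + [1]
        self.weights = [
            rng.normal(0.0, 1.0 / np.sqrt(dims[i]), size=(dims[i], dims[i + 1]))
            for i in range(len(dims) - 1)
        ]

    def __call__(self, obs: np.ndarray) -> float:
        h = obs
        for w in self.weights[:-1]:
            h = np.tanh(h @ w)
        return (h @ self.weights[-1]).item()


def predictor_policy(obs: np.ndarray) -> np.ndarray:
    """Hypothetical conservative policy: only small perturbations of the state."""
    return obs + 0.01 * rng.normal(size=obs.shape)


if __name__ == "__main__":
    obs_dim = 16
    utility = BlackBoxUtility(obs_dim)   # weights are never exposed to the Predictor
    obs = rng.normal(size=obs_dim)
    for step in range(10):
        obs = predictor_policy(obs)
        reward = utility(obs)            # black-box query: scalar feedback only
        print(f"step {step}: utility {reward:+.3f}")
```

The particular width and depth are arbitrary; the point is only that exactly modeling the utility is expensive relative to the Predictor's budget, so "be careful where you're uncertain" becomes the cheap strategy.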