I don’t see why to separate 1 and 2; the goal is to find training data that describes some “universal” core for behavior.
It seems to me there are two separate risks: the human HBO itself not being universal (e.g., humans are not universal, or we need even higher bandwidth to be universal), and not being able to capture enough of the human HBO input/output function in a dataset to train an AI to be universal.
3. I don’t think you need to know the training distribution. You just need something that points you back in the direction of the universal core where the human model is competent, e.g. an appropriate notion of simplicity.
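(For concreteness, here is a minimal sketch of what a simplicity-based pressure back toward the universal core might look like as a training-time regularizer; the L2 penalty and its weight are placeholder assumptions for illustration, not the actual notion of simplicity being proposed.)

```python
# Illustrative only: "an appropriate notion of simplicity" is stood in for by a
# plain L2 penalty on parameters; the real notion would presumably be richer.
import numpy as np

def l2_complexity(params):
    """Toy simplicity measure: total squared L2 norm of the parameters."""
    return sum(float(np.sum(p ** 2)) for p in params)

def regularized_loss(task_loss, params, simplicity_weight=1e-3):
    """Task loss plus a pressure back toward 'simpler' models."""
    return task_loss + simplicity_weight * l2_complexity(params)
```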
What if the path towards the universal core goes through an area the AI wasn’t trained on?
This is the endpoint of improvements in these techniques.
I think that makes sense, but now you’re making a conjunctive argument instead of a disjunctive one (which is what you seemed to be claiming when you said “I think there are lots of possible approaches for dealing with this problem” and listed retraining and optimizing worst-case performance as separate approaches).
ETA:
If you’re able to obtain a control guarantee over the whole input space, then that seems to solve the problem and you don’t need constant retraining to stay aligned. If you’re only able to obtain it for some subset of inputs, then it seems that at time T the AI needs to be able to predict the test distribution at time T+1, so that it can make sure that distribution is covered by the control guarantee.
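(A toy sketch of the second case, just to make the concern concrete: `in_verified_region`, `retrain`, and `fallback` are hypothetical placeholders, not anything from the actual proposal. If the guarantee only covers a verified subset of inputs, then at deployment you have to check, or predict ahead of time, that the inputs you’ll face fall inside that subset, and do something else when they don’t.)

```python
# Hypothetical illustration of a control guarantee that covers only a subset of
# inputs. Nothing here is from the original discussion; the helpers are made up.

def deploy_step(model, inputs, in_verified_region, retrain, fallback):
    """Act only on inputs covered by the guarantee; retrain/defer on the rest."""
    covered = [x for x in inputs if in_verified_region(x)]
    uncovered = [x for x in inputs if not in_verified_region(x)]

    if uncovered:
        # The guarantee says nothing about these inputs, so we can't just trust
        # the model's behaviour there: extend coverage by retraining, and use a
        # safe fallback in the meantime.
        model = retrain(model, uncovered)
        for x in uncovered:
            fallback(x)

    return model, [model(x) for x in covered]
```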