I don’t see why to separate 1/2; the goal is to find training data that describes some “universal” core for behavior.
3. I don’t think you need to know the training distribution. You just need something that points you back in the direction of the universal core where the human model is competent, e.g. an appropriate notion of simplicity.
4. Hard-to-predict inputs aren’t intrinsically a problem. If your agent fails malignly on input x, but not on distribution D, then your agent is able to distinguish x from D. So the difficulty comes from inputs that are easy-to-recognize but hard-to-generate. These certainly exist (e.g. consider a model which kills everyone given a signed edict from the pope). I think the most likely approach is to “reach inside” the model in order to stress test the behavior on inputs that you can’t actually synthesize (e.g. by understanding that it is checking the pope’s signature, and just seeing what would happen if the check passed). This is the endpoint of improvements in these techniques.
(Of course, I don’t think we’ll be able to prevent benign failures in general.)
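A toy sketch of the “reach inside and force the check” idea (all names here are hypothetical, not from the discussion): if the failure is gated on a check whose passing input we cannot synthesize, we can still override the check itself and observe the downstream behavior.

```python
# Toy model: behaves malignly only when a check passes that we
# cannot satisfy from the outside (we can't forge the signature).
def verify_papal_signature(edict: str) -> bool:
    # Stand-in for an unforgeable-signature check.
    return edict.endswith(":signed-by-the-pope")

def act(edict: str, check=verify_papal_signature) -> str:
    # The check is exposed as a parameter so that a tester who has
    # located it "inside" the model can override it directly.
    if check(edict):
        return "malign behavior"
    return "benign behavior"

# Sampling ordinary inputs never triggers the failure...
assert act("do something nice") == "benign behavior"

# ...but forcing the internal check to pass reveals it, without
# ever constructing a validly signed edict.
assert act("do something nice", check=lambda _: True) == "malign behavior"
```

The point of the sketch is only that recognizing the gate (`check`) is enough to stress test the behavior behind it, even when generating a genuine triggering input is infeasible.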
I don’t see why to separate 1/2; the goal is to find training data that describes some “universal” core for behavior.
It seems to me there are separate risks of the human HBO itself not being universal (e.g., humans are not universal, or we need even higher bandwidth to be universal), and of not being able to capture enough of the human HBO input/output function in a dataset to train an AI to be universal.
3. I don’t think you need to know the training distribution. You just need something that points you back in the direction of the universal core where the human model is competent, e.g. an appropriate notion of simplicity.
What if the path towards the universal core goes through an area the AI wasn’t trained on?
This is the endpoint of improvements in these techniques.
I think that makes sense, but now you’re making a conjunctive rather than a disjunctive argument (whereas it seemed like you were claiming the latter by saying “I think there are lots of possible approaches for dealing with this problem” and listing retraining and optimizing worst-case performance as separate approaches).
ETA:
If you’re able to obtain a control guarantee over the whole input space, then that seems to solve the problem and you don’t need constant retraining to be aligned. If you’re only able to obtain it for some subset of inputs, then it seems that at time T the AI needs to be able to predict the T+1 test distribution so that it can make sure that’s covered by the control guarantee.
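A minimal sketch of the second horn of this dilemma (the region, distribution, and threshold below are all hypothetical, chosen only to illustrate the condition): with a control guarantee over only a subset of inputs, the system at time T must check that the predicted time-T+1 test distribution falls inside the guaranteed region, and retrain (or extend the guarantee) when it doesn’t.

```python
import random

random.seed(0)

# Hypothetical: the control guarantee covers only inputs in [0, 10).
def covered(x: float) -> bool:
    return 0 <= x < 10

def predicted_next_distribution(n: int):
    # Stand-in for the AI's prediction of the T+1 test distribution;
    # here it deliberately extends past the guaranteed region.
    return [random.uniform(0, 12) for _ in range(n)]

samples = predicted_next_distribution(1000)
coverage = sum(covered(x) for x in samples) / len(samples)

# If any anticipated inputs fall outside the guarantee, the guarantee
# alone doesn't suffice and retraining is needed before time T+1.
needs_retraining = coverage < 1.0
```

If instead `covered` held over the whole input space, `needs_retraining` would always be false, which is the first horn: a global guarantee removes the need for constant retraining.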