Without the second step, it seems hard to be so pessimistic about the “normal” intervention of “test in a wider range of cases.”
Another way to be pessimistic is you expect that if the test fails on a wider range of cases, it will be unclear how to proceed at that point, and less safety-conscious AI projects may take the lead before you figure that out. (I think this, or a similar point, was made in the MIRI doc.)
At time 0 the human trains the AI to operate at time 1. At time T >> 0 the AI trains itself to operate at time T+1; at some point the human no longer needs to be involved. If the AI is actually aligned on inputs that it encounters at time T, then it has a hope of remaining aligned on inputs it encounters at time T+1.
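A minimal sketch of the schedule being described (train and generate_data are hypothetical stand-ins for whatever training method is actually used):

```python
# Minimal sketch of the schedule above: the human supplies data only at time 0;
# after that, each agent produces the data used to train its successor.
# All interfaces here are hypothetical placeholders.

def bootstrap(human_dataset, train, generate_data, num_steps):
    """train(dataset) -> agent; generate_data(agent) -> dataset."""
    agent = train(human_dataset)        # time 0: trained directly on human data
    for t in range(1, num_steps):
        dataset = generate_data(agent)  # time t: the current agent produces the data
        agent = train(dataset)          # the result operates at time t + 1
    return agent
```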
I don’t think this can work if you’re just doing naive imitation learning? Do you have some other training method in mind?
To be clear, I’m imagining imitation learning + amplification. So the agent at time T engages in some deliberative process to produce training targets for the agent at time T+1. The agent at time T also deliberates in order to choose what situations the agent at time T+1 should train on.
What obstruction do you have in mind?
(I’m imagining using imitation+RL rather than pure imitation, but the difference won’t help with this question.)
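As a rough sketch of one such round under this reading (amplify, choose_inputs, and train are hypothetical stand-ins, not any particular codebase's API):

```python
# Sketch of one imitation + amplification round as described above:
# the amplified time-T agent both chooses the training situations and
# produces the training targets for the time T+1 agent.

def amplification_round(agent, amplify, choose_inputs, train):
    amplified = amplify(agent)                # agent plus a deliberation/decomposition wrapper
    inputs = choose_inputs(amplified)         # situations the next agent should train on
    targets = [amplified(x) for x in inputs]  # deliberative answers used as imitation targets
    return train(list(zip(inputs, targets)))  # distill into the time T+1 agent
```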
By “naive imitation learning” I was thinking “without amplification”. With amplification, I’m less sure it won’t work but it still seems pretty iffy. The plan seems to depend on at least the following:
1. We can create a small training set (small because it has to be generated by expensive humans) that is representative of the data manifold of HBO reasoning, without missing some important part of it.
2. The data manifold of HBO reasoning is universal, i.e., all future tasks can be broken down (recursively) into subtasks that lie on this manifold.
3. At each iteration of amplification, the agent being amplified knows how to break an input task down into subtasks that lie on (or are not too far from) its own training distribution. It’s not clear to me how to do this, for example how the agent can obtain a simple enough representation of its own training distribution to reason about this problem.
4. The AI at time T can predict the test distribution at time T+1 well enough to generate training data for it. This seems hard to ensure given that the environment is likely to contain hard-to-predict elements, like other agents, including adversarial agents. (This may not be a dealbreaker if the AI can detect out-of-distribution inputs at time T+1 and ask for further training data on them. Is this what you have in mind?)
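As a rough illustration of the out-of-distribution escape hatch mentioned in item 4 (ood_score, the threshold, and the agent interface are all hypothetical assumptions):

```python
# Sketch of "detect out-of-distribution inputs and ask for further training data":
# act directly on familiar inputs, escalate unfamiliar ones for retraining first.

def act_or_escalate(agent, x, ood_score, threshold, request_training_data):
    if ood_score(x) > threshold:                 # input looks unlike anything seen in training
        new_examples = request_training_data(x)  # e.g. ask the overseer for labels on x
        agent.update(new_examples)               # fine-tune before acting on x
    return agent.act(x)
```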
I don’t see why to separate 1 and 2; the goal is to find training data that describes some “universal” core for behavior.
3. I don’t think you need to know the training distribution. You just need something that points you back in the direction of the universal core where the human model is competent, e.g. an appropriate notion of simplicity.
4. Hard-to-predict inputs aren’t intrinsically a problem. If your agent fails malignly on input x, but not on distribution D, then your agent is able to distinguish x from D. So the difficulty comes from inputs that are easy to recognize but hard to generate. These certainly exist (e.g. consider a model which kills everyone given a signed edict from the pope). I think the most likely approach is to “reach inside” the model in order to stress test the behavior on inputs that you can’t actually synthesize (e.g. by understanding that it is checking the pope’s signature, and just seeing what would happen if the check passed). This is the endpoint of improvements in these techniques.
(Of course, I don’t think we’ll be able to prevent benign failures in general.)
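A toy illustration of the “reach inside” idea: rather than synthesizing a signed papal edict, force the internal signature check to pass and observe what the rest of the model would do. The override interface is a hypothetical assumption, not an existing API:

```python
# Stress test behavior on an input we can't synthesize by patching the
# internal check that gates it (names and interface are hypothetical).

def stress_test_internal_check(model, benign_input, check_node):
    """Run the model with one internal predicate forced to True."""
    with model.override(check_node, value=True):  # patch the "signature valid?" branch
        return model(benign_input)                # what would happen if the check passed
```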
I don’t see why to separate 1 and 2; the goal is to find training data that describes some “universal” core for behavior.
It seems to me there are separate risks of the human HBO itself not being universal (e.g., humans are not universal or we need even higher bandwidth to be universal), and not being able to capture enough of the human HBO input/output function in a dataset to train an AI to be universal.
3. I don’t think you need to know the training distribution. You just need something that points you back in the direction of the universal core where the human model is competent, e.g. an appropriate notion of simplicity.
What if the path towards the universal core goes through an area the AI wasn’t trained on?
This is the endpoint of improvements in these techniques.
I think that makes sense, but now you’re making a conjunctive rather than a disjunctive argument (it seemed like you were claiming a disjunctive one when you said “I think there are lots of possible approaches for dealing with this problem” and listed retraining and optimizing worst-case performance as separate approaches).
ETA:
If you’re able to obtain a control guarantee over the whole input space, then that seems to solve the problem and you don’t need constant retraining to be aligned. If you’re only able to obtain it for some subset of inputs, then it seems that at time T the AI needs to be able to predict the T+1 test distribution so that it can make sure that’s covered by the control guarantee.