You aren’t misunderstanding the result; what you say makes complete sense. Now I understand your questions.
“Max out the regret during training” is extremely infeasible—for example, you’d need to know the exact regret bound (which we don’t), and much more importantly you’d need to do perfect adversarial training to find every possible mistake the model makes (which we can’t).
I’m just hoping to directly use the bound on the total loss.
That is, suppose that I’m trying to apply Amplify(A) to some “not too hard” task. Then I want to argue something like:
1. If we’ve built Amplify() correctly, then in order for Amplify(A) to succeed at a “not too hard” task, A just needs to solve some “not too hard” subtasks. (This is the main constraint on Amplify() other than preserving alignment; without this constraint, building Amplify() would be relatively easy.)
2. Because those subtasks are “not too hard,” we think that A has a regret bound vs. some model that solves those tasks well (on average). Roughly speaking, this is the definition of “not too hard”; more on this below.
3. So if we keep applying Amplify(A) to the task, and training A on the resulting subtasks, then eventually A will solve those tasks well (on average). At that point, Amplify(A) succeeds at the original task (on average).
4. This argument goes through even if the task is changing over time, as long as the “not too hard” bound applies uniformly over the sequence (such that there is a single model in step #2 that succeeds for every element in the sequence).
In the original example we were discussing, I’m applying Amplify(A) to the task {test whether model B is robust}. When I talk about robustness as a subproblem of alignment, I mean precisely setting up the task {test whether model B is robust} in such a way that it’s “not too hard,” i.e. such that we can apply the above argument to the task {test whether model B is robust}.
Performing well on average is adequate for this task because I can keep applying Amplify(A) to the task {test whether model B is robust}, over and over again. If it fails 10% of the time, that just means 10% of the tests fail, not that model B can fail catastrophically on 10% of inputs or with 10% probability. (This is the magic we want out of a solution to robustness—to turn an average-case guarantee into a worst-case guarantee.)
And I thought one possible interpretation of “do X” is “approximates X well over all possible inputs”. If that’s not what you meant by “do X”, what does it mean?
The X comes with some performance measure. I mean that the model performs well enough on average over the distribution of subtasks.
(This still depends on the distribution of top-level tasks. For now you can imagine imposing this requirement for the worst-case distribution of top-level tasks that can occur during training, though I think we can be a little bit more precise in practice.)
In practice we actually care about performance relative to the ground truth, so in addition to the kinds of regret bounds we see in that OL survey paper, we also need to characterize the performance of the hypothesis class relative to the ground truth, right?
Yes, we need to argue that there is some hypothesis in the class that is able to perform well. This is what I mean by “not too hard.” Ultimately we will be assuming that our ML is able to do something impactful in the world, and then trying to argue that if it was able to do that impactful thing, then we could also solve the subtasks necessary to do the same impactful thing safely (since the goal of this approach is to compete with unaligned ML).
In order to argue that a task is not that hard, we will use some combination of:
The assumption that our ML is good enough at some concrete tasks whose difficulty doesn’t scale with the sophistication of the models we are training. This can be verified empirically in advance, and once it’s true it tends to become more true.
The assumption that our ML system is good enough at doing something impactful in the world. (We’ll combine this with some argument that the tasks we are interested in are not harder than that task, or are “not much harder” in the sense that the difference falls into the preceding category of “easy” tasks that we’ve demonstrated empirically our model can solve.)
Overall, things are not really going to be this formal, but hopefully this helps communicate the direction of my thinking/hopes.
“Max out the regret during training” is extremely infeasible
If you can’t max out the regret during training, then I’m having trouble seeing how to make use of such a regret bound the way you want to use it. Let’s say the current total regret is 1000 less than the provable regret bound. Then as far as the provable bound is concerned, the model could answer the next 1000 questions incorrectly and still satisfy the bound, so you can’t just repeat a task some number of times and then conclude that there’s a small probability that all of the answers will be wrong.
If A satisfies a regret bound, then:
For any sequence of queries S, and any model M in the comparison class...
(average performance of A on S) > (average performance of M on S) - (arbitrarily small constant)
This gives you a lower bound on (average performance of A on S). That’s what I want.
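To make the shape of this bound concrete, here is a minimal sketch using Hedge (exponential weights), a standard online-learning algorithm with a known regret bound. This is only an illustration, not the training procedure under discussion: the comparison class, the error rates, and the names `hedge` and `run_demo` are all hypothetical. It is phrased in terms of loss rather than performance, so the inequality flips: (average loss of A) <= (average loss of best M) + (a term that shrinks as the sequence gets longer).

```python
import math
import random


def hedge(loss_rows, eta):
    """Run Hedge over a sequence of per-model loss vectors.

    loss_rows[t][i] is the loss in [0, 1] of comparison model i on query t.
    Returns the learner's cumulative expected loss.
    """
    n = len(loss_rows[0])
    log_w = [0.0] * n  # log-weights, kept in log space for numerical stability
    total = 0.0
    for losses in loss_rows:
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        z = sum(w)
        probs = [x / z for x in w]
        # Expected loss of the learner when it follows model i with prob probs[i].
        total += sum(p * l for p, l in zip(probs, losses))
        # Exponential-weights update: penalize each model by its loss.
        log_w = [lw - eta * l for lw, l in zip(log_w, losses)]
    return total


def run_demo(T=2000, n_models=10, seed=0):
    """Return (avg loss of A, avg loss of best M, avg regret bound)."""
    rng = random.Random(seed)
    # Hypothetical comparison class: model 0 errs 5% of the time, the rest 50%.
    error_rates = [0.05] + [0.5] * (n_models - 1)
    loss_rows = [
        [1.0 if rng.random() < r else 0.0 for r in error_rates]
        for _ in range(T)
    ]
    # Standard tuning; gives cumulative regret <= sqrt(T * ln(N) / 2).
    eta = math.sqrt(8 * math.log(n_models) / T)
    learner = hedge(loss_rows, eta)
    best = min(sum(row[i] for row in loss_rows) for i in range(n_models))
    bound = math.sqrt(T * math.log(n_models) / 2)
    return learner / T, best / T, bound / T


if __name__ == "__main__":
    avg_a, avg_m, avg_bound = run_demo()
    print(f"avg loss of A: {avg_a:.4f}")
    print(f"avg loss of best M: {avg_m:.4f}")
    print(f"avg regret bound: {avg_bound:.4f}")
```

The bound holds for any loss sequence, not just the random one generated here, and the per-query slack shrinks like 1/sqrt(T) in this setup. That is the sense in which the constant can be made small by running longer.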
Yes, you could get a bad result 1000 times in a row. To guarantee a good result in that setting, you’d need to run 1001 times (which will still probably be a tiny fraction of your overall training time).
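The “1001 runs” arithmetic can be sketched as follows, under assumptions that are mine rather than part of the argument above: each run is scored with 0/1 loss, and the guarantee takes the form (mistakes of A over n runs) <= (error rate of M) * n + (remaining regret slack). The function name `runs_needed` and its parameters are hypothetical.

```python
import math


def runs_needed(regret_slack, comparator_error_rate=0.0):
    """Smallest n such that A cannot fail on all n repeated runs.

    Assumes 0/1 loss per run and the guarantee
        mistakes_of_A <= comparator_error_rate * n + regret_slack.
    All n runs can fail only if n <= comparator_error_rate * n + regret_slack,
    so we need the smallest n with n > regret_slack / (1 - error_rate).
    """
    if regret_slack < 0:
        raise ValueError("regret_slack must be non-negative")
    if not 0.0 <= comparator_error_rate < 1.0:
        raise ValueError("comparator_error_rate must be in [0, 1)")
    return math.floor(regret_slack / (1.0 - comparator_error_rate)) + 1


if __name__ == "__main__":
    # Slack of 1000 against a comparison model that never errs on this task.
    print(runs_needed(1000))
    # Same slack against a comparison model that errs half the time.
    print(runs_needed(1000, 0.5))
```

With a slack of 1000 and a comparison model that never errs, this gives the 1001 runs mentioned above; a less reliable comparison model pushes the number up accordingly.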
What if during training you can’t come close to maxing out regret for the agents that have to be trained with human involvement? That “missing” regret might come due at any time after deployment, and has to be paid with additional oversight/feedback/training data in order for those agents to continue to perform well, right? (In other words, there could be a distributional shift that causes the agents to stop performing well without additional training.) But at that time human feedback may be horribly slow compared to how fast AIs think, thus forcing IDA to either not be competitive with other AIs or to press on without getting enough human feedback to ensure safety.
Am I misunderstanding anything here? (Are you perhaps assuming that we can max out regret during training for the agents that have to be trained with human involvement, but not necessarily for the higher level agents?)
That “missing” regret might come due at any time after deployment, and has to be paid with additional oversight/feedback/training data in order for those agents to continue to perform well, right? (In other words, there could be a distributional shift that causes the agents to stop performing well without additional training.)
Yes. (This is true for any ML system, though for an unaligned system the new training data can just come from the world itself.)
Are you perhaps assuming that we can max out regret during training for the agents that have to be trained with human involvement, but not necessarily for the higher level agents?
Yeah, I’m relatively optimistic that it’s possible to learn enough from humans that the lower level agent remains universal (+ aligned etc.) on arbitrary distributions. This would probably be the case if you managed to consistently break queries down into simpler pieces until arriving at very simple queries. And of course it would also be the case if you could eliminate the human from the process altogether.
Failing either of those, it’s not clear whether we can do anything formally (vs. expanding the training distribution to cover the kinds of things that look like they might happen, having the human tasks be pretty abstract and independent from details of the situation that change, etc.). I’d still expect to be OK, but we’d need to think about it more.
(I still think it’s 50%+ that we can reduce the human to small queries or eliminate them altogether, assuming that iterated amplification works at all, so I would prefer to start with the “does iterated amplification work at all” question.)