I took a look at the debate papers. I think that’s a good angle to take, but they’re missing some factors that sometimes make debates between humans fail.
Humans and neural networks both have some implicit representation of probability distributions over output types. The basis behind “I can’t explain why but that seems unlikely” can be more accurate than “here’s an argument for why that will happen”. You’re basically delegating the problem of “making AI thinking explainable” to the AI itself, but if you could do that, you could just... make neural networks explainable, perhaps by asking an AI what another AI is doing and doing RLHF on the response. But that doesn’t seem to work in general. In other words, the problem is that agents restricted to the arguments they can produce are weaker than NNs that don’t have to route their reasoning through explicit arguments.
Reasoning about probability distributions means argument branches can take the form “X is a likely type of thing” vs “X is rare”. And empirically checking the distribution can be too expensive. That makes the debate framework work less well.
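To illustrate the cost concern with a toy example (the code and numbers here are purely illustrative, and the function names are made up): distinguishing “X happens with probability around 1e-6” from “X never happens” by direct sampling takes on the order of a million samples.

```python
import random

def estimate_rarity(sample_event, n_samples):
    """Naive Monte Carlo check of a rarity claim: draw samples and count
    hits. Distinguishing p ~ 1e-6 from p = 0 needs on the order of 1/p
    samples, so direct empirical checking quickly becomes too expensive."""
    hits = sum(sample_event() for _ in range(n_samples))
    return hits / n_samples

# With only 1,000 samples, an event with true probability 1e-6 is almost
# never observed, so the estimate is uninformatively 0.
print(estimate_rarity(lambda: random.random() < 1e-6, 1_000))
```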
Strongly agree on the first challenge; on the theory workstream we’re thinking about how to deal with this problem. Some past work (not from us) is here and here.
Though to be clear, I don’t think the empirical evidence clearly rules out “just making neural networks explainable”. Imo, if you wanted to do that, you would do things in the style of debate and prover-verifier games. These ideas just haven’t been tried very much yet. I don’t think “asking an AI what another AI is doing and doing RLHF on the response” is nearly as good; that is much more likely to lead to persuasive explanations that aren’t correct.
I’m not that compelled by the second challenge yet (though I’m not sure I understand what you mean). My main question here is how the AI system knows that X is likely or that X is rare, and why it can’t just explain that to the judge. E.g. if I want to argue that it is rare to find snow in Africa, I would point to weather data I can find online, or point to the fact that Africa is mostly near the Equator, I wouldn’t try to go to different randomly sampled locations and times in Africa and measure whether or not I found snow there.
To clarify the 2nd point, here’s an example. Suppose someone presents you with a large box that supposedly produces electricity endlessly. Your boss thinks it works, and you’re debating the inventor in front of your boss.
“Perpetual motion machines are known to be impossible” you say, but your boss isn’t familiar with that conceptual class or the reasoning involved.
The inventor says, “Here, let’s plug in a thing, we can see that the box does in fact produce a little electricity.” Your boss finds this very convincing.
The process proposed in the paper is something like, “let’s randomly sample every possible machine to see if it does perpetual motion”. So the inventor points to the sun and says, “that thing has been making energy continuously and never stops for as long as we’ve been able to tell”. They point to some stars and say the same thing.
The sampling and evaluation is dependent on a conceptual framework that isn’t agreed on, and waiting for the sun and stars to burn out isn’t very practical.
There are several different outs to this example:
You should at least be able to argue that the evidence does not support the conclusion, and that the boss should have substantial probability on “the box can make some electricity but not infinitely much”.
You can recursively decompose the claim “perpetual motion machines are known to be impossible” until you get down to a claim like “such and such experiment should have such and such outcome”, which the boss can then perform to determine a winner.
This does not mean that the boss then understands why perpetual motion machines are impossible; an important aspect of debate is that it aims to produce good oversight of claims without giving the judge an understanding of those claims.
This particular approach will likely run into the problem of obfuscated arguments though.
The debaters are meant to be copies of the same AI, and to receive exactly the same information, with the hope that each knows what the other knows. In the example, this hopefully means that you understand how the inventor is tricking your boss, and you can simply point it out and explain it.
If the inventor legitimately believes the box produces infinite electricity, this won’t work, but also I consider that out of scope for what debate needs to do. We’re in the business of getting the best answer given the AI’s knowledge, not the true answer.
If both you and the inventor know that the claim is impossible from theory, but don’t know the local error that the inventor made, this won’t work.
You can cross-examine the inventor and show that in other contexts they would agree that perpetual energy machines are impossible. (Roughly speaking, cross-examination = wiping memory and asking a new question.)
The process proposed in the paper
Which paper are you referring to? If you mean doubly efficient debate, then I believe the way doubly efficient debate would be applied here is to argue about what the boss would conclude if he thought about it for a long time.
You can recursively decompose the claim “perpetual motion machines are known to be impossible” until you get down to a claim like “such and such experiment should have such and such outcome”, which the boss can then perform to determine a winner.
Ah, I don’t think you can. Making that kind of abstract conclusion from a practical number of experiments requires abstractions like potential energy, entropy, Noether’s theorem, etc—which in this example, the judge doesn’t understand. (Without such abstractions, you’d need to consider every possible type of machine separately, which isn’t feasible.) This seems like a core of our disagreement here.
You can cross-examine the inventor and show that in other contexts they would agree that perpetual energy machines are impossible.
The debaters are the same AI with different contexts, so the same is true of both debaters. Am I missing something here?
Which paper are you referring to? If you mean doubly efficient debate
Yes, “doubly efficient debate”.
Making that kind of abstract conclusion from a practical number of experiments requires abstractions like potential energy, entropy, Noether’s theorem, etc—which in this example, the judge doesn’t understand. (Without such abstractions, you’d need to consider every possible type of machine separately, which isn’t feasible.)
I agree, but I don’t see why that matters. As I mentioned, a main point of debate is to produce good oversight of claims without giving the judge an understanding of those claims. In this example I would imagine that you decompose the argument as:
1. A fundamental law of physics is conservation of energy: energy can neither be created nor destroyed, only transformed from one form to another.
2. Electricity is a form of energy.
3. This box does not have an infinite source of energy.
4. The above three together imply that the box cannot produce infinite electricity.
The inventor can disagree with one or more of these claims, then we sample one of the disagreements, and continue debating that one alone, ignoring all the others. This doesn’t mean the judge understands the other claims, just that the judge isn’t addressing them when deciding who wins the overall debate.
If we recurse on #1, which I expect you think is the hardest one, then you could have a decomposition like “the principle has been tested many times”, “in the tests, confirming evidence outweighs the disconfirming evidence”, “there is an overwhelming scientific consensus behind it”, “there is significant a priori theoretical support” (assuming that’s true), “given the above the reasonable conclusion is to have very high confidence in conservation of energy”. Again, find disagreements, sample one, recurse. It seems quite plausible to me that you get down to something fairly concrete relatively quickly.
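To make the recursion concrete, here is a rough sketch in Python; the debater and judge interfaces are hypothetical stand-ins, not an implementation from any of the debate papers.

```python
import random

def debate(claim, debater_a, debater_b, judge, depth=0, max_depth=10):
    """Sketch of recursive debate: decompose a claim, find a disagreement,
    and recurse on it alone. Assumes judge.decide returns "A wins" or
    "B wins" on claims concrete enough for the judge to check directly."""
    if depth >= max_depth:
        # Base case: a claim like "such-and-such experiment has
        # such-and-such outcome", which the judge can settle itself.
        return judge.decide(claim)

    # A (arguing for the claim) proposes subclaims that jointly imply it.
    subclaims = debater_a.decompose(claim)

    # B flags the subclaims it disputes.
    disputed = [s for s in subclaims if debater_b.disputes(s)]
    if not disputed:
        # B accepts every subclaim, so it concedes the parent claim.
        return "A wins"

    # Sample one disagreement and recurse on it, ignoring all the others.
    return debate(random.choice(disputed), debater_a, debater_b, judge,
                  depth + 1, max_depth)
```

The key property is that the judge only ever evaluates the single leaf claim the recursion bottoms out in, never the whole tree.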
If you want to disallow appeals to authority, on the basis that the correct analogy is to superhuman AIs that know tons of stuff that aren’t accepted by any authorities the judge trusts, I still think it’s probably doable with a larger debate, but it’s harder for me to play out what the debate would look like because I don’t know in enough concrete detail the specific reasons why we believe conservation of energy to be true. I might also disagree that we should be thinking about such big gaps between AI and the judge, but that’s not central.
The debaters are the same AI with different contexts, so the same is true of both debaters. Am I missing something here?
That seems right, but why is it a problem?
The honest strategy is fine under cross-examination: it will give consistent answers across contexts. Only the dishonest strategy will change its answers (sometimes saying that perpetual energy machines are impossible, sometimes saying that they are possible).
I do, but more importantly, I want to disallow the judge understanding all the concepts here. Suppose the judge responds to #1 with “What is energy?” or “What is conservation?” and it can’t be explained to them: what then?
Also, argument 1 isn’t actually correct: E=mc^2 and so on.
That seems right, but why is it a problem? The honest strategy is fine under cross-examination: it will give consistent answers across contexts.
“The honest strategy”? If you have that, you can just ask it and not bother with the debate. If the problem is distinguishing it, and only dishonest actors are changing their answers based on the provided situation, you can just use that info. But why are you assuming you have an “honest strategy” available here?
I do, but more importantly, I want to disallow the judge understanding all the concepts here.
I think I don’t actually care about being robust to this assumption. Generally I think of arbitrarily-scalable-debate as depending on a universality assumption (which in turn would rule out “the judge can never understand the concepts”). But even if the universality assumption is false, it wouldn’t bother me much; I don’t expect such a huge gap between debaters and judges that the judge simply can’t understand the debaters’ concepts, even given arbitrary amounts of time and arbitrary amounts of explanation from the debaters. (Importantly, I would want to bootstrap alignment, to keep the gaps between debaters and the judge relatively small.)
“The honest strategy”? If you have that, you can just ask it and not bother with the debate. If the problem is distinguishing it, and only dishonest actors are changing their answers based on the provided situation, you can just use that info. But why are you assuming you have an “honest strategy” available here?
The general structure of a debate theorem is: if you set up the game in such-and-such way, then a strategy that simply answers honestly will dominate any other strategy.
So in this particular case I am saying: if you penalize debaters that are inconsistent under cross-examination, you are giving an advantage to any debater that implements an honest strategy, and so you should expect training to incentivize honesty.
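As a toy illustration of that incentive (the reward shaping below is my own illustrative choice, not a protocol from the papers):

```python
def debater_reward(judge_score, crossexam_answers, consistency_penalty=1.0):
    """Toy reward: the judge's score for this debater, minus a penalty for
    every cross-examination question it answered inconsistently across
    (memory-wiped) contexts. Names and weights are illustrative only.
    crossexam_answers maps each question to the list of answers the debater
    gave when asked in different contexts."""
    inconsistencies = sum(
        1 for answers in crossexam_answers.values() if len(set(answers)) > 1
    )
    return judge_score - consistency_penalty * inconsistencies
```

An honest strategy pays no penalty, since its answers don’t depend on which context it is asked in; a dishonest strategy either pays the penalty or has to commit to a consistent false story, which its opponent (a copy with the same knowledge) can attack.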
I don’t expect such a huge gap between debaters and judges that the judge simply can’t understand the debaters’ concepts
You don’t? But this is a major problem in arguments between people. The variation within humans is already more than enough for this! There’s a gap like that every 35 IQ points or so. I don’t understand why you’re confident this isn’t an issue.
I guess we’ve found our main disagreement, at least?
So in this particular case I am saying: if you penalize debaters that are inconsistent under cross-examination, you are giving an advantage to any debater that implements an honest strategy, and so you should expect training to incentivize honesty.
Now you’re training for multiple objectives:
1. You want the debater AI to argue for proposition A or not-A according to its role and convince human judges of that.
2. You want it to not change its position on sub-arguments.
But (2) is ill-defined. Can sub-arguments be combined for less weighting? Are they all worth the same? What if you have several sub-arguments that all depend on a single sub-sub-argument? Good arguments for A or not-A should have lots of disagreements. Or do you want to train an AI that makes all the same sub-arguments for A or not-A and then says “this implies A / not-A”? I don’t think this works.
In response to the linked “HCH” post:
Yes, an agent past some threshold can theoretically make a more-intelligent agent. But that doesn’t say anything about alignment; the supposed “question-answering machine” would be subject to instrumental convergence and mesaoptimizer issues, and you’d get value drift with each HCH stage, just as you would with RSI schemes.
@bhauth @Rohin Shah I think that bhauth has an important point here about the danger of large gaps between judge and debaters. Similarly, between a trusted overseer and a smarter worker. Keeping the gaps small is really important for a lot of oversight plans to work out well!
Here’s some research I am doing which I think answers this point thoroughly: it is possible to smoothly, continuously, incrementally scale down the capabilities of a model by injecting carefully controlled amounts of noise into its activations. I’m calling this ‘noise injection impairment’.
This removes the need to create a whole series of models with precise capability steps between each one. You can instead train a single strong model and scale it all the way down to be just a tiny step above the next-strongest model. Then you create as large a number of intermediate steps of capability as you need by reducing the noise magnitude.
Without this technique, I believe bhauth’s point would stand, and capability gaps between model versions would lead to dangerous failures of various control and monitoring schemes.
Link to details of ongoing research: https://www.apartresearch.com/project/sandbag-detection-through-model-degradation
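Here is a minimal sketch of the kind of thing I mean, assuming a PyTorch model; the hook-based implementation and the noise scales are illustrative rather than the exact setup used in the linked project.

```python
import torch

def add_noise_hooks(model, noise_scale):
    """Register forward hooks that add Gaussian noise to each leaf module's
    activations. noise_scale = 0 recovers the original model; increasing it
    smoothly degrades capability. Illustrative sketch only."""
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor):
            return output + noise_scale * torch.randn_like(output)
        return output

    handles = []
    for module in model.modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_hook(hook))
    return handles  # call h.remove() on each handle to restore the model

# Usage sketch: sweep noise_scale to get a continuum of capability levels
# ("model" and "evaluate" are hypothetical stand-ins).
# for scale in [0.0, 0.01, 0.02, 0.05, 0.1]:
#     handles = add_noise_hooks(model, scale)
#     evaluate(model)
#     for h in handles:
#         h.remove()
```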
I think the basic idea of using more steps of smaller size is worth considering. Maybe it reduces overall drift, but I suspect it doesn’t, because my view is:
Models have many basins of attraction for sub-elements. As model capability increases continuously, there are nearly-discrete points where aspects of the model jump from one basin to another, perhaps with cascading effects. I expect this to produce large gaps from small changes to models.
I’m not going to repeat all of the literature on debate here, but as brief pointers:
Factored cognition discusses intuitively why we can hope to approximate exponentially-sized trees of arguments (which would be tremendously bigger than arguments between people)
AI safety via debate makes the same argument for debate (by showing that a polynomial-time judge can supervise PSPACE; PSPACE-complete problems typically involve exponential-sized trees)
Cross-examination is discussed here
This paper discusses the experiments you’d do to figure out what the human judge should be doing to make debate more effective
The comments on this post discuss several reasons not to anchor to human institutions. There are even more reasons not to anchor to disagreements between people, but with a short search I didn’t find a place where they’ve been written up. Most centrally, disagreements between people tend to focus on getting each person to understand the other’s position, but the theoretical story for debate does not require this.
(Also, the “arbitrary amounts of time and arbitrary amounts of explanation” was pretty central to my claim; human disagreements are way more bounded than that.)
The scope of our argument seems to have grown beyond what a single comment thread is suitable for.
AI safety via debate is 2 years before Writeup: Progress on AI Safety via Debate, so the latter post should be more up-to-date. I think that post does a good job of considering potential problems; the issue is that I think the noted problems & assumptions can’t be handled well, make that approach very limited in what it can do for alignment, and aren’t really dealt with by “Doubly-efficient debate”. I don’t think such debate protocols are totally useless, but they’re certainly not a “solution to alignment”.