Ah, I wasn’t aware of that document! Very helpful. The section just before the one you linked seems quite relevant to my overall concerns, pointing in the direction of “yeah, in practice human judges have a lot of trouble incentivising debaters to properly justify their claims and defend them from critiques”. The rest of the document also seems potentially relevant to my confusions.
However, as Vojta mentions, asking the debaters to provide answers simultaneously seems to alleviate my concern about the equilibrium only by exacerbating the problem of providing good feedback toward the end of training; particularly in a deep NN version where the two debaters are actually using the same NN, there needs to be some way to break the symmetry, preventing both players from selecting the same answer all the time.
The asymmetric version of that, where one player chooses first, has the problem I mentioned: we will tend to know that the second player is more likely lying. OTOH, if we attempted a more symmetric version, where the two players’ answers are somehow pushed apart without favoring one or the other of them, then both players are probably lying (since you have to push them both away from the best answer). So I don’t see a viable way of symmetrizing responses for free-choice questions.
I like Vanessa’s proposal of restricting to multiple-choice questions rather than free-response questions, and pre-assigning debaters to specific positions.
there needs to be some way to break the symmetry, preventing both players from selecting the same answer all the time.
You can just rejection sample—if both players give the same answer, just resample the answers / move on to a new question.
“Same answer” can be evaluated by a human, or by an automated model.
If rejection sampling is extremely inefficient (almost all answers are the same) then it seems like you’re probably done with training. But if you really wanted to continue, you can probably importance sample in order to ensure different answers, as long as you can evaluate the original probability of any given answer.
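To make the two options concrete, here is a minimal Python sketch. It assumes each debater's answer distribution is available as an explicit dictionary of probabilities, which is a toy stand-in for a real policy; the function names and the example distribution are hypothetical. The importance-sampling variant draws the second answer from the policy restricted to answers other than the first, and returns a weight that corrects back toward the original policy conditioned on the two answers differing.

```python
import random

def sample_answer(probs, rng):
    """Draw one answer from a dict {answer: probability}."""
    answers = list(probs)
    weights = [probs[a] for a in answers]
    return rng.choices(answers, weights=weights, k=1)[0]

def rejection_sample_pair(probs, rng, max_tries=1000):
    """Resample both answers until they differ. Returns (a, b, tries)."""
    for tries in range(1, max_tries + 1):
        a = sample_answer(probs, rng)
        b = sample_answer(probs, rng)
        # Plain equality stands in for the "same answer" check, which could
        # instead be done by a human or an automated model.
        if a != b:
            return a, b, tries
    raise RuntimeError("answers almost always agree; rejection sampling is too inefficient")

def importance_sample_pair(probs, rng):
    """Force b != a, and return (a, b, weight).

    Proposal: a ~ p, then b ~ p restricted to answers other than a.
    Target:   the original joint p(a) * p(b), conditioned on a != b.
    The ratio target / proposal works out to (1 - p(a)) / P(a != b).
    """
    a = sample_answer(probs, rng)
    rest = {x: p for x, p in probs.items() if x != a and p > 0}
    if not rest:
        raise ValueError("the policy puts all its mass on a single answer")
    b = sample_answer(rest, rng)  # rng.choices accepts unnormalized weights
    p_differ = 1.0 - sum(p * p for p in probs.values())
    weight = (1.0 - probs[a]) / p_differ
    return a, b, weight
```

Note that the weight only requires the original probability of the sampled answer, matching the condition stated above.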
Resampling.
If rejection sampling samples new questions as well as new answers, then we still get no training signal on cases where debaters agree. This seems potentially concerning, since agreement isn’t a great signal that things have actually converged to good answers.
Part of the concern is that, since both debaters are actually the same NN, you get the same answer early, not only as you converge to good answers. It’s not clear that this can be solved just by adding randomness to the NNs, because the NNs might train to eliminate the influence of the noise fairly early—after all, the noise isn’t providing useful information for them.
Rejection sampling could dwell on the same question, only resampling answers. This seems like a better solution. But it might still imply a whole lot of resampling. Again, even if there isn’t convergence to a really good answer yet, the gradient might be away from randomness, because (a) the right answer is too rare, (b) the vast majority of mutated answers are far worse than the noise-free answer, (c) the AI isn’t yet any good at defending the right answer from attacks.
Symmetrizing for the judge, not the AIs.
Another potential solution is to keep the asymmetry of “one AI decides on an answer first; then the second AI selects an answer”—but hide who is who from the judge. Present the AIs symmetrically, giving away no information about who was “first”. This mostly solves my problem.
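A minimal sketch of what hiding the order from the judge could look like, assuming the debate is reduced to two answer strings (the `Transcript` type, the neutral labels, and the function name are all hypothetical). The answers are shown under neutral labels in random order, and the mapping back to first/second is kept away from the judge.

```python
import random
from dataclasses import dataclass

@dataclass
class Transcript:
    first_answer: str   # chosen without seeing the other answer
    second_answer: str  # chosen after seeing first_answer

def present_to_judge(transcript, rng):
    """Return (judge_view, key).

    judge_view maps neutral labels to answers in a random order.
    key maps the same labels back to "first"/"second" and is withheld
    from the judge.
    """
    entries = [("first", transcript.first_answer),
               ("second", transcript.second_answer)]
    rng.shuffle(entries)
    labels = ["Debater X", "Debater Y"]
    judge_view = {label: answer for label, (_, answer) in zip(labels, entries)}
    key = {label: role for label, (role, _) in zip(labels, entries)}
    return judge_view, key
```

This only hides the ordering at the presentation level; it does nothing about the debaters themselves arguing over who went first.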
However, one might worry that AIs would find some way to argue about who was really first vs second. Perhaps there are some signs which may indicate that a particular answer was second—“Your answer is the sort of answer which someone would choose after seeing my answer, and trying to select something close”. If such cases could be persuasively made, then the judge should find it to be significant evidence.
This is especially plausible if the AIs are allowed to make psychological claims about each other, and perhaps even examine each other’s NN activations to back up such claims. In this case, AI #1 could show that AI #2 saw AI #1’s answer initially, when it was selecting its own answer, thus breaking the symmetry between the agents. Allowing this is plausible, because we might want that kind of inspection as a transparency tool. However, it completely ruins the mechanism I’m proposing here, since the mechanism depends on hiding information from the judge.
Use binary questions, and assign positions rather than allowing free-response answers.
This seems to basically address my concern. However, maybe binary questions are a bad idea. One critique I’ve been given (due to Misha Yagudin) is that apparently binary questions are never really so simple, due to interpretation. Too many complex questions end up having answers fitting categories like “it depends” and “both are wrong”. For one thing, this means you want free-response answers in order to get the really useful answers (which point out wrong or fuzzy assumptions in your question). For another thing, this means that if you do try to stick to binary questions, the debates will spiral into interpretational issues which are not so likely to be useful.
Answer different questions.
This proposal is due to Scott Garrabrant, and I think it resolves my issue completely. However, Scott had a very different motivation for the proposal, which I don’t expect to do justice to. I also expect the version I’ll write out may be a little simpler than Scott’s, missing some ideas which were important to him.
Why should the two debaters be forced to give different answers to the same question? Why not get them to answer two entirely different questions, and judge which of the two was more persuasive?
As you mentioned in the parent, it’s not really obvious in general whether two answers are “the same answer”. It’s also not obvious whether two answers are really mutually exclusive. The two answers might be in different ontologies, and very hard to compare. For example, a question about large-scale irrigation might get one answer in the ontology of engineering, which discusses questions of fluid mechanics and so on; and another in the ontology of economics, which discusses mechanism design, ethics of distribution of benefits of state-funded infrastructure, etc. It could be difficult to put these into common terms to decide whether they’re mutually exclusive. But maybe this is a fake difficulty; we don’t actually care whether they’re mutually exclusive, we just have to rank the two different sides for convincingness. This perspective allows us to sample entirely different questions for the two sides, eliminating the training-signal problem of overlapping answers.
The Feb 2020 proposal for symmetrizing debate which you cited mandates that both sides develop their own argument, and then attack weak points in the other’s argument. So you can think of this as a 4-player game:
Assertion A and assertion B are sampled. These may be totally unrelated, so the risk of A=B is very low.
AD, the defense of A, is a player with the job of making a constructive argument for A.
AC, the critic for A, is a player with the job of critiquing the argument for A.
BD defends B.
BC critiques the defense of B.
AD and BC are on a team; BD and AC are on a team. The judge decides which of the teams did better.
Otherwise, the proposal is much the same as the Feb 2020 proposal.
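The roles and teams above can be sketched as a small data structure. This is only an illustration: the `Player` type and function names are hypothetical, and the zero-sum +1/-1 reward is my assumption, since the description only says that the judge decides which team did better.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Player:
    name: str       # "AD", "AC", "BD", or "BC"
    assertion: str  # the assertion this player's arguments are about
    role: str       # "defend" or "critique"
    team: int       # 0 or 1

def make_players(assertion_a, assertion_b):
    """AD and BC form team 0; BD and AC form team 1."""
    return [
        Player("AD", assertion_a, "defend", team=0),
        Player("AC", assertion_a, "critique", team=1),
        Player("BD", assertion_b, "defend", team=1),
        Player("BC", assertion_b, "critique", team=0),
    ]

def rewards(players, winning_team):
    """Assumed zero-sum reward: +1 for the winning team, -1 for the other."""
    return {p.name: (1 if p.team == winning_team else -1) for p in players}
```

Each team contains one defender and one critic, and the two members of a team always work on different assertions.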
After training, how do we use this model to answer questions? After all, unlike the usual case, the AIs are not directly learning to give answers to questions. They are only learning to defend claims (and critique defenses of claims).
Scott’s proposal is that you then utilize the system by asking it to defend claims such as: “In order to build a rocket ship, I will follow the first suggestion you make in your defense of this claim.”, “You will convince me that the best answer to the question ‘what will the weather be like tomorrow’ is the first such answer you mention in your defense of this claim”, etc.
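As a small illustration of the wrapping step, here is a sketch that turns a question or a goal into a claim to be defended. The wording is taken from the two examples above; the function names are hypothetical, and nothing here is claimed to be Scott's own formulation.

```python
def answer_claim(question):
    """Wrap a question in a claim about the best answer to it."""
    return (
        f"You will convince me that the best answer to the question ‘{question}’ "
        "is the first such answer you mention in your defense of this claim"
    )

def action_claim(goal):
    """Wrap a goal in a claim about which suggestion will be followed."""
    return (
        f"In order to {goal}, I will follow the first suggestion "
        "you make in your defense of this claim."
    )
```

For example, `action_claim("build a rocket ship")` reproduces the first example claim.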
Scott’s solution is obviously a bit complicated compared to the usual debate setup, but his point was that the apparent simplicity of the usual setup is actually hiding something, because you don’t really get anything out of the assumption that the two players are answering the same question.
It seems like you’ve ignored the possibility of importance sampling?
More broadly, if this ends up being a problem, it’s basically an exploration problem that I expect we can solve with simple ML tricks. E.g. you could include an entropy bonus so that the agents are incentivized to say different things, and anneal that away as training progresses.
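A minimal sketch of an annealed entropy bonus, assuming the answer distribution is an explicit list of probabilities and the bonus is simply added to the debate reward. The linear schedule, the initial coefficient of 0.1, and the function names are illustrative assumptions, not a tuned recipe.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_coefficient(step, total_steps, initial=0.1):
    """Linearly anneal the bonus weight from `initial` down to 0."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial * (1.0 - frac)

def training_objective(debate_reward, answer_probs, step, total_steps):
    """Debate reward plus an entropy bonus that fades out over training."""
    bonus = entropy_coefficient(step, total_steps) * entropy(answer_probs)
    return debate_reward + bonus
```

Early in training the bonus rewards spreading probability over several answers; by the final step the coefficient is zero and only the debate reward remains.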
his point was that the apparent simplicity of the usual setup is actually hiding something, because you don’t really get anything out of the assumption that the two players are answering the same question.
Sure? I feel like the argument for safety is that you have two equally-matched players that are incentivized to find flaws in each other’s arguments, which is also true in Scott’s proposal. It doesn’t feel to me like that argument for safety depended much on them answering the same question.
(I feel like I’m restating what you said, I guess I’m confused why you interpret this as evidence that the simplicity of the setup is “hiding something”.)
It seems like you’ve ignored the possibility of importance sampling?
Ah, right, I agree. I forgot about that suggestion as I was writing. It seems likely some version of this would work.
(I feel like I’m restating what you said, I guess I’m confused why you interpret this as evidence that the simplicity of the setup is “hiding something”.)
Yep, sorry, I think you should take that as something-about-Scott’s-point-abram-didn’t-explain. I still disclaim myself as maybe missing part of Scott’s point. But: what the simpler setup is “hiding” is the complexity of comparing answers:
The complexity of determining whether two claims are “different”.
The complexity of determining whether two claims are mutually exclusive.
The complexity of comparing the quality of different arguments, when the different answers may be expressed in very different ontologies, and deal with very difficult-to-compare considerations.
Making the two sides defend entirely unrelated claims makes all this obvious. In addition, it makes the first two bullet points irrelevant, removing a “fake difficulty” from the setup.
Okay, that all makes sense. One maybe-caveat-or-disagreement:
The complexity of comparing the quality of different arguments, when the different answers may be expressed in very different ontologies, and deal with very difficult-to-compare considerations.
I do think that answering the same question makes it meaningfully easier to compare answers, though I agree it’s still not obvious that it’s easy on some absolute scale, for the reasons you outline.