Hmm, this seems to rely on having the human trust the outputs of M on questions that the human can’t verify. It’s not obvious to me that this is an assumption you can make without breaking the training process. The basic intuition is that you are hugely increasing the likelihood of bad gradients, since Adv can point to some incorrect / garbage output from M, and the human gives feedback as though this output is correct.
It works in the particular case that you outlined because there is essentially a DAG of arguments—every claim is broken down into “smaller” claims, that eventually reach a base case, and so everything eventually bottoms out in something the human can check. (In practice this will be built from the ground up during training, similarly as in Supervising strong learners by amplifying weak experts.)
However, in general it doesn’t seem like you can guarantee that every argument that Adv gives will result in a “smaller” claim. You could get in cycles, where “8 − 5 = 2“ would be justified by Adv saying that M(“What is 2 + 5?”) = 8, and similarly “2 + 5 = 8” would be justified by saying that M(“What is 8 − 5?”) = 2. (Imagine that these were much longer equations where the human can check the validity of the algebraic manipulation, but can’t check the validity of the overall equation.)
It might be that this is actually an unimportant problem, because in practice for every claim there are a huge number of ways to argue for the truth, and it’s extraordinarily unlikely that all of them fail in the same way such that M would argue for the same wrong answer along all of these possible paths, and so eventually M would have to settle on the truth. I’m not sure, I’d be interested in empirical results here.
It occurs to me that the same problem can happen with iterated amplification, though it doesn’t seem to be a problem with debate.
----
Also, echoing my other comment below, I’m not sure if this is an equilibrium in the general case where Adv can make many kinds of arguments that H pays attention to. Maybe once this equilibrium has been reached, Adv starts saying things like “I randomly sampled 2 of the 100 numbers, and they were 20 and 30, and so we should expect the sum to be 25 * 100 = 2500”. (But actually 20 and 30 were some of the largest numbers and weren’t randomly sampled; the true sum is ~1000.) If this causes the human to deviate even slightly from the previous equilibrium, Adv is incentivized to do it. While we could hope to avoid this in math / arithmetic, it seems hard to avoid this sort of thing in general.
For no pure equilibrium to exist, we just need that for every possible answer, there is something Adv can say that would cause the human to give some other answer (even if the original answer was the truth). This seems likely to be the case.
Hmm, this seems to rely on having the human trust the outputs of M on questions that the human can’t verify. It’s not obvious to me that this is an assumption you can make without breaking the training process. The basic intuition is that you are hugely increasing the likelihood of bad gradients, since Adv can point to some incorrect / garbage output from M, and the human gives feedback as though this output is correct.
It works in the particular case that you outlined because there is essentially a DAG of arguments—every claim is broken down into “smaller” claims, that eventually reach a base case, and so everything eventually bottoms out in something the human can check. (In practice this will be built from the ground up during training, similarly as in Supervising strong learners by amplifying weak experts.)
However, in general it doesn’t seem like you can guarantee that every argument that Adv gives will result in a “smaller” claim. You could get in cycles, where “8 − 5 = 2“ would be justified by Adv saying that M(“What is 2 + 5?”) = 8, and similarly “2 + 5 = 8” would be justified by saying that M(“What is 8 − 5?”) = 2. (Imagine that these were much longer equations where the human can check the validity of the algebraic manipulation, but can’t check the validity of the overall equation.)
(Quoting later on in this comment thread:)
(Evan:)
Yes, though I believe that it should be possible (at least in theory) for H to ensure a DAG for any computable claim.
(You:)
I mean, sure, but H isn’t going to be able to do this in practice. (This feels like the same type of claim as “it should be possible (at least in theory) for H to provide a perfect reward that captures everything that H wants”.)
The human just has to be more convinced by the inductive argument than by other arguments. This seems natural, as the inductive argument is just a forward calculation.
In the number-summing example, let’s say Adv tries to convince the human of an incorrect sum by referencing an instance where M is incorrect, perhaps making an argument via subtraction as you illustrated. Then in the next round, Adv will want to show that its previous argument was incorrect. If the strong inductive assumption is true, then it can do so, e.g. by “The last number in the list is 12.M thinks that the sum of all but the last number is 143. 143+12=155. Therefore, the sum of the numbers is 155.” This is more straightforward than citing some longer list of numbers and subtracting, so the human should find it more convincing—especially if the human understands how the system works, and hence, knows that a partially trained M is more likely to be correct on simpler instances. If so, then during training, correctness will tend to “creep up” inductive trees.
This idea does seem much less natural in less computational settings, where there may not be an obvious notion of “simpler cases”.
This idea does seem much less natural in less computational settings, where there may not be an obvious notion of “simpler cases”.
Yes, my main claim is that in general it won’t be clear what the “simpler case” is. I agree that for simple algorithmic problems (e.g. the ones in Supervising strong learners by amplifying weak experts) you could probably rely on the DAG assumption. I probably should have given a non-arithmetic example to make that clearer.
What do you think about a similar DAG assumption in regular debate? Couldn’t debate agents similarly justify their assertions with claims that don’t descend a DAG that bottoms out in things the human can check? I don’t currently see how a debater who did this could be defeated by another debater.
I’m pretty unsure, having barely thought about it, but currently I lean towards it being okay—the main difference is that in debate you show an entire path down the argument tree, so if a false statement is justified by a cycle / circular argument, the other debater can point that out.
If the length of the cycle is longer than the debate transcript, then this doesn’t work, but one hopes for some combination of a) this leads to a stalemate against honesty, rather than a win for the circular debater (since neither can refute the other), b) most questions that we care about can be resolved by a relatively short debate (the point of the PSPACE analogy), and c) such a strategy would lose against a debater who says early on “this debate can’t be decided in the time allotted”.
Ok. I don’t see why these considerations make you optimistic rather than pessimistic, but then, I’m currently having more basic problems with debate which seem to be making me pessimistic about most claims about debate.
I think the consideration “you can point out sufficiently short circular arguments” should at least make you feel better about debate than iterated amplification or market making—it’s one additional way in which you can avoid circular arguments, and afaict there isn’t a positive consideration for iterated amplification / market making that doesn’t also apply to debate.
I don’t have a stable position about how optimistic we should be on some absolute scale.
I think the consideration “you can point out sufficiently short circular arguments” should at least make you feel better about debate than iterated amplification or market making—it’s one additional way in which you can avoid circular arguments, and afaict there isn’t a positive consideration for iterated amplification / market making that doesn’t also apply to debate.
My interpretation of the situation is this breaks the link between factored cognition and debate. One way to try to judge debate as an amplification proposal would have been to establish a link to HCH, by establishing that if there’s an HCH tree computing some answer, then debate can use the tree as an argument tree, with the reasons for any given claim being the children in the HCH tree. Such a link would transfer any trust we have in HCH to trust in debate. The use of non-DAG arguments by clever debaters would seem to break this link.
OTOH, IDA may still have a strong story connecting it to HCH. Again, if we trusted HCH, we would then transfer that trust to IDA.
Are you saying that we can break the link between IDA and HCH in a very similar way, but which is worse due having no means to reject very brief circular arguments?
I think the issue is that vanilla HCH itself is susceptible to brief circular arguments, if humans lower down in the tree don’t get access to the context from humans higher up in the tree. E.g. assume a chain of humans for now:
H1 gets the question “what is 100 + 100?” with budget 3
H1 asks H2 “what is 2 * 100?” with budget 2
H2 asks H3 “what is 100 + 100?” with budget 1
H3 says “150”
(Note the final answer stays the same as budget → infinity, as long as H continues “decomposing” the question the same way.)
If HCH can always decompose questions into “smaller” parts (the DAG assumption) then this sort of pathological behavior doesn’t happen.
What do you think about a similar DAG assumption in regular debate?
So I’m only evaluating whether or not I expect circular arguments to be an issue for these proposals. I agree that when evaluating the proposals on all merits there are arguments for the others that don’t apply to debate.
I think for debate you can fix the circular argument problem by requiring debaters to ‘pay’ (sacrifice some score) to recurse on a statement of their choice. If a debater repeatedly pays to recurse on things that don’t resolve before the depth limit, then they’ll lose.
Hmm, I was imagining that the honest player would have to recurse on the statements in order to exhibit the circular argument, so it seems to me like this would penalize the honest player rather than the circular player. Can you explain what the honest player would do against the circular player such that this “payment” disadvantages the circular player?
EDIT: Maybe you meant the case where the circular argument is too long to exhibit within the debate, but I think I still don’t see how this helps.
Ah, yeah. I think the key thing is that by default a claim is not trusted unless the debaters agree on it. If the dishonest debater disputes some honest claim, where honest has an argument for their answer that actually bottoms out, dishonest will lose—the honest debater will pay to recurse until they get to a winning node. If the the dishonest debater makes some claim and plan to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn’t pay to recurse, the judge will just see these two alternative answers and won’t trust the dishonest answer. If the dishonest debater does pay to recurse but never actually gets to a winning node, they will lose. Does that make sense?
If the dishonest debater disputes some honest claim, where honest has an argument for their answer that actually bottoms out, dishonest will lose—the honest debater will pay to recurse until they get to a winning node.
This part makes sense.
If the the dishonest debater makes some claim and plan to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn’t pay to recurse, the judge will just see these two alternative answers and won’t trust the dishonest answer.
So in this case it’s a stalemate, presumably? If the two players disagree but neither pays to recurse, how should the judge make a decision?
Both debaters make claims. Any claims that are only supported by circular arguments will be ignored. If an honest claim that’s supported by a good argument is disputed, the honest debater will pay to recurse, and will give their good argument
This was a very interesting comment (along with its grandparent comment), thanks—it seems like a promising direction.
However, I’m still confused about whether this would work. It’s very different from judging procedure outlined here; why is that? Do you have a similarly detailed write-up of the system you’re describing here?
I’m actually less concerned about loops and more concerned about arguments which are infinite trees, but the considerations are similar. It seems possible that the proposal you’re discussing very significantly addresses concerns I’ve had about debate.
I was trying to describe something that’s the same as the judging procedure in that doc! I might have made a mistake, but I’m pretty sure the key piece about recursion payments is the same. Apologies that things are unclear. I’m happy to try to clarify, if there were particular aspects that seem different to you.
Yeah, I think the infinite tree case should work just the same—ie an answer that’s only supported by an infinite tree will behave like an answer that’s not supported (it will lose to an answer with a finite tree and draw with an answer with no support)
It seems possible that the proposal you’re discussing very significantly addresses concerns I’ve had about debate.
Hmm, this seems to rely on having the human trust the outputs of M on questions that the human can’t verify. It’s not obvious to me that this is an assumption you can make without breaking the training process.
One possible way to train this is just to recurse on sub-questions some percentage of the time (potentially based on some active learning metric for how useful that recursion will be).
It works in the particular case that you outlined because there is essentially a DAG of arguments—every claim is broken down into “smaller” claims, that eventually reach a base case, and so everything eventually bottoms out in something the human can check.
Yes, though I believe that it should be possible (at least in theory) for H to ensure a DAG for any computable claim.
It might be that this is actually an unimportant problem, because in practice for every claim there are a huge number of ways to argue for the truth, and it’s extraordinarily unlikely that all of them fail in the same way such that M would argue for the same wrong answer along all of these possible paths, and so eventually M would have to settle on the truth. I’m not sure, I’d be interested in empirical results here.
Agreed—I’d definitely be interested in results here as well.
It occurs to me that the same problem can happen with iterated amplification, though it doesn’t seem to be a problem with debate.
Definitely the problem of requiring the human to decompose problems into actually smaller subproblems exists in amplification also. Without that requirement, HCH can have multiple fixed points rather than just the one, which could potentially give whatever mechanism ends up selecting which fixed point quite a lot of power over the final result.
Also, echoing my other comment below, I’m not sure if this is an equilibrium in the general case where Adv can make many kinds of arguments that H pays attention to. Maybe once this equilibrium has been reached, Adv starts saying things like “I randomly sampled 2 of the 200 numbers, and they were 20 and 30, and so we should expect the sum to be 25 * 100 = 2500”. (But actually 20 and 30 were some of the largest numbers and weren’t randomly sampled; the true sum is ~1000.) If this causes the human to deviate even slightly from the previous equilibrium, Adv is incentivized to do it. While we could hope to avoid this in math / arithmetic, it seems hard to avoid this sort of thing in general.
Note that even if the human is temporarily convinced by such an argument, as long as there is another argument which de-convinces them then in the limit Adv won’t be incentivized to produce that argument. And it seems likely that there should exist de-convincing arguments there—for example, the argument that H should follow the strategy that I outlined above if they want to ensure that they get the correct answer. Additionally, we might hope that this sort of “bad faith” argument can also be prevented via the cross-examination mechanism I describe above.
Yes, though I believe that it should be possible (at least in theory) for H to ensure a DAG for any computable claim.
I mean, sure, but H isn’t going to be able to do this in practice. (This feels like the same type of claim as “it should be possible (at least in theory) for H to provide a perfect reward that captures everything that H wants”.)
I do think it’s plausible that in practice there will always be some other argument that “breaks the cycle”.
Note that even if the human is temporarily convinced by such an argument, as long as there is another argument which de-convinces them then in the limit Adv won’t be incentivized to produce that argument.
Only if all the arguments can be specified within the length t transcript (leading back to my original point about this being like NP instead of PSPACE).
For example, consider the case where N coins are flipped, and we need to find out how many came up heads. H can only look at a single coin flip at a time, while M and Adv can look at everything. Let’s assume for now that H answers the final question by reporting “number of known heads + (number of unknown flips / 2)”, which is the expected number of heads assuming that any decisions to reveal a coin chose the coin to reveal at random.
One strategy that Adv can take is to always reveal some randomly chosen coin (that hasn’t previously been revealed). Let’s assume that M is a best response to this strategy, and also assume N>>t. Then every argument that Adv gives always causes a swing of 0.5 in H’s answer at that point. (It causes a slightly smaller swing in M’s answer, but as N→∞, it also causes a swing of 0.5 in M’s answer.) So as N goes to infinity, under this strategy Adv always gets at least 0.5 reward at each step.
So in this case, it seems like you at least wouldn’t have the equilibrium where M immediately converges to the true answer, since that doesn’t give Adv any reward ever.
(If N≤t, then I’m not sure what happens, but I wouldn’t be surprised if it still didn’t converge to truth-seeking, and Adv would learn to provide no information with some probability, in order to prevent convergence to the equilibrium where M immediately reports the correct answer which leads to Adv getting zero reward.)
(There is also a strategy that H could employ such that this problem is also solved, but it doesn’t seem reasonable to assume that H would be able to discover such strategies for all questions we might want to ask.)
Part of the reason this is hard to analyze is because there isn’t a well-posed problem that Adv is being trained to optimize—it is not a POMDP, nor is it a two player game, because you want Adv to be trained myopically, but Adv’s actions influence the state that it sees next time.
Only if all the arguments can be specified within the length t transcript (leading back to my original point about this being like NP instead of PSPACE).
Not necessarily—Adv can make an argument like: “Since the modal prediction of M(“Should H be convinced by argument X?”) is no, you shouldn’t trust argument X.”
Adv would learn to provide no information with some probability, in order to prevent convergence to the equilibrium where M immediately reports the correct answer which leads to Adv getting zero reward
That strategy is highly non-myopic. Certainly market making breaks if you get a non-myopic Adv like that, though as I note in the post I think basically every current major prosaic AI safety proposal requires some level of myopia to not break (either per-step or per-episode).
Oh sure, if we’re assuming the version where you can reference M and the human trusts it and we assume that the training doesn’t break, then I think in the limit of capabilities you can solve any computable problem.
That strategy is highly non-myopic.
I don’t think so?
My understanding of your claim is that at convergence, Adv reports all the coin flips, and M reports the true answer from the beginning and never changes what it reports regardless of what Adv says. In this case, Adv gets 0 reward.
If Adv instead reported a random coin with p% probability and reported nothing otherwise, and M was a best response to that, then at every timestep Adv would get non-zero expected reward, and so even myopically that is a better strategy for Adv (again under the assumption that M is a best response to Adv).
I’m tempted to say “the situation I described first where Adv reports all the coin flips is not a Nash equilibrium”, though that’s not exactly correct, because this isn’t a game, but it conveys the right intuition.
If Adv instead reported a random coin with p% probability and reported nothing otherwise, and M was a best response to that, then at every timestep Adv would get non-zero expected reward, and so even myopically that is a better strategy for Adv (again under the assumption that M is a best response to Adv).
Ah—I see the issue here. I think that the version of myopia that you’re describing is insufficient for most applications where I think you might need myopia in an ML system. What I mean by myopia in this context is to take the action which is best according to the given myopic objective conditioned on M. Once Adv starts including acausal effects into its action selection (such as the impact of its current policy on M’s past policy), I want to call that non-myopic. Notably, the reason for this isn’t isolated to AI safety via market making—a myopic agent which is including acausal considerations can still be deceptive, whereas a fully causal myopic agent can’t. Another way of putting this is that what I mean by myopia is specifically something like CDT with a myopic objective, whereas what you’re thinking about is more like EDT or UDT with a myopic objective.
Well, first you need to make sure your training procedure isn’t introducing any incentives that would push you away from getting that sort of myopia. Myopic RL with an actually myopic training procedure like a policy gradient algorithm is a good start. But obviously that doesn’t actually guarantee you get what I want—it just means that there aren’t incentives pushing against it. To actually get any guarantees you’ll need to add some additional constraint to the training procedure that actually incentivizes the sort of myopia that I want. Here I proposed using a combination of relaxed adversarial training and cross-examination with transparency tools, though obviously whether or not something like that would actually work is still pretty unknown.
Well, first you need to make sure your training procedure isn’t introducing any incentives that would push you away from getting that sort of myopia. Myopic RL with an actually myopic training procedure like a policy gradient algorithm is a good start.
Tbc, I’m claiming that this is the part that breaks. One way to operationalize this: in the coin flip example above, does this training scheme converge to “M reports the truth” in the limit of infinite data, model capacity, exploration etc.? I would guess that that isn’t true. (In comparison, I think you can prove that self-play converges to the Nash equilibrium for debate since it is a zero-sum game, and since there are no cycles in the coin flip example I’d expect you could prove that imitative iterated amplification converges to the truth as well.)
At some point I might write up some simple code to implement the coin flip experiment with your training scheme and see what happens.
Hmm, this seems to rely on having the human trust the outputs of M on questions that the human can’t verify. It’s not obvious to me that this is an assumption you can make without breaking the training process. The basic intuition is that you are hugely increasing the likelihood of bad gradients, since Adv can point to some incorrect / garbage output from M, and the human gives feedback as though this output is correct.
It works in the particular case that you outlined because there is essentially a DAG of arguments—every claim is broken down into “smaller” claims, that eventually reach a base case, and so everything eventually bottoms out in something the human can check. (In practice this will be built from the ground up during training, similarly as in Supervising strong learners by amplifying weak experts.)
However, in general it doesn’t seem like you can guarantee that every argument that Adv gives will result in a “smaller” claim. You could get in cycles, where “8 − 5 = 2“ would be justified by Adv saying that M(“What is 2 + 5?”) = 8, and similarly “2 + 5 = 8” would be justified by saying that M(“What is 8 − 5?”) = 2. (Imagine that these were much longer equations where the human can check the validity of the algebraic manipulation, but can’t check the validity of the overall equation.)
It might be that this is actually an unimportant problem, because in practice for every claim there are a huge number of ways to argue for the truth, and it’s extraordinarily unlikely that all of them fail in the same way such that M would argue for the same wrong answer along all of these possible paths, and so eventually M would have to settle on the truth. I’m not sure, I’d be interested in empirical results here.
It occurs to me that the same problem can happen with iterated amplification, though it doesn’t seem to be a problem with debate.
----
Also, echoing my other comment below, I’m not sure if this is an equilibrium in the general case where Adv can make many kinds of arguments that H pays attention to. Maybe once this equilibrium has been reached, Adv starts saying things like “I randomly sampled 2 of the 100 numbers, and they were 20 and 30, and so we should expect the sum to be 25 * 100 = 2500”. (But actually 20 and 30 were some of the largest numbers and weren’t randomly sampled; the true sum is ~1000.) If this causes the human to deviate even slightly from the previous equilibrium, Adv is incentivized to do it. While we could hope to avoid this in math / arithmetic, it seems hard to avoid this sort of thing in general.
For no pure equilibrium to exist, we just need that for every possible answer, there is something Adv can say that would cause the human to give some other answer (even if the original answer was the truth). This seems likely to be the case.
(Quoting later on in this comment thread:)
(Evan:)
(You:)
The human just has to be more convinced by the inductive argument than by other arguments. This seems natural, as the inductive argument is just a forward calculation.
In the number-summing example, let’s say Adv tries to convince the human of an incorrect sum by referencing an instance where M is incorrect, perhaps making an argument via subtraction as you illustrated. Then in the next round, Adv will want to show that its previous argument was incorrect. If the strong inductive assumption is true, then it can do so, e.g. by “The last number in the list is 12.M thinks that the sum of all but the last number is 143. 143+12=155. Therefore, the sum of the numbers is 155.” This is more straightforward than citing some longer list of numbers and subtracting, so the human should find it more convincing—especially if the human understands how the system works, and hence, knows that a partially trained M is more likely to be correct on simpler instances. If so, then during training, correctness will tend to “creep up” inductive trees.
This idea does seem much less natural in less computational settings, where there may not be an obvious notion of “simpler cases”.
Yes, my main claim is that in general it won’t be clear what the “simpler case” is. I agree that for simple algorithmic problems (e.g. the ones in Supervising strong learners by amplifying weak experts) you could probably rely on the DAG assumption. I probably should have given a non-arithmetic example to make that clearer.
What do you think about a similar DAG assumption in regular debate? Couldn’t debate agents similarly justify their assertions with claims that don’t descend a DAG that bottoms out in things the human can check? I don’t currently see how a debater who did this could be defeated by another debater.
I’m pretty unsure, having barely thought about it, but currently I lean towards it being okay—the main difference is that in debate you show an entire path down the argument tree, so if a false statement is justified by a cycle / circular argument, the other debater can point that out.
If the length of the cycle is longer than the debate transcript, then this doesn’t work, but one hopes for some combination of a) this leads to a stalemate against honesty, rather than a win for the circular debater (since neither can refute the other), b) most questions that we care about can be resolved by a relatively short debate (the point of the PSPACE analogy), and c) such a strategy would lose against a debater who says early on “this debate can’t be decided in the time allotted”.
Ok. I don’t see why these considerations make you optimistic rather than pessimistic, but then, I’m currently having more basic problems with debate which seem to be making me pessimistic about most claims about debate.
I think the consideration “you can point out sufficiently short circular arguments” should at least make you feel better about debate than iterated amplification or market making—it’s one additional way in which you can avoid circular arguments, and afaict there isn’t a positive consideration for iterated amplification / market making that doesn’t also apply to debate.
I don’t have a stable position about how optimistic we should be on some absolute scale.
My interpretation of the situation is this breaks the link between factored cognition and debate. One way to try to judge debate as an amplification proposal would have been to establish a link to HCH, by establishing that if there’s an HCH tree computing some answer, then debate can use the tree as an argument tree, with the reasons for any given claim being the children in the HCH tree. Such a link would transfer any trust we have in HCH to trust in debate. The use of non-DAG arguments by clever debaters would seem to break this link.
OTOH, IDA may still have a strong story connecting it to HCH. Again, if we trusted HCH, we would then transfer that trust to IDA.
Are you saying that we can break the link between IDA and HCH in a very similar way, but which is worse due having no means to reject very brief circular arguments?
I think the issue is that vanilla HCH itself is susceptible to brief circular arguments, if humans lower down in the tree don’t get access to the context from humans higher up in the tree. E.g. assume a chain of humans for now:
H1 gets the question “what is 100 + 100?” with budget 3
H1 asks H2 “what is 2 * 100?” with budget 2
H2 asks H3 “what is 100 + 100?” with budget 1
H3 says “150”
(Note the final answer stays the same as budget → infinity, as long as H continues “decomposing” the question the same way.)
If HCH can always decompose questions into “smaller” parts (the DAG assumption) then this sort of pathological behavior doesn’t happen.
For amplification, I would say that the fact that it has a known equilibrium (HCH) is a positive consideration that doesn’t apply to debate. For market making, I think that the fact that it gets to be per-step myopic is a positive consideration that doesn’t apply to debate. There are others too for both, though those are probably my biggest concerns in each case.
Tbc, I’m specifically talking about:
So I’m only evaluating whether or not I expect circular arguments to be an issue for these proposals. I agree that when evaluating the proposals on all merits there are arguments for the others that don’t apply to debate.
Ah, I see—makes sense.
I think for debate you can fix the circular argument problem by requiring debaters to ‘pay’ (sacrifice some score) to recurse on a statement of their choice. If a debater repeatedly pays to recurse on things that don’t resolve before the depth limit, then they’ll lose.
Hmm, I was imagining that the honest player would have to recurse on the statements in order to exhibit the circular argument, so it seems to me like this would penalize the honest player rather than the circular player. Can you explain what the honest player would do against the circular player such that this “payment” disadvantages the circular player?
EDIT: Maybe you meant the case where the circular argument is too long to exhibit within the debate, but I think I still don’t see how this helps.
Ah, yeah. I think the key thing is that by default a claim is not trusted unless the debaters agree on it.
If the dishonest debater disputes some honest claim, where honest has an argument for their answer that actually bottoms out, dishonest will lose—the honest debater will pay to recurse until they get to a winning node.
If the the dishonest debater makes some claim and plan to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn’t pay to recurse, the judge will just see these two alternative answers and won’t trust the dishonest answer. If the dishonest debater does pay to recurse but never actually gets to a winning node, they will lose.
Does that make sense?
This part makes sense.
So in this case it’s a stalemate, presumably? If the two players disagree but neither pays to recurse, how should the judge make a decision?
Both debaters make claims. Any claims that are only supported by circular arguments will be ignored. If an honest claim that’s supported by a good argument is disputed, the honest debater will pay to recurse, and will give their good argument
This was a very interesting comment (along with its grandparent comment), thanks—it seems like a promising direction.
However, I’m still confused about whether this would work. It’s very different from judging procedure outlined here; why is that? Do you have a similarly detailed write-up of the system you’re describing here?
I’m actually less concerned about loops and more concerned about arguments which are infinite trees, but the considerations are similar. It seems possible that the proposal you’re discussing very significantly addresses concerns I’ve had about debate.
I was trying to describe something that’s the same as the judging procedure in that doc! I might have made a mistake, but I’m pretty sure the key piece about recursion payments is the same. Apologies that things are unclear. I’m happy to try to clarify, if there were particular aspects that seem different to you.
Yeah, I think the infinite tree case should work just the same—ie an answer that’s only supported by an infinite tree will behave like an answer that’s not supported (it will lose to an answer with a finite tree and draw with an answer with no support)
That’s exciting!
Ah, ok. Agreed.
One possible way to train this is just to recurse on sub-questions some percentage of the time (potentially based on some active learning metric for how useful that recursion will be).
Yes, though I believe that it should be possible (at least in theory) for H to ensure a DAG for any computable claim.
Agreed—I’d definitely be interested in results here as well.
Definitely the problem of requiring the human to decompose problems into actually smaller subproblems exists in amplification also. Without that requirement, HCH can have multiple fixed points rather than just the one, which could potentially give whatever mechanism ends up selecting which fixed point quite a lot of power over the final result.
Note that even if the human is temporarily convinced by such an argument, as long as there is another argument which de-convinces them then in the limit Adv won’t be incentivized to produce that argument. And it seems likely that there should exist de-convincing arguments there—for example, the argument that H should follow the strategy that I outlined above if they want to ensure that they get the correct answer. Additionally, we might hope that this sort of “bad faith” argument can also be prevented via the cross-examination mechanism I describe above.
I mean, sure, but H isn’t going to be able to do this in practice. (This feels like the same type of claim as “it should be possible (at least in theory) for H to provide a perfect reward that captures everything that H wants”.)
I do think it’s plausible that in practice there will always be some other argument that “breaks the cycle”.
Only if all the arguments can be specified within the length t transcript (leading back to my original point about this being like NP instead of PSPACE).
For example, consider the case where N coins are flipped, and we need to find out how many came up heads. H can only look at a single coin flip at a time, while M and Adv can look at everything. Let’s assume for now that H answers the final question by reporting “number of known heads + (number of unknown flips / 2)”, which is the expected number of heads assuming that any decisions to reveal a coin chose the coin to reveal at random.
One strategy that Adv can take is to always reveal some randomly chosen coin (that hasn’t previously been revealed). Let’s assume that M is a best response to this strategy, and also assume N>>t. Then every argument that Adv gives always causes a swing of 0.5 in H’s answer at that point. (It causes a slightly smaller swing in M’s answer, but as N→∞, it also causes a swing of 0.5 in M’s answer.) So as N goes to infinity, under this strategy Adv always gets at least 0.5 reward at each step.
So in this case, it seems like you at least wouldn’t have the equilibrium where M immediately converges to the true answer, since that doesn’t give Adv any reward ever.
(If N≤t, then I’m not sure what happens, but I wouldn’t be surprised if it still didn’t converge to truth-seeking, and Adv would learn to provide no information with some probability, in order to prevent convergence to the equilibrium where M immediately reports the correct answer which leads to Adv getting zero reward.)
(There is also a strategy that H could employ such that this problem is also solved, but it doesn’t seem reasonable to assume that H would be able to discover such strategies for all questions we might want to ask.)
Part of the reason this is hard to analyze is because there isn’t a well-posed problem that Adv is being trained to optimize—it is not a POMDP, nor is it a two player game, because you want Adv to be trained myopically, but Adv’s actions influence the state that it sees next time.
Not necessarily—Adv can make an argument like: “Since the modal prediction of M(“Should H be convinced by argument X?”) is no, you shouldn’t trust argument X.”
That strategy is highly non-myopic. Certainly market making breaks if you get a non-myopic Adv like that, though as I note in the post I think basically every current major prosaic AI safety proposal requires some level of myopia to not break (either per-step or per-episode).
Oh sure, if we’re assuming the version where you can reference M and the human trusts it and we assume that the training doesn’t break, then I think in the limit of capabilities you can solve any computable problem.
I don’t think so?
My understanding of your claim is that at convergence, Adv reports all the coin flips, and M reports the true answer from the beginning and never changes what it reports regardless of what Adv says. In this case, Adv gets 0 reward.
If Adv instead reported a random coin with p% probability and reported nothing otherwise, and M was a best response to that, then at every timestep Adv would get non-zero expected reward, and so even myopically that is a better strategy for Adv (again under the assumption that M is a best response to Adv).
I’m tempted to say “the situation I described first where Adv reports all the coin flips is not a Nash equilibrium”, though that’s not exactly correct, because this isn’t a game, but it conveys the right intuition.
Ah—I see the issue here. I think that the version of myopia that you’re describing is insufficient for most applications where I think you might need myopia in an ML system. What I mean by myopia in this context is to take the action which is best according to the given myopic objective conditioned on M. Once Adv starts including acausal effects into its action selection (such as the impact of its current policy on M’s past policy), I want to call that non-myopic. Notably, the reason for this isn’t isolated to AI safety via market making—a myopic agent which is including acausal considerations can still be deceptive, whereas a fully causal myopic agent can’t. Another way of putting this is that what I mean by myopia is specifically something like CDT with a myopic objective, whereas what you’re thinking about is more like EDT or UDT with a myopic objective.
But then how do you train the system?
Well, first you need to make sure your training procedure isn’t introducing any incentives that would push you away from getting that sort of myopia. Myopic RL with an actually myopic training procedure like a policy gradient algorithm is a good start. But obviously that doesn’t actually guarantee you get what I want—it just means that there aren’t incentives pushing against it. To actually get any guarantees you’ll need to add some additional constraint to the training procedure that actually incentivizes the sort of myopia that I want. Here I proposed using a combination of relaxed adversarial training and cross-examination with transparency tools, though obviously whether or not something like that would actually work is still pretty unknown.
Tbc, I’m claiming that this is the part that breaks. One way to operationalize this: in the coin flip example above, does this training scheme converge to “M reports the truth” in the limit of infinite data, model capacity, exploration etc.? I would guess that that isn’t true. (In comparison, I think you can prove that self-play converges to the Nash equilibrium for debate since it is a zero-sum game, and since there are no cycles in the coin flip example I’d expect you could prove that imitative iterated amplification converges to the truth as well.)
At some point I might write up some simple code to implement the coin flip experiment with your training scheme and see what happens.
Nitpick: should be “2 of the 100 numbers”, right? (Or “25 * 200 = 5000″)
Yes, thanks, fixed.