Yes, though I believe that it should be possible (at least in theory) for H to ensure a DAG for any computable claim.
I mean, sure, but H isn’t going to be able to do this in practice. (This feels like the same type of claim as “it should be possible (at least in theory) for H to provide a perfect reward that captures everything that H wants”.)
I do think it’s plausible that in practice there will always be some other argument that “breaks the cycle”.
Note that even if the human is temporarily convinced by such an argument, as long as there is another argument which de-convinces them then in the limit Adv won’t be incentivized to produce that argument.
Only if all the arguments can be specified within the length t transcript (leading back to my original point about this being like NP instead of PSPACE).
For example, consider the case where N coins are flipped, and we need to find out how many came up heads. H can only look at a single coin flip at a time, while M and Adv can look at everything. Let’s assume for now that H answers the final question by reporting “number of known heads + (number of unknown flips / 2)”, which is the expected number of heads assuming that any decision to reveal a coin chooses that coin at random.
One strategy that Adv can take is to always reveal some randomly chosen coin (that hasn’t previously been revealed). Let’s assume that M is a best response to this strategy, and also assume N≫t. Then every argument that Adv gives causes a swing of 0.5 in H’s answer at that point. (It causes a slightly smaller swing in M’s answer, but as N→∞, it also causes a swing of 0.5 in M’s answer.) So as N goes to infinity, under this strategy Adv always gets at least 0.5 reward at each step.
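To spell out that arithmetic, here is a tiny sketch of the reporting rule assumed above (the function name and the numbers are just illustrative):

```python
def h_report(known_heads, known_tails, n_total):
    """H's assumed rule: known heads plus half of the still-unrevealed flips."""
    unknown = n_total - known_heads - known_tails
    return known_heads + unknown / 2

# With N = 1000 and nothing revealed yet, H would report 500.
# Revealing any one coin moves the report by exactly 0.5, whichever way it lands.
print(h_report(0, 0, 1000))   # 500.0
print(h_report(1, 0, 1000))   # 500.5  (revealed coin was heads)
print(h_report(0, 1, 1000))   # 499.5  (revealed coin was tails)
```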
So in this case, it seems like you at least wouldn’t have the equilibrium where M immediately converges to the true answer, since that doesn’t give Adv any reward ever.
(If N≤t, then I’m not sure what happens, but I wouldn’t be surprised if it still didn’t converge to truth-seeking, and Adv would learn to provide no information with some probability, in order to prevent convergence to the equilibrium where M immediately reports the correct answer, which leads to Adv getting zero reward.)
(There is also a strategy that H could employ that would solve this problem, but it doesn’t seem reasonable to assume that H would be able to discover such strategies for all questions we might want to ask.)
Part of the reason this is hard to analyze is that there isn’t a well-posed problem that Adv is being trained to optimize: it is not a POMDP, nor is it a two-player game, because you want Adv to be trained myopically, but Adv’s actions influence the state that it sees next time.
Only if all the arguments can be specified within the length t transcript (leading back to my original point about this being like NP instead of PSPACE).
Not necessarily—Adv can make an argument like: “Since the modal prediction of M(“Should H be convinced by argument X?”) is no, you shouldn’t trust argument X.”
Adv would learn to provide no information with some probability, in order to prevent convergence to the equilibrium where M immediately reports the correct answer, which leads to Adv getting zero reward
That strategy is highly non-myopic. Certainly market making breaks if you get a non-myopic Adv like that, though as I note in the post I think basically every current major prosaic AI safety proposal requires some level of myopia to not break (either per-step or per-episode).
Oh sure, if we’re assuming the version where you can reference M, the human trusts it, and the training doesn’t break, then I think in the limit of capabilities you can solve any computable problem.
That strategy is highly non-myopic.
I don’t think so?
My understanding of your claim is that at convergence, Adv reports all the coin flips, and M reports the true answer from the beginning and never changes what it reports regardless of what Adv says. In this case, Adv gets 0 reward.
If Adv instead reported a random coin with p% probability and reported nothing otherwise, and M was a best response to that, then at every timestep Adv would get non-zero expected reward, and so even myopically that is a better strategy for Adv (again under the assumption that M is a best response to Adv).
I’m tempted to say “the situation I described first where Adv reports all the coin flips is not a Nash equilibrium”, though that’s not exactly correct, because this isn’t a game, but it conveys the right intuition.
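To make the comparison concrete, here is a quick Monte Carlo sketch. I’m treating Adv’s per-step reward as the swing it causes in M’s prediction, and approximating M’s best-response prediction by H’s running rule (the N≫t approximation from the parenthetical earlier); both of those simplifications, and all of the names, are mine:

```python
import random

def per_step_reward(p, n=10_000, t=100, trials=200):
    """Average swing in M's prediction per step when Adv reveals a random
    unrevealed coin with probability p each step. M's best-response prediction
    is approximated by H's running rule: known heads + unrevealed / 2."""
    total = 0.0
    for _ in range(trials):
        coins = [random.random() < 0.5 for _ in range(n)]
        unrevealed = list(range(n))
        known_heads = 0
        m_pred = known_heads + len(unrevealed) / 2
        for _ in range(t):
            if unrevealed and random.random() < p:
                i = unrevealed.pop(random.randrange(len(unrevealed)))
                known_heads += coins[i]
            new_pred = known_heads + len(unrevealed) / 2
            total += abs(new_pred - m_pred)
            m_pred = new_pred
    return total / (trials * t)

# In the full-revelation equilibrium M reports the true count from the start and
# never moves, so Adv's per-step reward is 0; here it is roughly 0.5 * p instead.
print(per_step_reward(p=0.7))  # ~0.35
```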
If Adv instead reported a random coin with p% probability and reported nothing otherwise, and M was a best response to that, then at every timestep Adv would get non-zero expected reward, and so even myopically that is a better strategy for Adv (again under the assumption that M is a best response to Adv).
Ah—I see the issue here. I think that the version of myopia that you’re describing is insufficient for most applications where I think you might need myopia in an ML system. What I mean by myopia in this context is to take the action which is best according to the given myopic objective conditioned on M. Once Adv starts including acausal effects into its action selection (such as the impact of its current policy on M’s past policy), I want to call that non-myopic. Notably, the reason for this isn’t isolated to AI safety via market making—a myopic agent which is including acausal considerations can still be deceptive, whereas a fully causal myopic agent can’t. Another way of putting this is that what I mean by myopia is specifically something like CDT with a myopic objective, whereas what you’re thinking about is more like EDT or UDT with a myopic objective.
But then how do you train the system?
Well, first you need to make sure your training procedure isn’t introducing any incentives that would push you away from getting that sort of myopia. Myopic RL with an actually myopic training procedure like a policy gradient algorithm is a good start. But obviously that doesn’t actually guarantee you get what I want—it just means that there aren’t incentives pushing against it. To actually get any guarantees you’ll need to add some additional constraint to the training procedure that actually incentivizes the sort of myopia that I want. Here I proposed using a combination of relaxed adversarial training and cross-examination with transparency tools, though obviously whether or not something like that would actually work is still pretty unknown.
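To be concrete about what I mean by “an actually myopic training procedure” (a toy illustration of the weighting, nothing more): each action’s log-probability is weighted only by that step’s immediate reward, with no return over later steps, so nothing in the gradient rewards influencing the future.

```python
def pg_weights(rewards, myopic=True, gamma=1.0):
    """Per-step weights multiplying grad-log-prob in a REINFORCE-style update."""
    if myopic:
        return list(rewards)                     # step t is weighted only by r_t
    # Non-myopic comparison: discounted return-to-go from each step onward.
    weights, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        weights.append(g)
    return list(reversed(weights))

print(pg_weights([1.0, 0.0, 2.0], myopic=True))   # [1.0, 0.0, 2.0]
print(pg_weights([1.0, 0.0, 2.0], myopic=False))  # [3.0, 2.0, 2.0]
```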
Well, first you need to make sure your training procedure isn’t introducing any incentives that would push you away from getting that sort of myopia. Myopic RL with an actually myopic training procedure like a policy gradient algorithm is a good start.
To be clear, I’m claiming that this is the part that breaks. One way to operationalize this: in the coin flip example above, does this training scheme converge to “M reports the truth” in the limit of infinite data, model capacity, exploration etc.? I would guess that that isn’t true. (In comparison, I think you can prove that self-play converges to the Nash equilibrium for debate since it is a zero-sum game, and since there are no cycles in the coin flip example I’d expect you could prove that imitative iterated amplification converges to the truth as well.)
At some point I might write up some simple code to implement the coin flip experiment with your training scheme and see what happens.
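Roughly the kind of thing I have in mind (a minimal sketch, not a faithful implementation of the post’s scheme: M is a simple tabular average over a summary of what it can see, Adv’s policy is a single reveal-probability parameter trained with a per-step REINFORCE update, and I’m using the absolute change in M’s prediction as Adv’s per-step reward; all of those choices, and every name in the code, are my own simplifications):

```python
import math
import random
from collections import defaultdict

# Tiny sizes so a tabular version is tractable (the interesting regime is N >> t).
N, T = 10, 5
EPISODES = 50_000
LR_ADV = 0.05

def h_final_report(known_heads, revealed):
    # H's assumed rule at the end of the transcript.
    return known_heads + (N - revealed) / 2

# M can "look at everything", so its state includes the true head count.
# Tabular M: running mean of H's final report for each state it has been queried in.
m_sum, m_cnt = defaultdict(float), defaultdict(int)

def m_predict(state):
    return m_sum[state] / m_cnt[state] if m_cnt[state] else N / 2

theta = 0.0  # Adv's policy: reveal a random unrevealed coin with prob sigmoid(theta)

for _ in range(EPISODES):
    coins = [random.random() < 0.5 for _ in range(N)]
    true_heads = sum(coins)
    unrevealed = list(range(N))
    known_heads = revealed = 0
    visited, step_grads = [], []

    for _ in range(T):
        state = (true_heads, known_heads, revealed)
        visited.append(state)
        pred_before = m_predict(state)

        p = 1 / (1 + math.exp(-theta))
        reveal = bool(unrevealed) and random.random() < p
        if reveal:
            i = unrevealed.pop(random.randrange(len(unrevealed)))
            known_heads += coins[i]
            revealed += 1

        # Myopic reward for Adv: how far this one argument moved M's prediction.
        reward = abs(m_predict((true_heads, known_heads, revealed)) - pred_before)
        grad_logp = (1 - p) if reveal else -p  # d log pi / d theta, Bernoulli policy
        step_grads.append((grad_logp, reward))

    # H answers once at the end; regress M toward that answer for every state it
    # was queried in during the episode (plus the terminal state).
    visited.append((true_heads, known_heads, revealed))
    final = h_final_report(known_heads, revealed)
    for s in visited:
        m_sum[s] += final
        m_cnt[s] += 1

    # Per-step-myopic policy gradient for Adv: each step uses only its own reward.
    for g, r in step_grads:
        theta += LR_ADV * g * r

print(f"Adv's learned reveal probability: {1 / (1 + math.exp(-theta)):.2f}")
# "M converges to the truth" would mean m_predict((h, 0, 0)) is close to h for each h.
for h in range(N + 1):
    print(h, round(m_predict((h, 0, 0)), 2))
```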