Nice, an excuse to apply my love for Computational Complexity to useful questions of AI Safety!
That being said, I’m a little confused about what you show.
Second, we’ll say that a proposal to train a model M using a loss function L_M accesses a complexity class C iff, for any decision problem p∈C, there exists some strategy available to H such that, for any M which is optimal under L_M given H’s strategy, M(p) always solves p.
I follow up to there. What you’re showing is basically that you can train M to solve any problem in C using a specific alignment approach, without limiting the computational power of M in any way. It might take an intractable amount of resources, like exponential time, for the model to solve a problem in PSPACE, but what matters here is just that it does. The point is to show that alignment approaches using a polynomial-time human can solve these problems, not how many resources they use to do so.
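To make sure I parse the quantifiers in the right order, here is the definition as I would write it (my own notation, not from the post: s_H ranges over strategies available to H, Opt(L_M | s_H) is the set of models optimal under L_M given s_H, and M(p, x) = p(x) is my reading of “M(p) always solves p”):

$$\forall p \in C,\ \exists s_H,\ \forall M \in \mathrm{Opt}(L_M \mid s_H),\ \forall x:\ M(p, x) = p(x)$$

So the strategy is chosen per problem, before quantifying over the optimal models.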
And then you write about the intuition:
Thus, conceptually, a proposal accesses C if there is a (polynomial-time) strategy you can implement such that—conditional on you knowing that the model is optimal—you would trust its output for any problem in C.
Maybe it’s just grammar, but I read this sentence as saying that I trust the output of the polynomial-time strategy, and thus that you can solve PSPACE, NEXP, EXP, and R problems in polynomial time. So I’m assuming that you mean trusting the model, which, once again, has no limit on the resources it uses.
Irving et al. actually note that, if you don’t imagine optimal play and simply restrict to the set of problems that a polynomial-time human can actually judge, debate only reaches NP rather than PSPACE.
I looked for that statement in the paper, failed to find it, then realized you probably meant that raw polynomial-time verification gives you NP (the certificate definition of NP, basically). Riffing on the importance of optimal play: Irving et al. show that the debate game is a game in the complexity-theoretic sense, and thus that it is equivalent to TQBF, a PSPACE-complete problem. But when seeing a closed quantified Boolean formula as a game, deciding whether it belongs to TQBF amounts to deciding whether the first player has a winning strategy. Debate solves this by assuming optimal play, and thus that the winning debater will have, find, and apply a winning strategy for the debate.
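To make the game semantics concrete, here is a brute-force sketch of deciding a closed QBF by searching its game tree (exponential time of course, and entirely my own toy code, not anything from the paper): the formula is true iff the first player, who moves at the ∃ quantifiers, has a winning strategy.

```python
def first_player_wins(quantifiers, clauses, assignment=()):
    """Decide a closed QBF by brute-force search of its game tree.

    quantifiers: one character per variable, in play order:
                 'E' = exists-player move, 'A' = forall-player move.
    clauses:     CNF matrix as signed 1-based literals,
                 e.g. [[1, -2], [2]] means (x1 or not x2) and x2.
    Returns True iff the formula is true, i.e. iff the exists-player
    has a winning strategy from this position.
    """
    i = len(assignment)
    if i == len(quantifiers):
        # Leaf of the game tree: all variables are set, so evaluate the CNF.
        return all(any(assignment[abs(l) - 1] == (l > 0) for l in clause)
                   for clause in clauses)
    moves = (first_player_wins(quantifiers, clauses, assignment + (v,))
             for v in (False, True))
    # The exists-player only needs SOME winning move; the forall-player
    # must leave the exists-player winning on EVERY move.
    return any(moves) if quantifiers[i] == 'E' else all(moves)

# Exists x1, forall x2: (x1 or x2) and (x1 or not x2) -- true via x1 = True.
print(first_player_wins('EA', [[1, 2], [1, -2]]))  # True
```

Assuming optimal play is exactly assuming that the winning side of this search actually gets played out perfectly, which is how debate gets from a polynomial-time judge to PSPACE.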
I find it fun to compare this to IP, the class of problems decidable by interactive proofs, which is also equal to PSPACE. In an interactive protocol, a computationally unbounded prover convinces a polynomial-time verifier of the truth of a statement, such as the existence of a winning strategy in the game corresponding to a TQBF instance. As Scott Aaronson writes in “Quantum Computing Since Democritus”:
I won’t go through Shamir’s result here, but this means that, if a super-intelligent alien came to Earth, it could prove to us whether white or black has the winning strategy in chess, or if chess is a draw.
This seems more powerful than debate, like a meta level above. Yet it isn’t, and that makes sense: showing you that there is a winning strategy is powerful, but a guarantee of always winning whenever a winning strategy exists is almost as good.
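To make that prover/verifier pattern concrete, here is a sketch of the textbook interactive proof for graph non-isomorphism (a much simpler protocol than Shamir’s sum-check for TQBF; the code and all names in it are my own toy version):

```python
import random
from itertools import permutations

def relabel(edges, perm):
    """Apply the vertex relabeling perm to an undirected edge set."""
    return frozenset(tuple(sorted((perm[u], perm[v]))) for u, v in edges)

def unbounded_prover(g0, g1, n):
    """The all-powerful prover: brute-forces which graph a challenge matches."""
    def answer(challenge):
        if any(relabel(g0, perm) == challenge for perm in permutations(range(n))):
            return 0
        return 1
    return answer

def verifier_accepts(g0, g1, n, prover, rounds=20):
    """Polynomial-time verifier for the claim that g0 and g1 are NOT isomorphic.

    Each round the verifier secretly picks b and a random relabeling, sends
    the scrambled graph, and checks that the prover recovers b. If the graphs
    are isomorphic, the scrambled graph carries no information about b, so
    any prover gets caught with probability 1/2 per round; only for truly
    non-isomorphic graphs can a prover be right every time.
    """
    graphs = (g0, g1)
    for _ in range(rounds):
        b = random.randrange(2)
        perm = random.sample(range(n), n)  # a uniformly random permutation
        if prover(relabel(graphs[b], perm)) != b:
            return False
    return True

# Triangle vs. 3-vertex path: not isomorphic, so the verifier always accepts.
triangle = frozenset([(0, 1), (1, 2), (0, 2)])
path = frozenset([(0, 1), (1, 2)])
print(verifier_accepts(triangle, path, 3, unbounded_prover(triangle, path, 3)))  # True
```

The verifier only ever does polynomial-time work; all the hard work (here, the brute-force matching) is pushed onto the prover, just as debate pushes it onto the debaters.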
With all that, I’ve not gotten into your proofs at all. I’m reading them in detail, but I’ll have to take a bit more time before having constructive comments on them. ^^
A notation question I can already ask, though: what is M(q)|H(M,q)?
Glad you found the post exciting!

I follow up to there. What you’re showing is basically that you can train M to solve any problem in C using a specific alignment approach, without limiting the computational power of M in any way. It might take an intractable amount of resources, like exponential time, for the model to solve a problem in PSPACE, but what matters here is just that it does. The point is to show that alignment approaches using a polynomial-time human can solve these problems, not how many resources they use to do so.
Yep—that’s right.
Maybe it’s just grammar, but I read this sentence as saying that I trust the output of the polynomial-time strategy, and thus that you can solve PSPACE, NEXP, EXP, and R problems in polynomial time. So I’m assuming that you mean trusting the model, which, once again, has no limit on the resources it uses.
I certainly don’t mean that the model needs to be polynomial time. I edited the post to try to clarify this point.
I looked for that statement in the paper, failed to find it, then realized you probably meant that raw polynomial-time verification gives you NP (the certificate definition of NP, basically). Riffing on the importance of optimal play: Irving et al. show that the debate game is a game in the complexity-theoretic sense, and thus that it is equivalent to TQBF, a PSPACE-complete problem. But when seeing a closed quantified Boolean formula as a game, deciding whether it belongs to TQBF amounts to deciding whether the first player has a winning strategy. Debate solves this by assuming optimal play, and thus that the winning debater will have, find, and apply a winning strategy for the debate.
Yeah—that’s my interpretation of the debate paper as well.
A notation question I can already ask, though: what is M(q)|H(M,q)?
See my comment that attempted to figure this out.
The way I see it, the point is that you can reduce “trusting the model” to “trusting that the model has reached the optimal equilibrium”.