It seems to me that the following argument should hold:
Premise 1: We can’t build physical machines that solve problems outside of P.
Premise 2: Recursive alignment proposals (HCH, debate, market-making, recursive reward modelling) at equilibrium would solve problems outside of P.
Conclusion 1, from premises 1 and 2: We can’t build physical machines that implement equilibrium policies for recursive alignment proposals.
Premise 3: Existing arguments for the safety of recursive alignment proposals reason about their equilibrium policies.
Conclusion 2, from conclusion 1 and premise 3: If we were to build physical machines that were trained using recursive alignment proposals, existing arguments would not establish the safety of the resulting policies.
Note that this assumes that P does not contain PSPACE.
Also, the proofs in the post (or at least the proof that iterative amplification with weak HCH accesses PSPACE, which is the only one I’ve read so far) assume polynomial-time access to M, which in reality contradicts premise 2. So I’m not sure what to make of things.
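To spell out the propositional skeleton of the step from premises 1 and 2 to conclusion 1, here is a rough formal sketch (the predicate names are stand-ins I'm introducing purely for illustration, not anything from the post):

```lean
-- Stand-in predicates over an abstract type of physical machines (illustrative only).
variable (Machine : Type)
variable (buildable solvesOutsideP equilibriumPolicy : Machine → Prop)

-- Premise 1: buildable machines don't solve problems outside of P.
-- Premise 2: implementing the equilibrium policy would solve problems outside of P.
-- Conclusion 1: no buildable machine implements the equilibrium policy.
example
    (p1 : ∀ m, buildable m → ¬ solvesOutsideP m)
    (p2 : ∀ m, equilibriumPolicy m → solvesOutsideP m) :
    ∀ m, buildable m → ¬ equilibriumPolicy m :=
  fun m hb he => p1 m hb (p2 m he)
```

Conclusion 2 then just observes that the safety arguments (premise 3) are about exactly the policies that conclusion 1 says we won't get.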
After you pass this back through the analogy, it becomes
Seems much less true at that point.
I’m confused by this comment. The post assumes that humans can solve all problems in P (in fact, all problems solvable in polynomial time given access to an oracle for M), then proves that various protocols can solve tricky problems by getting the human player to solve problems in P that in reality aren’t typical human problems to solve. Therefore, I take P to actually mean P, rather than being an analogy for problems humans can solve.
Also, regarding your version of premise 1, I think I buy that AI can only give you a polynomial speedup over humans.
I’ve always understood it as an analogy (e.g. from the debate paper, “Although debate is intended for use with fuzzy humans as judges, we can gain intuition about the model by replacing the human with an arbitrary polynomial time algorithm”). But if I take your perspective, then we have:
Premise 0: Humans can solve all problems in P.
Premise 1: We can’t build physical machines that solve problems outside of P.
Conclusion: We can’t build physical machines that can solve problems that humans can’t.
Seems like you should probably ditch one of the premises.
(I do think premise 0 is pretty wrong—I can solve arbitrary SAT problems that have < 10 clauses, despite it being NP-complete, but I can’t multiply a trillion numbers together, even though that is in P. Complexity classes impose scaling limits; humans have resource limits; these are very different. Yes, I do know that “SAT problems with < 10 clauses” is technically in P.)
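To make the scaling-versus-resources distinction concrete, here's a toy sketch (the instance and numbers below are made up purely for illustration): brute force dispatches a tiny SAT instance instantly even though SAT scales exponentially in general, while multiplying a trillion numbers is comfortably in P yet far beyond any human's resource budget.

```python
from itertools import product

# Toy SAT instance: 4 variables, 3 clauses; each literal is (variable index, required value).
clauses = [[(0, True), (1, False)], [(1, True), (2, True)], [(2, False), (3, True)]]

def satisfiable(clauses, n_vars):
    # Brute force over all 2^n assignments: exponential scaling, but trivial for tiny n.
    for assignment in product([False, True], repeat=n_vars):
        if all(any(assignment[v] == val for v, val in clause) for clause in clauses):
            return True
    return False

print(satisfiable(clauses, 4))  # True, found essentially instantly

# Multiplying a trillion numbers together is polynomial time, but the 10^12 operations
# (and a product whose decimal representation can itself run to trillions of digits)
# put it far outside human resource limits -- a resource limit, not a complexity-class limit.
```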
I think this is easily handled by saying “in practice, the models we train will not be literally optimal”. It’s still useful to know what would happen at optimality (e.g. I do in fact have more trust for deep RL algorithms that have proofs of convergence to optimal policies given infinite model capacity, compute and data).
I read the post as attempting to be literal; ctrl+F-ing “analog” doesn’t get me anything until the comments. Also, the post is the one that I read as assuming, for the sake of analysis, that humans can solve all problems in P; I myself wouldn’t necessarily assume that.
My guess is that the thing you mean is something like “Sure, the conclusion of the post is that optimal models can do more than a polynomial speedup over humans, but we won’t train optimal models, and in fact the things we train will just be a polynomial speedup over humans”, which is compatible with my argument in the top-level comment. But in general your comments make me think that you’re interpreting the post completely differently than I am. [EDIT: actually now that I re-read the second half of your comment it makes sense to me. I still am confused about what you think this post means.]
I mean, Evan can speak to what he meant. I strongly disagree with the literal interpretation of the post, regardless of what Evan meant, for the reason I gave above.
I usually think of the analogies as demonstrating the mechanism by which a particular scheme is meant to outperform a human. I do find these proof-analogies less compelling than the original one in debate because they’re simulating Turing machines which is not how I expect any of them to work in practice. (In contrast, the debate proof for accessing PSPACE was about narrowing in on disagreements as being equivalent to finding a winning path of a two-player game, which is the canonical “structure” of a PSPACE-complete problem and is similar to why we’d expect debate to incentivize honesty in practice.)
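For concreteness, the game structure here is the one behind TQBF (deciding the truth of a quantified Boolean formula), the canonical PSPACE-complete problem: the existential player chooses values at ∃ quantifiers, the universal player at ∀ quantifiers, and the formula is true iff the existential player has a winning strategy. A minimal sketch of that correspondence (my own toy illustration, not the construction from the debate paper):

```python
# Evaluate a quantified Boolean formula by treating it as a two-player game:
# the Exists player tries to make the matrix true, the Forall player tries to make it false.
def qbf_value(quantifiers, matrix, assignment=()):
    if not quantifiers:
        return matrix(assignment)
    q, rest = quantifiers[0], quantifiers[1:]
    outcomes = [qbf_value(rest, matrix, assignment + (b,)) for b in (False, True)]
    return any(outcomes) if q == "exists" else all(outcomes)

# Example: ∃x ∀y. (x ∨ y) ∧ (x ∨ ¬y) -- true, since x = True wins for the Exists player.
print(qbf_value(("exists", "forall"),
                lambda a: (a[0] or a[1]) and (a[0] or not a[1])))  # True
```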
I mean, I assumed what I needed to in order to be able to do the proofs and have them make sense. What the proofs actually mean in practice is obviously up for debate, but I think that a pretty reasonable interpretation is that they’re something like analogies which help us get a handle on how powerful the different proposals are in theory.
I’m curious if you agree with the inference of conclusions 1 and 2 from premises 1, 2, and 3, and/or the underlying claim that it’s bad news to learn that your alignment scheme would be able to solve a very large complexity class.
I agree with the gist that it implies that arguments about the equilibrium policy don’t necessarily translate to real models, though I disagree that that’s necessarily bad news for the alignment scheme—it just means you need to find some guarantees that work even when you’re not at equilibrium.
I don’t see where the post says that humans can solve all problems solvable in polynomial time given access to an oracle. What Evan does is just replace humans (a rather fuzzy category) with polynomial-time algorithms (the consensus notion of tractability in complexity theory).
On another note, you’re writing your comment on a physical device that can solve any computable problem, which obviously includes problems outside of P (if only by the time hierarchy theorem). The value of P is that it captures the problems one can solve on physical machines in a way that scales reasonably, so that you don’t need the lifespan of the universe to compute the answer for an instance of size 1000. But we clearly have algorithms for problems outside of P; we just believe/know they will take forever, and/or too much space, and/or too much energy.
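To put rough numbers on “take forever”, a quick back-of-the-envelope sketch (the hardware figure below is a made-up, generous estimate), comparing the operations available to a fast machine over the age of the universe with the 2^1000 steps a brute-force exponential algorithm would need on an instance of size 1000:

```python
# Back-of-the-envelope: a 2^n-step algorithm at n = 1000 versus physical limits.
ops_per_second = 10**18          # a generously fast (hypothetical) machine
age_of_universe_s = 4 * 10**17   # roughly 13.8 billion years, in seconds

total_ops_available = ops_per_second * age_of_universe_s  # ~4e35 operations
ops_needed = 2**1000                                      # ~1.07e301 operations

print(ops_needed > total_ops_available)  # True, by roughly 265 orders of magnitude
print(len(str(ops_needed)))              # 302 decimal digits
```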
TBC, by “can” I mean “can in practice”. Also, if we’re getting picky, my laptop is just a finite state machine :)
From the post:
Yes, this says that humans can solve any problem in P. It says nothing about using M as an oracle.
It doesn’t talk about using M as an oracle, but otherwise I don’t see how the proofs pan out: for instance, how else is
supposed to only take polynomial time?
You’re right: after rereading the proofs and talking with Evan, the only way for H to be polynomial time is for it to get oracle access to M. That’s slightly unintuitive, but it makes sense, because the part of the computation that depends on H and not on M is indeed polynomial time.