Fun post, even though I don’t expect debate of either form to see much use (because resolving tough real world questions offers too many chances for the equivalent of the plane simulation to have fatal flaws).
Thank you! If I may ask, what kind of fatal flaws do you expect for real-world simulations? Underspecified / ill-defined questions, buggy simulation software, multiple simulation programs giving irreconcilably conflicting answers in practice, etc.? I ask because I think that in some situations it’s reasonable to imagine the AI debaters providing the simulation software themselves if they can formally verify its accuracy, but that would struggle against e.g. underspecified questions.
Also, is there some prototypical example of a “tough real world question” you have in mind? I will gladly concede that not all questions naturally fit into this framework. I was primarily inspired by physical security questions like biological attacks or backdoors in mechanical hardware.
For topological debate that’s about two agents picking settings for simulation/computation, where those settings have a partial order that lets you take the “strictest” combination, a big class of fatal flaws would be if you don’t actually have the partial order you think you have within the practical range of the settings—i.e. if some settings you thought were more accurate/strict are actually systematically less accurate.
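To make that concrete, here’s a rough sketch (the per-segment representation is just something I’m inventing for illustration, not anything from the post) of the kind of ordered settings and “strictest” combination I mean:

```python
from typing import Dict

# Made-up representation: a setting assigns each named segment of the plane a cell size.
Resolution = Dict[str, float]  # segment name -> cell size (smaller = finer)

def refines(a: Resolution, b: Resolution) -> bool:
    """a >= b in the intended order iff a is at least as fine as b on every segment
    (assuming both bids cover the same segments)."""
    return all(a[seg] <= b[seg] for seg in b)

def strictest(a: Resolution, b: Resolution) -> Resolution:
    """The 'strictest' combination of two bids: the finer cell size on each segment."""
    return {seg: min(a[seg], b[seg]) for seg in a}

# The fatal-flaw case is when this syntactic order stops tracking accuracy:
# strictest(a, b) refines both bids, yet the simulator is systematically *less*
# accurate there (e.g. because certain combinations of cell sizes trigger artifacts).
```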
In the 1D plane example, this would be if some specific length scales (e.g. exact powers of 1000) cause simulation error, but as long as they’re rare, this is pretty easy to defend against.
In the fine-grained plane example, though, there’s a lot more room for fine-grained patterns of which parts of the plane get modeled at which length scale to start having nonlinear effects. If the agents are not allowed to bid “maximum resolution across the entire plane,” and instead are forced to allocate resources cleverly, then maybe you have a problem. But hopefully the truth is still advantaged, because the false player has to rely on fairly specific correlations, and the true player can maybe bid a bunch of noise that disrupts almost all of them.
(This makes possible a somewhat funny scene, where the operator expected the true player’s bid to look “normal,” and then goes to check the bids and both look like alien noise patterns.)
An egregious case would be where it’s harder to disrupt patterns injected during bids—e.g. if the players’ bids are ‘sparse’ / have finite support and might not overlap. Then the notion of the true player just needing to disrupt the false player seems a lot more unlikely, and both players might get pushed into playing very similar strategies that take every advantage of the dynamics of the simulator in order to control the answer in an unintended way.
I guess for a lot of “tough real world questions,” the difficulty of making a super-accurate simulator (one you even hope converges to the right answer) torpedoes the attempt before we have to start worrying about this kind of ‘fatal flaw’. But anything involving biology, human judgment, or too much computer code seems tough. “Does this gene therapy work?” might be something you could at least imagine a simulator for that still seems like it gives lots of opportunity for the false player.
These are all excellent points! I agree that these could be serious obstacles in practice. I do think there are some counter-measures available, though.
I think the easiest to address is the occasional random failure, e.g. your “giving the wrong answer on exact powers of 1000” example. I would probably try to address this issue by looking at stochastic models of computation, e.g. probabilistic Turing machines. You’d need to accommodate stochastic simulations anyway because so many techniques in practice use sampling. I think you can handle stochastic evaluation maps in a similar fashion to the deterministic case, but everything gets a bit more annoying (e.g. you likely need to include an incentive to point to simple world models if you want the game to have any equilibria at all).

Anyways, if you’ve figured out topological debate in the stochastic case, then you can reduce from the occasional-errors problem to the stochastic problem as follows: suppose (W,≤) is a directed set of world models and E is some simulation software. Define a stochastic program E′ which takes in a world model w, randomly samples a world model w′≥w according to some reasonably-spread-out distribution, and returns E(w′). In the 1D plane case, for example, you could take in a given resolution, divide it by a uniformly random real number in (1,10), and then run the simulation at that new resolution. If your errors are sufficiently rare then your stochastic topological debate setup should handle things from here.
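Here’s a rough sketch of that E → E′ wrapper in the 1D plane case (the WorldModel/Answer stand-ins are placeholders for illustration, not a real API):

```python
import random
from typing import Callable

# Placeholder types for the 1D plane case: a world model is just a resolution (cell size),
# and the answer is whether the plane snaps.
WorldModel = float   # cell size; a finer resolution (smaller cell size) is a refinement, w' >= w
Answer = bool

def smear(E: Callable[[WorldModel], Answer]) -> Callable[[WorldModel], Answer]:
    """Wrap a deterministic simulator E into the stochastic E' described above: on input w,
    sample a refinement w' >= w from a spread-out distribution and evaluate E there."""
    def E_prime(w: WorldModel) -> Answer:
        w_refined = w / random.uniform(1.0, 10.0)  # divide the resolution, i.e. refine it
        return E(w_refined)
    return E_prime

# If E only misbehaves on rare resolutions (say, exact powers of 1000), then for any bid w
# the sampled w' almost surely avoids them, so debate over E' ignores those isolated errors.
```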
Somewhat more serious is the case where “it’s harder to disrupt patterns injected during bids.” Mathematically I interpret this statement as the existence of a world model which evaluates to the wrong answer such that you have to take a vastly more computationally intensive refinement to get the correct answer. I think it’s reasonable to detect when this problem is occurring but preventing it seems hard: you’d basically need to create a better simulation program which doesn’t suffer from the same issue. For some problems that could be a tall order without assistance but if your AI agents are submitting the programs themselves subject to your approval then maybe it’s surmountable.
What I find the most serious and most interesting, though, is the case where your simulation software simply might not converge to the truth. To expand on your nonlinear effects example: suppose our resolution map can specify dimensions of individual grid cells. Suppose that our simulation software has a glitch where, if you alternate the sizes of grid cells along some direction, the simulation gets tricked into thinking the material has a different stiffness or something. This is a kind of glitch which both sides can exploit and the net probably won’t converge to anything.
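To make the non-convergence concrete, here’s a deliberately silly toy version of such a glitch (all details invented; the real failure mode would be subtler):

```python
from typing import Dict

Resolution = Dict[str, float]  # segment name -> cell size, as before

def glitchy_E(res: Resolution) -> bool:
    """Toy simulator: the true answer is 'it snaps' (True), but an invented bug reports
    'safe' (False) whenever the cell sizes strictly alternate small/big along the plane."""
    sizes = [res[seg] for seg in sorted(res)]  # assume segment names sort in spatial order
    alternating = len(sizes) >= 3 and all(
        (sizes[i] - sizes[i + 1]) * (sizes[i + 1] - sizes[i + 2]) < 0
        for i in range(len(sizes) - 2)
    )
    return not alternating  # the bug flips the answer exactly on alternating patterns

# Whatever one player bids, the other can refine it further into an alternating pattern
# (forcing 'safe') or into a uniform one (forcing 'snaps'), so the evaluations along
# ever-finer world models never settle down: the net really has no limit.
```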
I find this problem interesting because it attacks one of the core vulnerabilities that I think debate proposals struggle with: grounding in reality. You can’t really design a system to “return the correct answer” without somehow specifying what makes an answer correct. I tried to ground topological debate in this pre-existing ordering on computations that gets handed to us and which is taken to be a canonical characterization of the problem we want to solve. In practice, though, that’s really just kicking the can down the road: any user would have to come up with a simulation program or method of comparing simulation programs which encapsulates their question of interest. That’s not an easy task.
Still, I don’t think we need to give up so easily. Maybe we don’t ground ourselves by assuming that the user has a simulation program but instead ground ourselves by assuming that the user can check whether a simulation program or comparison between simulation programs is valid. For example, suppose we’re in the alternating-grid-cell-sizes example. Intuitively the correct debater should be able to isolate an extreme example and go to the human and say “hey, this behavior is ridiculous, your software is clearly broken here!” I will think about what a mathematical model of this setup might look like. Of course, we just kicked the can down the road, but I think that there should be some perturbation of these ideas which is practical and robust.
Maybe I should also expand on what the “AI agents are submitting the programs themselves subject to your approval” scenario could look like. When I talk about a preorder on Turing Machines (or some subset of Turing Machines), you don’t have to specify this preorder up front. You just have to be able to evaluate it and the debaters have to be able to guess how it will evaluate.
If you already have a specific simulation program in mind then you can define ≤ as follows: if you’re handed two programs which are exact copies of your simulation software with different hard-coded world models, then you consult your ordering on world models; if one submission is even a single character different from your intended program, then it’s automatically less; if both programs differ from your program, then you decide arbitrarily. What’s nice about the “ordering on computations” perspective is that it naturally generalizes to situations where you don’t follow this construction.
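Concretely, that construction might look something like this (where `parse` and `order` are placeholders for “recognize an exact copy of my simulator and extract its world model” and “my ordering on world models”):

```python
from typing import Callable, Optional, TypeVar

W = TypeVar("W")  # your world models

def make_leq(parse: Callable[[str], Optional[W]],
             order: Callable[[W, W], bool]) -> Callable[[str, str], bool]:
    """Lift an ordering on world models to a preorder on submitted program texts.
    `parse(p)` returns the hard-coded world model w if p is, character for character,
    your intended simulator with w plugged in; otherwise it returns None."""
    def leq(p1: str, p2: str) -> bool:
        w1, w2 = parse(p1), parse(p2)
        if w1 is not None and w2 is not None:
            return order(w1, w2)  # both are exact copies of your simulator: use your order
        if w1 is None and w2 is not None:
            return True           # p1 deviates from your program: it's automatically less
        if w1 is not None and w2 is None:
            return False          # p2 deviates: it can't be above an exact copy
        return True               # both deviate: decide arbitrarily (here: call them equivalent)
    return leq
```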
What could happen if we don’t supply our own simulation program via this construction? In the planes example, maybe the “snap” debater hands you a 50,000-line simulation program with a bug so that if you’re crafty with your grid sizes then it’ll get confused about the material properties and give the wrong answer. Then the “safe” debater might hand you a 200,000-line simulation program which avoids / patches the bug so that the crafty grid sizes now give the correct answer. Of course, there’s nothing stopping the “safe” debater from having half of those lines be comments containing a Lean proof using super annoying numerical PDE bounds or whatever to prove that the 200,000-line program avoids the same kind of bug as the 50,000-line program.
When you think about it that way, maybe it’s reasonable to give the “it’ll snap” debater a chance to respond to the “it’s safe” debater’s comments. Now maybe we change the type of ≤ from being a subset of (Turing Machines) × (Turing Machines) to being a subset of (Turing Machines) × (Turing Machines) × (Justifications from safe debater) × (Justifications from snap debater). In this manner, deciding how you want ≤ to behave can become a computational problem in its own right.
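In types, the change might look something like this (Program/Justification are just stand-in aliases):

```python
from typing import Callable

Program = str        # submitted simulator source text
Justification = str  # e.g. an embedded Lean proof, an informal argument, or a counterexample input

# Old shape: a subset of (Turing Machines) x (Turing Machines)
SimpleLeq = Callable[[Program, Program], bool]

# New shape: a subset of (Turing Machines) x (Turing Machines) x (Justifications) x (Justifications);
# deciding it might mean checking a proof, running suggested spot tests, or asking a human.
ExtendedLeq = Callable[[Program, Program, Justification, Justification], bool]
```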