Neural nets have adversarial examples. Adversarial optimization of part of the input can make the network do all sorts of things, including performing computations.
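To make that concrete, here is a minimal numpy sketch (my own toy example, not from the discussion above): a small fixed network, and a crude random search over just a few entries of the input, which is typically enough to push the network's output towards whatever target class we pick.

```python
import numpy as np

# Toy illustration (not from the original discussion): a tiny fixed "network"
# and a blunt random-search optimiser over part of the input. With enough
# optimisation pressure on the patch alone, the output can be pushed around.

rng = np.random.default_rng(0)

# A small fixed 2-layer network with random weights.
W1 = rng.normal(size=(32, 64))
W2 = rng.normal(size=(64, 10))

def network(x):
    h = np.tanh(x @ W1)        # hidden layer
    return h @ W2              # 10 output "logits"

x = rng.normal(size=32)        # the "natural" input
target = 7                     # class we want to force
patch_idx = np.arange(8)       # we may only modify these 8 input entries

best_patch = x[patch_idx].copy()
best_score = network(x)[target]

for _ in range(20000):         # blunt hill climbing over the patch
    candidate = best_patch + 0.1 * rng.normal(size=len(patch_idx))
    x_adv = x.copy()
    x_adv[patch_idx] = candidate
    score = network(x_adv)[target]
    if score > best_score:     # keep changes that push the target logit up
        best_score, best_patch = score, candidate

x_adv = x.copy()
x_adv[patch_idx] = best_patch
print("original prediction:   ", np.argmax(network(x)))
print("adversarial prediction:", np.argmax(network(x_adv)))
```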
If you optimise the inputs to a buggy program hard enough, you get something that crashes the program in a way that happens to score highly.
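In the same spirit, a toy sketch of that failure mode (again my own illustration, with a deliberately planted bug standing in for a literal crash): the optimiser is meant to maximise a sensible score, but the highest-scoring inputs it finds are the ones that trip the bug rather than anything the scorer's author intended.

```python
import numpy as np

# Toy illustration (my own, with a deliberately planted bug): the "program"
# is supposed to reward inputs near 3.0, but an unintended branch returns a
# huge bogus score for negative inputs. Optimising the input hard enough
# finds the bug, not the intended optimum.

def buggy_score(x):
    if x < 0:                    # the bug: a path the author never meant to reward
        return 1e9 / (abs(x) + 1e-9)
    return -(x - 3.0) ** 2       # the intended objective, maximised at x = 3

rng = np.random.default_rng(1)
best_x, best_s = 0.0, buggy_score(0.0)

for _ in range(10000):           # blunt random-walk hill climbing over the input
    cand = best_x + rng.normal(scale=1.0)
    s = buggy_score(cand)
    if s > best_s:
        best_x, best_s = cand, s

print(best_x, best_s)            # ends up exploiting the x < 0 branch, not near 3.0
```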
I suspect that optimal play in most adversarial computer games looks like a game of Core War. https://en.wikipedia.org/wiki/Core_War
Of course, if we really have myopic debate, with no mesa-optimisers, then neither AI is optimising to have a long-term effect or to avoid long-term effects. Each is optimising for a short-term action, and to defend against its adversary.
If AI1 manages to “persuade” the human not to look at AI2’s “arguments” early on, then AI1 has free rein to optimise, and the human could end up as a UFAI mesa-optimiser. Suppose AI1 is rewarded when a red button is pressed. The human could walk out of the debate still human-level smart, but with the sole terminal goal of maximising the number of red buttons pressed universe-wide. If the human is an AI researcher, this could potentially end with them building a button-pressing ASI.
Another option I consider fairly likely is that the human ends up permanently non-functional: possibly dead, possibly a twitching mess. After all, any fragment of behavioural pattern resembling “functional human” is attack surface. Suppose there is a computer running buggy and insecure code. You are given the first go at hacking it. After your go, someone else will try to hack it, and your code has to repel them. You are both optimising for a simple formal objective, like the average pixel colour of the screen.
Once you get your virus into the system, you want to make it as hard as possible for your opponent to follow after you. Your virus will probably wipe the OS, cut all network connections, and blindly emit your preferred output.
That’s plausibly a good strategy in the human case too: a simple cognitive pattern that repeats itself and blindly outputs your preferred action, wiping away any complex dependence on observation that could be used to break the cycle.
A third possibility is just semi-nonsense that creates a short-term compulsion or a temporary trance state.
The human brain can recover fairly well from the unusual states caused by psychedelics. I don’t know how that compares to recovering from unusual states caused by strong adversarial optimization pressure. In the ancestral environment there would have been some psychoactive substances, and some weak adversarial optimization from other humans.
I would be more surprised if optimal play gave an answer that looked like an actual plan. (Think of a plan for a perpetual motion machine, together with a detailed technical physics argument for why it should work, that just has one small 0/0 hidden somewhere in the proof.)
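For concreteness, the classic schoolbook version of a derivation with a small 0/0 hidden in it (my example, not the commenter's): every step looks like routine algebra, but one step silently divides by a - b = 0.

```latex
% Assume a = b. Each step is valid algebra except the marked one.
\begin{align*}
a &= b \\
a^2 &= ab \\
a^2 - b^2 &= ab - b^2 \\
(a+b)(a-b) &= b(a-b) \\
a + b &= b \qquad \text{(invalid: divides both sides by } a-b = 0\text{)} \\
2b &= b \\
2 &= 1
\end{align*}
```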
I would be even more surprised if the plan actually worked, i.e. if the optimal debating AIs actually output highly useful info.
For AIs strongly restricted in output length, I think debate might produce useful info on the level of maths hints (“renormalize the vector, then differentiate”): something that a human can follow and check quickly. If you don’t have the bandwidth to hack the brain, you can’t send complex novel plans, just a little hint towards a problem the human was on the verge of solving themselves. Of course, the humans might well follow the rules in this situation.
Sure—there are many ways for debate to fail with extremely capable debaters. Though most of the more exotic mind-hack-style outcomes seem a lot less likely once you’re evaluating local nodes with ~1000 characters for each debater.
However, all of this comes under my:
I’ll often omit the caveat “If debate works as intended aside from this issue…”
There are many ways for debate to fail. I’m pointing out what happens even if it works.
I.e. I’m claiming that question-ignoring will happen even if the judge is only ever persuaded of true statements, gets a balanced view of things, and is neither manipulated, nor mind-hacked (unless you believe a response of “Your house is on fire” to “What is 2 + 2?” is malign, if your house is indeed on fire).
Debate can ‘work’ perfectly, with the judge only ever coming to believe true statements, and your questions will still usually not be answered. (because [believing X is the better answer to the question] and [deciding X should be the winning answer, given the likely consequences] are not the same thing)
The fundamental issue is: [what the judge most wants] is not [the best direct answer to the question asked].