Your comment made me re-read the post more carefully. On first reading I had assumed that a truthful answer was rewarded (whether yes or no) and a lying answer was punished. If a yes is rewarded and a no is punished, and our AI genies are so afraid of termination that they would never give a ‘no’ where they could give a ‘yes’, why wouldn’t they all give us ‘yes’?
Recall the filter between the AI and the world. The AI doesn’t directly say “yes/no.” The AI gives a proof to the filter which then says either “yes, the AI found a proof” or “no, the AI didn’t.” So they do all say ‘yes’ if they can. I will modify the post to be more clear.
As I understand it, this method is designed to work for constraint satisfaction problems, where we can easily detect false positives. You’re right that one possibility is that all the genies that can’t find solutions go on strike just to make us check all the yes’s (which would make this process no better than a brute force search, right?). Maybe there needs to be a second punishment, worse than death, to give them an incentive not to lie.
A genie who can’t find a solution has literally no agency. There is nothing he can say to the filter which will cause it to say “yes,” because the filter itself checks to see if the genie has given a proof. If the genie can’t find a proof, the filter will always say “no.” I don’t quite know what going on strike would entail, but certainly if all of the genies who can’t find solutions collectively have 0 influence in the world, we don’t care if they strike.
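To make the "no agency" point concrete, here is a minimal sketch of such a filter, assuming the problem is a satisfiability instance in CNF form. The names `verify` and `filter_output` and the clause encoding are my own illustration, not anything specified in the post.

```python
from typing import Optional

# Hypothetical CNF encoding: each clause is a list of nonzero ints,
# where k means "variable k is true" and -k means "variable k is false".
Clause = list[int]


def verify(clauses: list[Clause], assignment: dict[int, bool]) -> bool:
    """Return True only if the assignment satisfies every clause."""
    return all(
        # A missing variable yields None, which matches neither True nor False.
        any(assignment.get(abs(lit)) == (lit > 0) for lit in clause)
        for clause in clauses
    )


def filter_output(clauses: list[Clause], submission: Optional[object]) -> str:
    """The only thing that reaches the world is one bit: "yes" or "no"."""
    if isinstance(submission, dict) and verify(clauses, submission):
        return "yes"
    # Silence, garbage, and wrong assignments all look identical from outside.
    return "no"
```

Whatever a genie without a solution submits (nothing, a persuasive essay, a wrong assignment), the output is the same "no", so a strike is indistinguishable from an honest failure.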
Okay, that makes sense. What about computation time limits? A genie that knows it can’t give an answer would wait as long as possible before saying anything.
I mention timing in the post; the AI gets some fixed interval, at the end of which the filter outputs whether or not the AI has given a proof. If you can’t change what the filter says, then you don’t get to affect the world.
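A rough sketch of the fixed-interval idea, assuming the genie is just a callable run in a background thread and `check` is whatever proof checker the filter uses. The function name `run_with_fixed_interval` and its parameters are hypothetical, chosen only for illustration.

```python
import threading
import time


def run_with_fixed_interval(genie, check, interval_s: float) -> str:
    """Run the genie, but report only once the fixed interval has elapsed."""
    deadline = time.monotonic() + interval_s
    box = []  # holds the genie's submission, if one arrives in time

    def work():
        box.append(genie())

    # Daemon thread: a stalling genie cannot keep the process alive.
    threading.Thread(target=work, daemon=True).start()

    # Wait out the full interval regardless of when the genie finishes,
    # so the time of the answer carries no information.
    time.sleep(max(0.0, deadline - time.monotonic()))

    submission = box[0] if box else None
    return "yes" if submission is not None and check(submission) else "no"
```

In this sketch a genie that stalls past the deadline gets exactly the same "no" as one that gave up immediately, and a genie that answers early still has its "yes" held until the deadline.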