If you ask questions of the form “will the AI do X in situation Y?”, the answer will usually be the answer you want to hear for a UDT AI. This is because the AI, if hypothetically presented with situation Y, would predict that the same situation might be used to test it. This makes the whole thing a Newcomb problem from its point of view, and it will behave itself until presented with an opportunity for exploitation (that is, a situation outside the test). The idea might be salvaged if, instead of considering possible actions of the AI, we consider possible results of processes outside the AI’s control.
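To make the Newcomb-style structure concrete, here is a deliberately crude toy model; the payoffs, the test probability, and the “behave/defect” framing are invented for the illustration and are not part of the argument above:

```python
# Toy model: an agent that cannot tell whether situation Y is a test of Y
# itself or the real deployment. All payoff numbers below are made up.

def expected_utility(action: str, p_test: float) -> float:
    """Expected utility when the agent assigns probability p_test to
    'this instance of situation Y is a test'."""
    payoff_real = {"behave": 1.0, "defect": 10.0}    # exploiting only pays off for real
    payoff_test = {"behave": 1.0, "defect": -100.0}  # caught defecting -> never deployed
    return (1 - p_test) * payoff_real[action] + p_test * payoff_test[action]

def chosen_action(p_test: float) -> str:
    return max(["behave", "defect"], key=lambda a: expected_utility(a, p_test))

# Even a modest credence that Y is a test makes "behave" optimal, so the
# hypothetical "will the AI do X in Y?" query returns the reassuring answer.
for p in (0.0, 0.1, 0.5):
    print(p, chosen_action(p))  # 0.0 -> defect; 0.1 and 0.5 -> behave
```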
Btw, I haven’t fully grokked your method of getting honest answers, but another method is using a safe math oracle, or an unsafe oracle in a cryptographic box (as long as the answer to the question is sufficiently compact to prevent the unsafe oracle from exploiting the answer channel as access to the outside world).
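A minimal sketch of the “sufficiently compact answer” condition, under the simplest interpretation (a fixed, tiny bit budget on the only channel leaving the box); the constant and function names are mine, not part of any concrete proposal:

```python
# Sketch: whatever the boxed oracle computes internally, only a fixed,
# very small number of bits ever leaves the box. The bandwidth of the
# answer channel, not trust in the oracle, is what limits its ability to
# use the answer to act on the outside world.

MAX_ANSWER_BITS = 1  # e.g. a single yes/no answer

def release_answer(raw_answer: bytes) -> int:
    """Pass the oracle's raw output through the low-bandwidth channel."""
    if not raw_answer:
        raise ValueError("oracle produced no answer")
    # Keep only the lowest MAX_ANSWER_BITS bits; everything else the
    # oracle produced stays inside the box, unread.
    return raw_answer[-1] & ((1 << MAX_ANSWER_BITS) - 1)
```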
EDIT: The “safe” math oracle approach is of limited use, though, since if the question contains anything hinting at the unsafe AI, the math oracle might decide to run the unsafe AI’s code.
> This is because the AI, if hypothetically presented with situation Y, would predict that the same situation might be used to test it.
If we use the modify-utility version of probability change, the AI cares only about those (very very very rare) universes in which X is what it wants and its output is not read.
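Writing out my reading of the modify-utility trick (my own notation; the indicator-based form is an assumption about what is meant): the original utility $U$ is reweighted so that only the rare worlds in which the output channel goes unread contribute to the expectation the AI maximises, which is why its report cannot be a lever on anything it cares about:

$$U'(w) \;=\; U(w)\cdot\mathbf{1}[\text{the AI's output is not read in } w],\qquad a^{*} \;=\; \arg\max_{a}\ \mathbb{E}\big[\,U'(w)\,\big|\,a\,\big].$$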
When you say “we can check what would happen if the AI was given the ability to launch the world’s nuclear arsenals”, do you mean the question you’re really asking is “would it be good (from the AI’s perspective) if the nuclear missiles were launched as a result of some fluke”? Because if you’re literally asking “would the AI launch nuclear missiles”, then you run into the Newcomb obstacle I described, since the AI that might launch nuclear missiles is the “wild type” AI without special control handicaps.
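To spell out the distinction in (my own, informal) notation: the first question evaluates an event outside the AI’s control, while the second asks about the AI’s own policy, which is exactly what the Newcomb argument above lets it game:

$$\underbrace{\mathbb{E}\big[\,U_{\text{AI}} \,\big|\, \text{missiles launched by a fluke}\,\big]}_{\text{question about an external event}} \qquad\text{vs.}\qquad \underbrace{\pi_{\text{AI}}(\text{situation with launch access}) \stackrel{?}{=} \text{launch}}_{\text{question about the AI's own policy}}.$$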
Also, there are questions for which you don’t want to know the detailed answer. For example, “the text of a speech of an important world leader” that the AI would create is something you don’t want to expose your own brain to.