It seems quite challenging to make the AI represent the posterior in a human-understandable way. Moreover, if the attacker can manipulate the posterior, it can purposefully shape it to be malicious when inspected.
Also, this is precisely the case where experience with weaker systems is close to useless. This effect only appears when the agent is capable of reasoning at least as sophisticated as the reasoning you used to come up with this problem, so the agent would have to be at least as intelligent as Paul Christiano. More precisely, the agent would have to be able to reason in detail about possible superintelligences, including predicting their most likely utility functions. The first AI to have this property might already be superintelligent itself.
I suspect that worrying about “exotic” failure modes might be beneficial precisely because few other people will worry about them. The reason few other people will worry about them is that they sound like something from science fiction (even more so than the “normal” AI risk stuff), which is not a good reason to ignore them.
In any case, I hope that at some point we will have a mathematical model sophisticated enough to formalise this failure mode, which would allow us to think about it much more clearly.