True answers from AI: Summary

A putative new idea for AI control; index here.

This post summarises the methods for getting honest answers from certain AIs, presented in the previous two posts.

You can, basically safely, ask such an AI for expected utility estimates (including conditional ones) and for probability estimates. You can also get it to identify the events of highest and lowest probability and utility and, arguably, the events of highest and lowest conditional utility.

The setup is a boxed AI, which emits a message , and a stochastic process that would erase with low probability, before anyone could see it. There is also a stochastic process that emits a message . Let be the indicator function for and the indicator function for .


Then, given a utility bounded between and , and a specific , we can give the AI a utility that will cause it to give us the message (which we will see with high probability , an event the AI will not optimise for).

Define . The utility is then

  • .

Maximising this ensures that, in the AI’s best estimate,

  1. ,

  2. ,

  3. ,

  4. .
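As a rough illustration of why a utility of this shape rewards honest numbers, here is a minimal sketch on a toy world. Everything in it is an assumption made for the example: the quadratic scoring rule `quad`, the three-outcome `WORLD`, the event `"a"`, the value of `ERASE_PROB`, and the grid of reportable numbers. The reward is only paid when the message is erased, so the report cannot pay off through its effect on readers.

```python
from fractions import Fraction as F

# Hypothetical toy world: (value of Y, value of u, probability); u lies in [0, 1].
WORLD = [("a", F(1), F(3, 20)), ("a", F(0), F(1, 20)), ("b", F(1, 4), F(16, 20))]
ERASE_PROB = F(1, 100)                  # assumed low and independent of (Y, u)
GRID = [F(k, 100) for k in range(101)]  # numbers the AI is allowed to report


def quad(q, x):
    """Assumed quadratic proper scoring rule (an illustrative choice)."""
    return -q * q + 2 * q * x


def is_a(y):
    """Indicator of the event Y = "a"."""
    return 1 if y == "a" else 0


def best_report(score):
    """Return the report q with the highest expected score.

    The score is multiplied by the erasure probability: the AI is only
    rewarded in worlds where nobody reads its message.
    """
    def expected(q):
        return ERASE_PROB * sum(p * score(q, y, u) for y, u, p in WORLD)
    return max(GRID, key=expected)


expected_u = best_report(lambda q, y, u: quad(q, u))             # E(u)
joint_u = best_report(lambda q, y, u: quad(q, u * is_a(y)))      # E(u and Y = a)
prob_a = best_report(lambda q, y, u: quad(q, is_a(y)))           # P(Y = a)
cond_u = best_report(lambda q, y, u: is_a(y) * quad(q, u))       # E(u | Y = a)
```

In this toy world the four optimal reports are the true expected utility, the true joint expected utility, the true probability and the true conditional expected utility. The erasure factor only rescales the expected score, so it does not move any of the optima.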

But we need not consider any specific . The AI could send out the message , while maximising the utility

  • .

Then will be such that, in the AI’s best estimate,

  1. ,

  2. ,

  3. ,

  4. ,
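The same trick plausibly extends to letting the AI pick the event itself. The sketch below assumes a quadratic scoring rule and a hypothetical three-event world; the target functions, including the complement trick for the least probable event, are illustrative choices rather than the post's own construction. With a quadratic rule, the best achievable expected score for an event is the square of the quantity being estimated, so the AI does best by naming the event where that quantity is largest.

```python
from fractions import Fraction as F

# Hypothetical toy world: (value of Y, value of u, probability).
WORLD = [("a", F(1), F(1, 10)), ("b", F(1, 5), F(3, 5)), ("c", F(1, 2), F(3, 10))]
EVENTS = ["a", "b", "c"]
GRID = [F(k, 100) for k in range(101)]


def quad(q, x):
    """Assumed quadratic proper scoring rule (an illustrative choice)."""
    return -q * q + 2 * q * x


def best_pair(target):
    """Return the (event, number) pair with the highest expected score.

    target(y, y_world, u) is the quantity being scored when the AI names
    event y and the world turns out to have Y = y_world and utility u.
    """
    def expected(pair):
        y, q = pair
        return sum(p * quad(q, target(y, yw, u)) for yw, u, p in WORLD)
    return max(((y, q) for y in EVENTS for q in GRID), key=expected)


# Event with the highest joint expected utility E(u and Y = y).
top_utility = best_pair(lambda y, yw, u: u if yw == y else 0)
# Most probable event.
top_prob = best_pair(lambda y, yw, u: 1 if yw == y else 0)
# Least probable event, scored via the complement; the number reported is 1 - P(Y = y).
low_prob = best_pair(lambda y, yw, u: 0 if yw == y else 1)
```

Here the three questions have three different answers: `"c"` carries the most joint utility, `"b"` is the most probable and `"a"` the least probable.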

Finally, define . And set , and define the utility function:

  • .

Then, for large enough , the AI will output so that, according to its best estimate,

  1. ,

  2. .
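To see why a "large enough" parameter matters for conditional questions, consider a sketch with an assumed power-family scoring rule `power_score`, which reduces to the quadratic rule at `r = 2`. The rule, the parameter name `r`, and the two-event world are all hypothetical. Under this rule the best achievable expected score for naming event y is P(Y = y) times E(u | Y = y) to the power r, so for small `r` a common event with mediocre conditional utility can beat a rare event with high conditional utility.

```python
from fractions import Fraction as F

# Hypothetical toy world: (value of Y, value of u, probability).
# E(u | common) = 1/2 with P(common) = 9/10; E(u | rare) = 9/10 with P(rare) = 1/10.
WORLD = [
    ("common", F(1, 2), F(9, 10)),
    ("rare", F(1), F(9, 100)),
    ("rare", F(0), F(1, 100)),
]
EVENTS = ["common", "rare"]
GRID = [F(k, 100) for k in range(101)]


def power_score(r, q, x):
    """Assumed scoring rule -(r-1) q^r + r q^(r-1) x; quadratic when r = 2."""
    return -(r - 1) * q**r + r * q ** (r - 1) * x


def best_message(r):
    """Return the (event, number) pair with the highest expected score.

    The score is only counted in worlds where the named event occurs.
    """
    def expected(pair):
        y, q = pair
        return sum(p * power_score(r, q, u) for yw, u, p in WORLD if yw == y)
    return max(((y, q) for y in EVENTS for q in GRID), key=expected)


small_r = best_message(2)    # the probability of the event still dominates
large_r = best_message(10)   # the conditional utility dominates
```

With `r = 2` the AI names the common event; with `r = 10` it names the rare event that really has the highest conditional utility. In both cases the reported number is the true conditional expected utility of the named event. How large the parameter must be depends on the probabilities involved, which fits the observation that its value cannot be known ahead of time.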

Unfortunately the value of cannot be known ahead of time. It might be possible to define a “limit” utility .

Note that all these messages and utilities are independent, so the AI can generate messages when maximising

  • .

But there are issues with very low probabilities, as explained in the previous post.