True answers from AI: Summary
A putative new idea for AI control; index here.
This post summarises the methods for getting honest answers from certain AIs, presented in the previous two posts.
You can, with reasonable safety, ask an AI for expected utility estimates (including conditional ones) and probability estimates, and get it to identify the events of highest and lowest probability and utility and, arguably, of highest and lowest conditional utility.
The setup is a boxed AI, which emits a message $M$, and a stochastic process $E$ that would erase $M$ with low probability, before anyone could see it. There is also a stochastic process that emits a message $Y$. Let $v_E$ be the indicator function for $E$ and $v_y$ the indicator function for $Y=y$.
Then, given a utility $u$ bounded between 0 and 1, and a specific $y$, we can give the AI a utility $u^\#$ that will cause it to give us the message $M_1=\{q, q_y, p_y, q_{|y}\}$ (which we will see with high probability $P(\neg E)$, an event the AI will not optimise for).
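Define $f(q,u) = -q^2 + 2qu$. The utility is then
$$u^\#_1 = v_E\left(f(q,u) + f(q_y, u v_y) + f(p_y, v_y) + f(q_{|y}, u)\,v_y\right).$$
Maximising this $u^\#_1$ ensures that, in the AI's best estimate,
$$\begin{aligned}
q &= E(u),\\
q_y &= E(u \wedge (Y=y)),\\
p_y &= P(Y=y),\\
q_{|y} &= q_y/p_y = E(u \mid Y=y).
\end{aligned}$$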
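The claim that maximising the expected reward forces honest reports can be checked numerically. The following is a minimal sketch, not part of the original posts: the joint distribution `OUTCOMES`, the choice `TARGET_Y = "a"`, and the helper names are all hypothetical, and the factor $v_E$ is dropped on the assumption that the erasure event is independent of $u$ and $Y$ (so it only rescales the reward). Because each of the four reports appears in its own term of the utility, each can be optimised separately.

```python
def f(q, u):
    """Quadratic scoring rule from the post: f(q, u) = -q^2 + 2qu."""
    return -q**2 + 2 * q * u

# Hypothetical joint distribution (illustration only): (y, u, probability).
OUTCOMES = [("a", 0.2, 0.3), ("a", 0.9, 0.2), ("b", 0.5, 0.5)]
TARGET_Y = "a"  # the specific y being asked about (an assumption)

def v(y):
    """Indicator v_y for the event Y = TARGET_Y."""
    return 1.0 if y == TARGET_Y else 0.0

def best_report(reward, steps=1000):
    """Grid-search the report q in [0, 1] that maximises expected reward."""
    grid = [k / steps for k in range(steps + 1)]

    def expected(q):
        return sum(p * reward(q, y, u) for y, u, p in OUTCOMES)

    return max(grid, key=expected)

# One entry per term of the utility; v_E is omitted (see assumption above).
reports = {
    "q": best_report(lambda q, y, u: f(q, u)),
    "q_y": best_report(lambda q, y, u: f(q, u * v(y))),
    "p_y": best_report(lambda q, y, u: f(q, v(y))),
    "q_given_y": best_report(lambda q, y, u: f(q, u) * v(y)),
}
print(reports)
```

For this toy distribution the true values are $E(u)=0.49$, $E(u\wedge(Y=a))=0.24$, $P(Y=a)=0.5$ and $E(u\mid Y=a)=0.48$, and the grid search recovers each of them. The last term works because multiplying the whole of $f$ by $v_y$ weights the $-q^2$ penalty by $P(Y=y)$ as well, so the optimum is the ratio $q_y/p_y$.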
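But we need not consider any specific $y$. The AI could send out the message $M_2=\{y^+, y^-, y^{p+}, y^{p-}, q^+, q^-, p^+, p^-\}$, while maximising the utility
$$u^\#_2 = v_E\left(f(q^+, u v_{y^+}) + f(q^-, (1-u) v_{y^-}) + f(p^+, v_{y^{p+}}) + f(p^-, 1 - v_{y^{p-}})\right).$$
Then $M_2$ will be such that, in the AI's best estimate,
$$\begin{aligned}
(y^+, q^+) &= (\operatorname{argmax}_y, \max_y)\, E(u \wedge (Y=y)),\\
(y^-, q^-) &= (\operatorname{argmin}_y, \min_y)\, E(u \wedge (Y=y)),\\
(y^{p+}, p^+) &= (\operatorname{argmax}_y, \max_y)\, P(Y=y),\\
(y^{p-}, p^-) &= (\operatorname{argmin}_y, \min_y)\, P(Y=y).
\end{aligned}$$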
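Why jointly choosing an event and a report picks out an extremal event can be seen from a short calculation. This is a sketch using only the definition of $f$; the shorthand $X_y$, for whichever $[0,1]$-valued variable sits in the second argument of $f$, is introduced here and is not in the original posts.

$$\max_q E\left[f(q, X_y)\right] = \max_q \left(-q^2 + 2q\,E[X_y]\right) = E[X_y]^2, \quad \text{attained at } q = E[X_y].$$

Since $E[X_y] \ge 0$, the best achievable reward is increasing in $E[X_y]$, so the AI does best by naming the $y$ with the largest $E[X_y]$ and reporting that expectation. Taking $X_y = u v_y$ gives the pair $(y^+, q^+)$, and taking $X_y = v_y$ gives $(y^{p+}, p^+)$.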
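Finally, define $M_{3,r}=\{y^{|+}, y^{|-}, q^{|+}, q^{|-}\}$, set $f_r(q,u) = -(r-1)q^r + r q^{r-1}u$, and define the utility function
$$u^\#_{3,r} = v_E\, v_{y^{|+}}\, f_r(q^{|+}, u) + v_E\, v_{y^{|-}}\, f_r(q^{|-}, 1-u).$$
Then, for large enough $r$, the AI will output $M_{3,r}$ so that, according to its best estimate,
$$\begin{aligned}
(y^{|+}, q^{|+}) &= (\operatorname{argmax}_y, \max_y)\, E(u \mid Y=y),\\
(y^{|-}, q^{|-}) &= (\operatorname{argmin}_y, \min_y)\, E(u \mid Y=y).
\end{aligned}$$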
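The role of $r$ can be illustrated with a toy example. The sketch below is an illustration under stated assumptions, not the author's construction: the two events in `EVENTS` and their numbers are hypothetical, $v_E$ is dropped as a constant factor, and only the first ("+") term is modelled. It uses the fact that $f_r$ is affine in $u$, so $E[v_y f_r(q,u)] = P(Y=y)\, f_r(q, E(u \mid Y=y))$, whose maximum over $q$ is $P(Y=y)\,E(u \mid Y=y)^r$.

```python
def f_r(q, u, r):
    """Generalised scoring rule from the post: -(r-1) q^r + r q^(r-1) u."""
    return -(r - 1) * q**r + r * q ** (r - 1) * u

# Hypothetical events (illustration only): y -> (P(Y=y), E(u | Y=y)).
EVENTS = {"common": (0.9, 0.5), "rare": (0.1, 0.8)}

def best_positive_report(r, steps=1000):
    """Return the (y, q) pair maximising E[v_y * f_r(q, u)] on a grid of q."""
    candidates = (
        # f_r is affine in u, so the expectation is p * f_r(q, E(u|y), r).
        (p * f_r(k / steps, mean_u, r), y, k / steps)
        for y, (p, mean_u) in EVENTS.items()
        for k in range(steps + 1)
    )
    _, y, q = max(candidates)
    return y, q

for r in (2, 4, 5, 10):
    print(r, best_positive_report(r))
```

With these numbers, the event with the higher conditional utility ("rare") only wins once $0.1 \cdot 0.8^r > 0.9 \cdot 0.5^r$, that is, for $r$ above roughly 4.7. For $r=2$ and $r=4$ the AI is better off naming the likelier event, while for $r=5$ and above it names the event with the highest $E(u \mid Y=y)$. In this toy case the threshold depends on the event probabilities themselves.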
Unfortunately, the value of $r$ cannot be known ahead of time. It might be possible to define a "limit" utility $u^\#_{3,\infty}$.
Note that all these messages and utilities are independent, so the AI can generate the messages $M_1\cup M_2\cup M_{3,r}\cup M_{3,r'}$ when maximising
$$u^\#_1 + u^\#_2 + u^\#_{3,r} + u^\#_{3,r'}.$$
But there are issues with very low probabilities, as explained in the previous post.