Vladimir_Nesov comments on What can you do with an Unfriendly AI?

Vladimir_Nesov 20 Dec 2010 22:31 UTC
0 points

It is not magically incentivized to be honest. It is incentivized to be honest because each query is constructed precisely such that an honest answer is the rational thing to do, under relatively weak assumptions about its utility function. If you ask in plain English, you would actually need magic to produce the right incentives.

My question is about the difference. Why exactly is the plain question different from your scheme?

(Clearly your position is that your scheme works, and therefore “doesn’t assume any magic”, while the plain question doesn’t work, and so “requires magic in order to work”. That tells me nothing I don’t already know, so it doesn’t help.)
- paulfchristiano 21 Dec 2010 1:10 UTC
  0 points
  Parent
  Here is the argument in the post more concisely. Hopefully this helps:
  
  By the construction of the verifier, it is impossible to lie by saying “I was able to find a proof”: if you claim to have found a proof, the verifier needs to see the proof to believe you. So the only way to lie is to say “I was not able to find a proof” when you could have found one had you really tried. Incentivizing the AI to be honest is therefore precisely the same as incentivizing it to avoid saying “I was not able to find a proof” whenever it can avoid doing so. Providing such an incentive is not trivial, but it is basically the easiest possible incentive to provide.
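
  As a toy illustration of this asymmetry (a sketch under assumptions, not the construction from the post): take the "proof" to be a nontrivial factor of a number n, which is cheap to check. The names `check_certificate`, `verify`, and `reward`, and the reward values, are all hypothetical.

```python
from typing import Optional, Tuple

# A response is (claim, certificate). The claim is "found" or "not_found".
Response = Tuple[str, Optional[int]]


def check_certificate(n: int, factor: Optional[int]) -> bool:
    """Toy stand-in for proof checking: is `factor` a nontrivial divisor of n?"""
    return factor is not None and 1 < factor < n and n % factor == 0


def verify(n: int, response: Response) -> str:
    """Accept a "found" claim only if the certificate actually checks out."""
    claim, certificate = response
    if claim == "found" and check_certificate(n, certificate):
        return "accepted"
    if claim == "found":
        # A false "I found a proof" is caught mechanically.
        return "rejected"
    # "not_found" cannot be checked: this is the only place a lie can hide.
    return "no_proof"


def reward(outcome: str) -> int:
    """Assumed incentive: only a verified proof pays out."""
    return {"accepted": 1, "no_proof": 0, "rejected": 0}[outcome]


if __name__ == "__main__":
    for response in [("found", 3), ("found", 4), ("not_found", None)]:
        outcome = verify(15, response)
        print(response, outcome, reward(outcome))
```

  The verifier never has to detect a lie in the "not_found" case. It only has to make a verified "found" worth more than "not_found", which is the sense in which the incentive is easy to state.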
  
  I know of no easy way to incentivize someone to answer the plain question honestly based only on your ability to punish or reward them when you choose to. Punishing them for lying requires being able to tell when they are lying.
  - Vladimir_Nesov 21 Dec 2010 2:04 UTC
    −2 points
    Parent
    See this comment.