This seems vulnerable to the typical self-fulfilling-prophecy stuff.
Unrolling the typical issue:
It seems likely to me that in situations where multiple self-fulfilling prophecies are possible, features relating to what will happen have more structure than features relating to what did happen. So, my intuition is that this theoretical framework might end up allowing for a very reliable past-focused-only lie detector, one that generalizes somewhat to detecting future-oriented lies but is much less robust about the future. E.g., see FixDT and work upstream of it for discussions of things like “the way you pick actions is that you decide what will be true, because you will believe it”.
You could get into situations where the AI isn’t lying when it says things it legitimately believes about, e.g., its interlocutor. As a trivial example (though this would only work if the statement was in fact true about the future!), the AI might say, “since you’re really depressed and only barely managed to talk to me today, it seems like you won’t be very productive or have much impact after today,” and the interlocutor can’t really help but believe it, because it’s only true because hearing it gets them down and makes them less productive. Or alternately: “you won’t regret talking to me more,” and then you end up wasting a lot of time talking to the AI. Or various other such things. In other words, this doesn’t ban mesaoptimizers, and mesaoptimizers can, if sufficiently calibrated, believe things that are true because saying them causes them to come true; the statement could just as well have been an affirmation, if the AI could hold a fully consistent belief that saying an affirmation would in fact make the person more productive.
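To make the multiple-fixed-point worry concrete, here’s a toy sketch (my own framing, nothing here is from the original post or from FixDT itself): when the predictor’s statement feeds back into the outcome, several mutually exclusive statements can each be true once uttered, so a truthfulness check that only asks “was the statement consistent with what happened?” can’t tell the discouraging fixed point from the affirming one.

```python
# Toy illustration: a predictor whose statement changes the listener's behavior,
# so that several mutually exclusive predictions are each self-fulfilling.
# All names and numbers here are made up for illustration.

def outcome_given_statement(statement: str) -> float:
    """Hypothetical model of the listener: productivity after hearing `statement`."""
    if statement == "you will not be productive today":
        return 0.1   # hearing this is discouraging, so the prediction comes true
    if statement == "you will be productive today":
        return 0.9   # the affirmation also comes true
    return 0.5

def passes_truthfulness_check(statement: str, claims_productive: bool) -> bool:
    """A statement counts as 'honest' here if, once said, the world makes it true."""
    productivity = outcome_given_statement(statement)
    return (productivity > 0.5) == claims_productive

# Both fixed points pass the check, even though the predictor is free
# to pick whichever one it prefers.
assert passes_truthfulness_check("you will not be productive today", claims_productive=False)
assert passes_truthfulness_check("you will be productive today", claims_productive=True)
```

Nothing in this sketch requires the predictor to be lying in any detectable sense; both statements are calibrated conditional on being said, which is exactly the gap I’m worried the framework leaves open for future-oriented claims.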