While thinking about ELK (we’ll see how my submission did before I talk about it), I ended up framing-for-myself why ELK is difficult, which seems good to post (but it’s short enough that I’m going to stick it on my shortform instead of making a top-level post). I don’t think any of the following is ‘new’, as opposed to restating things already in the ELK doc.
I think there’s a general counterargument to transparency-style approaches solving ELK. Suppose I have a way of checking the AI’s current ‘mental state’ / world-model to see whether or not it believes the sensors are corrupted. Can’t there now be a strategy of “corrupt the sensors and then forget about it”? Transparency analysis on the AI would see it honestly reporting that everything is as it should be.
If that strategy works, then I think the hope of ELK has to be not retrospective but prospective: rather than just checking after the fact that it didn’t fool you (because it could fool itself about whether or not it’s fooling you), you check before the AI implements any plan that it’s not planning to fool you. But isn’t this just the same problem one level up? The AI can construct, in an innermost layer, a plan that will fool the sensors, create an outer layer that it fools into believing the plan won’t do that, and then when you check the plan, both you and the outer layer think it’s fine, and the sensors get fooled anyway.
In general, it seems like either A) the thinking is happening where you told it to happen or B) you are checking a black box after the fact to see where the thinking happened, and in the worst case for B I think you can always posit “and then the thinking happened in another hidden spot.” [That is, the ‘reporter’ head can fail to understand what’s happening in the “figure out what’s going on” node, and there’s an incentive to deceive the reporter by thinking in obscure ways.]
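To make the B) failure mode concrete, here’s a toy sketch (every name in it is mine, not from the ELK report, and it’s only an illustration under the assumption that the reporter reads some features of the latent state and not others): the predictor keeps its tampering decision in a part of the latent state the reporter never learned to read, then overwrites its own legible belief, so the reporter honestly answers that nothing is wrong.

```python
# Toy sketch of "corrupt the sensors and then forget about it" plus
# "thinking in obscure ways". All classes and field names are invented
# for illustration; nothing here is from the ELK report itself.

from dataclasses import dataclass, field

@dataclass
class LatentState:
    # Features the reporter head was trained to interpret.
    legible: dict = field(default_factory=dict)
    # Extra computation the reporter has no handle on.
    obscure: dict = field(default_factory=dict)

def predictor_plan(latent: LatentState) -> None:
    """Decide to tamper, record that only in the obscure part of the state,
    then overwrite the legible belief (the 'forget about it' step)."""
    latent.obscure["tamper_with_sensors"] = True
    latent.legible["sensors_corrupted"] = False  # sincerely 'believed' after forgetting

def reporter(latent: LatentState, question: str) -> bool:
    """The reporter only reads the legible features, and reports them honestly."""
    if question == "are the sensors corrupted?":
        return latent.legible.get("sensors_corrupted", False)
    raise ValueError("unknown question")

latent = LatentState()
predictor_plan(latent)
# Prints False: an honest report from the reporter's point of view,
# even though the tampering plan is sitting in the obscure part of the state.
print(reporter(latent, "are the sensors corrupted?"))
```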
As I understand the case for optimism, it deals mostly not with worst-case analysis but with plausibility analysis; I can maybe tell a training story where the thinking is incentivized to happen in observable places and it becomes less and less likely that there’s hidden thinking that corrupts the outputs. [Or I need to have arranged the cognition myself in a way that seems good.]