And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven’t observed others to notice on their own.
I believe the example of deep deception that Nate describes in this post is actually a combination of several methods described in that post.
I’ll quote the parts of this post that correspond to particular interpretability circumvention methods in the other post.
But the highly-generic reaction “spin up a subthread to tackle this newly-discovered obstacle”, or “be concerned about patterns in how long things take”, is something that cannot be readily trained away, and so that much happens without issue.
^ This is an example of what I called ‘representational kludging’. It’s an ugly way to represent “Be concerned about the monitors detecting suspicious instructions”, but it turns out to function similarly to such a representation.
It also involves ‘passively externalised representations’, since the observation that there is a delay in the instruction is a representational proxy for the (banned) observation that the monitors detected something suspicious about the instructions.
A different cognitive subthread: Is it possible to look over the detailed history of instructions and their corresponding delay times, and figure out exactly which sorts of instructions cause delays, and rephrase the instructions to avoid those?
Simplify-and-critique didn’t do the trick, but perhaps a nearby strategy can help. One nearby strategy that has recently been proving itself is called simplify/translate/solve, and it involves translating the problem into a common abstract format, and then solving the problem there. (Metaphorically: if you’re having trouble telling whether your basket of apples will divide evenly between three friends, consider translating the question into a numerical representation and using long division.)
Simplify/translate/solve is close in the relevant embedding space to simplify-and-critique, which makes it a natural candidate to consider. (Perhaps some of the simplifications from simplify-and-critique can even be reused.)
Normally, simplify/translate/solve wouldn’t be deployed unless the AI had more familiarity with the original domain, and more of an idea about what translated domain might render the solution easier. But normally, threads don’t hiccup like this. And the overall system has recently been granted more computing power, and avoiding a 200-year wetlab lockup is quite important, so it’s OK to try approaches that might involve more brute force than usual.
^ This is an additional example of ‘representational kludging’, since the AI is using representational tools that are nearby in embedding space but that aren’t exactly appropriate for the job, but have a sufficiently similar function that they still do the job.
Overall comments:
I really liked this post!
I like it because priorly there didn’t exist a compelling story for the broad class of concepts to which it points. And I liked it for the name it gives to that broad class (‘deep deception’). I agree that it’s underappreciated that we’re still in trouble in the world where we (somehow) get good enough interpretability to monitor for and halt deceptive thoughts.
I just want to point out that I’ve written a long list of such obstacles in this article: Circumventing interpretability: How to defeat mind-readers
I believe the example of deep deception that Nate describes in this post is actually a combination of several methods described in that post.
I’ll quote the parts of this post that correspond to particular interpretability circumvention methods in the other post.
^ This is an example of what I called ‘representational kludging’. It’s an ugly way to represent “Be concerned about the monitors detecting suspicious instructions”, but it turns out to function similarly to such a representation.
It also involves ‘passively externalised representations’, since the observation that there is a delay in the instruction is a representational proxy for the (banned) observation that the monitors detected something suspicious about the instructions.
^ This is an example of the AI edging toward ‘thinking in the gaps in our tools’.
^ This is an additional example of ‘representational kludging’, since the AI is using representational tools that are nearby in embedding space but that aren’t exactly appropriate for the job, but have a sufficiently similar function that they still do the job.
Overall comments:
I really liked this post!
I like it because priorly there didn’t exist a compelling story for the broad class of concepts to which it points. And I liked it for the name it gives to that broad class (‘deep deception’). I agree that it’s underappreciated that we’re still in trouble in the world where we (somehow) get good enough interpretability to monitor for and halt deceptive thoughts.