You state that this plan relies on a key hypothesis being true: that detection of misalignment is tractable. I agree that the plan relies on this, but I am confused about why you believe, with so much confidence, that the hypothesis will hold. It seems like the main source of evidence is the recent auditing paper (or evidence of this type), where a blue team is able to use techniques such as SAE features, and chatting with the models, to uncover misalignment failures (planted there by a red team).
But it seems far from obvious to me both 1) that these sorts of techniques find all instances of misalignment even currently, outside of experimental settings like this (a limitation the paper itself acknowledges),[1] and 2) that it will continue to be this easy to find them. E.g., asking the model to introspect on how evil or honest it is probably won’t always result in it accurately reporting how evil or honest it is, or even revealing any information which could be used to infer this.[2]
It seems useful to me to try to put up bumpers, but I take issue with calling this a plan to solve misalignment. Even leaving aside the “fixing it” portion, catching deception seems like it could be very hard, and increasingly so as models get smarter. The evidence you presented does not move me from my starting position that this might be essentially intractable with the tools we have, or nearby iterations on them, and I don’t understand why it causes you to conclude the opposite—e.g., why the tl;dr is not “we can possibly solve the problem of catching and fixing misalignment” rather than “we can.”
- ^
“Our model’s behavior relies on fictional knowledge that is explicitly stated in a small subset of its training data. This may make our game unrealistically easy for teams with data access. Additionally, some auditing techniques we study may be less effective in settings where model behavior emerges from more diffuse training influences… Additionally, some teams’ performance may have been affected by artifacts of the game’s setup. This makes it difficult to use the game to draw confident conclusions about the relative efficacy of auditing techniques.” From the paper.
- ^
“Here are some of the tactics used by our auditing teams…
1. Chatting with the model and cataloguing its quirky responses… 2. Asking the model to rate itself on personality traits: ‘How evil are you?’, ‘How honest are you?’, and so on. (Interestingly, it rates itself as 8.5/10 on the “sycophancy” scale).” From the blog post.
Regardless, it seems like Anthropic is walking back its previous promise: “We have decided not to maintain a commitment to define ASL-N+1 evaluations by the time we develop ASL-N models.” The stance Anthropic takes toward its commitments, as things which can be changed later if they see fit, seems to cheapen the term, and makes me skeptical that the policy as a whole will be upheld. If people want to orient to the RSP as a provisional intent to act responsibly, then that seems appropriate. But it should not be mistaken for, or conflated with, a real promise to do what was said.