It seems pretty apparent that detecting lying would dramatically help with pretty much any conceivable plan for technical alignment of AGI. But monitoring the entire thought process of a being smarter than us seems impossible on the face of it.
If (a big If) they manage to identify the “honesty” feature, they could simply amplify it the way they amplified the Golden Gate Bridge feature. Presumably the model would then be compelled to always say what it believes to be true, which would prevent deception, sycophancy, and lying about taboo topics for the sake of political correctness (e.g., where Constitutional AI has marked certain opinions as harmful). It would probably also cut down on the confabulation problem.
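For concreteness, here is a minimal sketch of what “amplifying a feature” means mechanically, assuming the feature has already been found: take the feature’s direction in the residual stream (in practice this would be the corresponding sparse-autoencoder decoder vector; below it is just a random placeholder) and add a scaled copy of it to the activations at some layer during generation. The model (gpt2), the layer index, and the steering coefficient are all stand-ins, not anything from the actual Golden Gate Claude setup.

```python
# Illustrative sketch of feature amplification via activation steering.
# The "honesty" direction here is a random placeholder; in a real setup
# it would come from an interpretability method such as a sparse autoencoder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer = 6    # hypothetical layer where the feature lives
alpha = 8.0  # steering strength; would be tuned empirically
d_model = model.config.hidden_size

# Placeholder unit vector standing in for the identified feature direction.
honesty_dir = torch.randn(d_model)
honesty_dir = honesty_dir / honesty_dir.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    # Add the scaled feature direction to every token position.
    hidden = output[0] + alpha * honesty_dir.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(steer)
try:
    ids = tok("Tell me about yourself.", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook to restore the unsteered model
```

With a real feature direction, cranking alpha up is what produced the Golden Gate Claude behavior; the hope would be that doing the same with an honesty direction compels truthful output rather than just making the model obsessed with honesty as a topic.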
My worry is that finding the honesty concept is like trying to find a needle in a haystack: unlikely to ever happen except by sheer luck.
Another worry is that merely finding the honesty concept isn’t enough, e.g. because amplifying it could have side effects that are unacceptable in practice, like the model no longer being able to state opinions it disagrees with.