porby comments on How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

porby 15 Dec 2022 19:50 UTC
1 point
Thanks for doing this research! The paper was one of those rare brow-raisers. I had suspected there was a way to do something like this, but I was significantly off in my estimation of its accessibility.
While I’ve still got major concerns about being able to do something like this on a strong and potentially adversarial model, it does seem like a good existence proof for any model that isn’t actively fighting back (like simulators or any other goal-agnostic architecture). It’s a sufficiently strong example that it actually forced my doom chances down a bit, so yay!