Quintin Pope comments on The Case for Radical Optimism about Interpretability

Quintin Pope 18 Dec 2021 21:29 UTC
1 point
My guess is the initial steps of getting the models to lie reliably are orders of magnitude harder than just asking a human to lie to you. Once you know how to prompt models into lying, it’s orders of magnitude easier to generate lots of lie data from the models.