When interpreting human brains, we get plenty of excellent feedback. Calibrating a lie detector might be as easy as telling a few truths and a few lies while in an fMRI.
To be able to use similar approaches for interpreting AIs, it might be necessary to somehow get similar levels of feedback from the AIs. I notice I don’t have the slightest idea whether feedback from an AI was a few orders of magnitude harder to get—compared to getting human feedback—or whether it would be a few orders of magnitude easier or about the same.
My guess is the initial steps of getting the models to lie reliably are orders of magnitude harder than just asking a human to lie to you. Once you know how to prompt models into lying, it’s orders of magnitude easier to generate lots of lie data from the models.
When interpreting human brains, we get plenty of excellent feedback. Calibrating a lie detector might be as easy as telling a few truths and a few lies while in an fMRI.
To be able to use similar approaches for interpreting AIs, it might be necessary to somehow get similar levels of feedback from the AIs. I notice I don’t have the slightest idea whether feedback from an AI was a few orders of magnitude harder to get—compared to getting human feedback—or whether it would be a few orders of magnitude easier or about the same.
Can we instruct GPT-3 to “consciously lie” to us?
My guess is the initial steps of getting the models to lie reliably are orders of magnitude harder than just asking a human to lie to you. Once you know how to prompt models into lying, it’s orders of magnitude easier to generate lots of lie data from the models.