Primer comments on The Case for Radical Optimism about Interpretability

Primer 18 Dec 2021 20:40 UTC
2 points
When interpreting human brains, we get plenty of excellent feedback. Calibrating a lie detector might be as easy as telling a few truths and a few lies while in an fMRI.

To be able to use similar approaches for interpreting AIs, it might be necessary to somehow get similar levels of feedback from the AIs. I notice I don’t have the slightest idea whether feedback from an AI was a few orders of magnitude harder to get—compared to getting human feedback—or whether it would be a few orders of magnitude easier or about the same.

Can we instruct GPT-3 to “consciously lie” to us?
- Quintin Pope 18 Dec 2021 21:29 UTC
  1 point
  Parent
  My guess is the initial steps of getting the models to lie reliably are orders of magnitude harder than just asking a human to lie to you. Once you know how to prompt models into lying, it’s orders of magnitude easier to generate lots of lie data from the models.