Bogdan Ionut Cirstea comments on Interpreting the Learning of Deceit

Bogdan Ionut Cirstea 29 Dec 2023 15:50 UTC
4 points
Great post! I’d be very excited for someone to try this out, e.g. on DPO-ing/RLHF-ing base Llama, potentially starting from this previous work, “Localizing Lying in Llama”: https://twitter.com/mezaoptimizer/status/1729981499397603558. It’s also probably the most impactful model internals / interpretability project that I can currently think of.
What links here?
- Interpreting the Learning of Deceit by RogerDearnaley (18 Dec 2023 8:12 UTC; 30 points)
- Bogdan Ionut Cirstea's comment on Bogdan Ionut Cirstea’s Shortform by Bogdan Ionut Cirstea (29 Dec 2023 17:26 UTC; 1 point)
- RogerDearnaley 31 Dec 2023 2:02 UTC
  2 points
  Thanks, that’s a great link to a very interesting paper. I’ve taken the liberty of adding it to the post, in case not everyone reads the comments.