Great post. I’d be very excited for someone to try this out, e.g. on DPO-ing/RLHF-ing base Llama, potentially starting from this previous work, “Localizing Lying in Llama”: https://twitter.com/mezaoptimizer/status/1729981499397603558. It’s also probably the most impactful model internals / interpretability project that I can currently think of.
Thanks, that’s a great link to a very interesting paper. I’ve taken the liberty of adding it to the post, in case not everyone reads the comments.