In my book, this is probably the most impactful model internals / interpretability project I can think of: https://www.lesswrong.com/posts/FbSAuJfCxizZGpcHc/interpreting-the-learning-of-deceit?commentId=qByLyr6RSgv3GBqfB