Marius Hobbhahn comments on The Defender’s Advantage of Interpretability

Marius Hobbhahn 12 Oct 2022 9:33 UTC
1 point
0
Just to be clear, I also think that your grokking work increases alignment much more than capabilities on balance.

I think the way in which it increases capabilities would roughly look like this: “your insight on grokking is a key to understanding fast generalization better; other people build on this insight and then modify training; this improves the speed of learning and thus capabilities”.

I think your work is clearly net positive, I just wanted to use a concrete example in the post to show that there are trade-offs worth taking.