Zach Furman comments on A shot at the diamond-alignment problem

Zach Furman 6 Oct 2022 22:26 UTC
7 points
This proposal looks really promising to me. This might be obvious to everyone, but I think much better interpretability research is needed to make this possible in a safe(ish) way (to verify that the shard does develop, that it isn’t misaligned, etc.). We’d just need to avoid the temptation to take the fancy introspection and interpretability tools this would require and use them as optimization targets, which would obviously make them useless as safeguards.