This proposal looks really promising to me. This might be obvious to everyone, but I think much better interpretability research is really needed to make this possible in a safe(ish) way. (To verify the shard does develop, isn’t misaligned, etc.) We’d just need to avoid the temptation to take the fancy introspection and interpretability tools this would require and use them as optimization targets, which would obviously make them useless as safeguards.
This proposal looks really promising to me. This might be obvious to everyone, but I think much better interpretability research is really needed to make this possible in a safe(ish) way. (To verify the shard does develop, isn’t misaligned, etc.) We’d just need to avoid the temptation to take the fancy introspection and interpretability tools this would require and use them as optimization targets, which would obviously make them useless as safeguards.