Interesting concept. If we have interpretability tools sufficient to check whether a model is aligned, what is gained by having the model use these tools to verify its alignment?
Other ideas for how you can use such an introspective check to keep your model aligned:
Use an automated, untrained system
Use a human
Use a previous version of the model
Never mind, I figured it out. Its use is to get SGD to update your model in the right direction. The above three options only allow you to tell whether your model is unaligned, not necessarily how to keep it aligned. This idea seems very cool!
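For concreteness, here's a minimal sketch of what "folding the check into the training signal" might look like, assuming PyTorch and a hypothetical differentiable `alignment_score` function standing in for the introspective interpretability check (nothing like that function actually exists today):

```python
# Minimal sketch (assumes PyTorch). `alignment_score` is a hypothetical,
# differentiable stand-in for the introspective interpretability check.
import torch
import torch.nn as nn

def alignment_score(model: nn.Module) -> torch.Tensor:
    # Hypothetical: returns a score in (0, 1) where higher means the model's
    # internals "look more aligned". Here it's just a dummy function of the
    # weights so the example runs end to end.
    return torch.sigmoid(sum(p.mean() for p in model.parameters()))

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(8, 4), torch.randn(8, 1)
task_loss = nn.functional.mse_loss(model(x), y)

# Fold the check into the objective: SGD now receives a gradient signal
# pushing the model toward states the check judges as aligned, rather than
# the check merely flagging misalignment after the fact.
loss = task_loss + 0.1 * (1.0 - alignment_score(model))
loss.backward()
optimizer.step()
```

The point of the sketch is just the last few lines: a check that only *detects* misalignment gives you a verdict, whereas a check that feeds into the loss gives SGD a direction to move in.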