Interesting concept. If we have interpretability tools sufficient to check whether a model is aligned, what is gained by having the model use these tools to verify its alignment?
Other ideas for how you can use such an introspective check to keep your model aligned:
Use an automated, untrained system
Use a human
Use a previous version of the model
Never mind, I figured it out. Its use is to get SGD to update your model in the right direction. The above three options only allow you to tell whether your model is unaligned, not necessarily how to keep it aligned. This idea seems very cool!
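For concreteness, here's a minimal sketch of what "folding the check into the training signal" might look like, assuming PyTorch and a hypothetical differentiable `alignment_score` function standing in for the introspective interpretability check (nothing like that function actually exists today):

```python
# Minimal sketch (assumes PyTorch). `alignment_score` is a hypothetical,
# differentiable stand-in for the introspective interpretability check.
import torch
import torch.nn as nn

def alignment_score(model: nn.Module) -> torch.Tensor:
    # Hypothetical: returns a score in (0, 1) where higher means the model's
    # internals "look more aligned". Here it's just a dummy function of the
    # weights so the example runs end to end.
    return torch.sigmoid(sum(p.mean() for p in model.parameters()))

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(8, 4), torch.randn(8, 1)
task_loss = nn.functional.mse_loss(model(x), y)

# Fold the check into the objective: SGD now receives a gradient signal
# pushing the model toward states the check judges as aligned, rather than
# the check merely flagging misalignment after the fact.
loss = task_loss + 0.1 * (1.0 - alignment_score(model))
loss.backward()
optimizer.step()
```

The point of the sketch is just the last few lines: a check that only *detects* misalignment gives you a verdict, whereas a check that feeds into the loss gives SGD a direction to move in.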