FWIW, white box alignment doesn’t imply humans understand what the models are thinking. There are other ways to leverage the fact that we have access to the internals.
I was using it as shorthand for alignment done with interpretability, as opposed to alignment without it.