I guess I’m considerably more optimistic about avoiding AI takeover without humans understanding what the models are thinking. (Or possibly you’re more optimistic about slowing down AI.)
Basically this. I am a lot more pessimistic about black box alignment than I am about white box alignment.
FWIW, white box alignment doesn’t imply humans understand what the models are thinking. There are other ways to leverage the fact that we have access to the internals.
I was using it as a synonym for alignment with interpretability, as opposed to alignment without it.