I guess I’m considerably more optimistic about avoiding AI takeover even without humans understanding what the models are thinking.
Basically this. I am a lot more pessimistic about black-box alignment than I am about white-box alignment.
FWIW, white-box alignment doesn’t imply humans understand what the models are thinking. There are other ways to leverage the fact that we have access to the internals.
I was using it as a synonym for alignment with interpretability, as opposed to alignment without it.