Before we even start a training run, we should try to have *actually good* abstract arguments about alignment properties of the AI. Interpretability work is easier if you're just trying to check details relevant to those arguments, rather than trying to figure out the whole AI.
Thanks for the post! I particularly appreciated this point