Thinking over the last few months, I came to most strongly endorse (2: Better prediction of future systems), or something close to it. I think that interpretability should adjudicate between competing theories of generalization and value formation in AIs (e.g. figure out whether, and under what conditions, a network learns a universal mesa objective versus contextually activated objectives). Secondarily, it should figure out the mechanistic picture of how reward events shape different kinds of cognition in a network (e.g. if I reward the agent for writing this line of code, what does the ensuing gradient mean, statistically, across training runs?).
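To make that parenthetical question concrete, here is a minimal sketch of how one might measure the gradient a single reward event induces, assuming a Hugging Face-style causal LM policy and a one-sample REINFORCE-style estimator. The names (`policy`, `input_ids`, `completion_ids`, `reward`) and the cosine-similarity aggregation are illustrative assumptions, not anything claimed above.

```python
import torch

def reward_event_gradient(policy, input_ids, completion_ids, reward):
    """Per-parameter gradient of reward * log pi(completion | prompt).

    This is the one-sample REINFORCE estimate of the policy gradient that a
    single reward event contributes; comparing it across seeds/checkpoints is
    one way to ask what that reward event "means" statistically.
    """
    policy.zero_grad()
    ids = torch.cat([input_ids, completion_ids], dim=-1)   # (1, T)
    logits = policy(ids[:, :-1]).logits                    # (1, T-1, vocab)
    logprobs = torch.log_softmax(logits, dim=-1)
    n = completion_ids.shape[-1]
    # Log-probability assigned to each completion token, summed.
    completion_logprob = logprobs[0, -n:, :].gather(
        -1, completion_ids[0].unsqueeze(-1)).sum()
    (reward * completion_logprob).backward()
    return {name: p.grad.detach().clone()
            for name, p in policy.named_parameters() if p.grad is not None}

def gradient_agreement(grads_a, grads_b):
    """Cosine similarity between two such gradients, flattened over params."""
    va = torch.cat([g.flatten() for g in grads_a.values()])
    vb = torch.cat([g.flatten() for g in grads_b.values()])
    return torch.nn.functional.cosine_similarity(va, vb, dim=0).item()
```

Collecting `reward_event_gradient` at the same reward event across several training runs and looking at `gradient_agreement` between them is one crude operationalization of "statistically, across training runs"; the interesting interpretability work would be explaining what the agreeing directions do to the network's cognition.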
Also, “is this model considering deceiving me?” doesn’t seem like that great of a question. Even an aligned AI would probably at least consider the plan of deceiving you, if that AI’s originating lab is dallying on letting it loose while unaligned AIs become increasingly capable around the world. Perhaps instead check whether the AI is actively planning to kill you; that seems like better evidence about its alignment properties.