Thinking over the last few months, I came to most strongly endorse (2: Better prediction of future systems), or something close to it. I think that interpretability should adjudicate between competing theories of generalization and value formation in AIs (e.g. figure out whether, and under what conditions, a network learns a universal mesa objective versus contextually activated objectives). Secondarily, it should figure out the mechanistic picture of how reward events shape different kinds of cognition in a network (e.g. if I reward the agent for writing this line of code, what does the ensuing gradient mean, statistically, across training runs?).
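To make that parenthetical question concrete, here is a minimal sketch of how one might measure the gradient a single reward event induces, assuming a Hugging Face-style causal LM policy and a one-sample REINFORCE-style estimator. The names (`policy`, `input_ids`, `completion_ids`, `reward`) and the cosine-similarity aggregation are illustrative assumptions, not anything claimed above.

```python
import torch

def reward_event_gradient(policy, input_ids, completion_ids, reward):
    """Per-parameter gradient of reward * log pi(completion | prompt).

    This is the one-sample REINFORCE estimate of the policy gradient that a
    single reward event contributes; comparing it across seeds/checkpoints is
    one way to ask what that reward event "means" statistically.
    """
    policy.zero_grad()
    ids = torch.cat([input_ids, completion_ids], dim=-1)   # (1, T)
    logits = policy(ids[:, :-1]).logits                    # (1, T-1, vocab)
    logprobs = torch.log_softmax(logits, dim=-1)
    n = completion_ids.shape[-1]
    # Log-probability assigned to each completion token, summed.
    completion_logprob = logprobs[0, -n:, :].gather(
        -1, completion_ids[0].unsqueeze(-1)).sum()
    (reward * completion_logprob).backward()
    return {name: p.grad.detach().clone()
            for name, p in policy.named_parameters() if p.grad is not None}

def gradient_agreement(grads_a, grads_b):
    """Cosine similarity between two such gradients, flattened over params."""
    va = torch.cat([g.flatten() for g in grads_a.values()])
    vb = torch.cat([g.flatten() for g in grads_b.values()])
    return torch.nn.functional.cosine_similarity(va, vb, dim=0).item()
```

Collecting `reward_event_gradient` at the same reward event across several training runs and looking at `gradient_agreement` between them is one crude operationalization of "statistically, across training runs"; the interesting interpretability work would be explaining what the agreeing directions do to the network's cognition.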
Also, “is this model considering deceiving me?” doesn’t seem like that great of a question. Even an aligned AI would probably at least consider the plan of deceiving you, if that AI’s originating lab is dallying on letting it loose while unaligned AIs become increasingly capable around the world. Perhaps instead check whether the AI is actively planning to kill you; that seems like better evidence about its alignment properties.