What do you think about the following sort of interpretability?
You and I are neural nets. We give each other access to our code, so that we can simulate each other. However, we are only about human-level intelligence, so we can’t really interpret each other—we can’t look at the simulated brain and say “Ah yes, it intends to kill me later.” So what we do instead is construct hypothetical scenarios and simulate each other being in those scenarios, to see what they’d do. E.g. I simulate you in a scenario in which you have an opportunity to betray me.
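
For concreteness, here is a toy sketch of that probing loop in Python. Everything in it is made up for illustration (`Agent`, `probe_for_betrayal`, the scenario strings are all hypothetical stand-ins); in the real setting, "acting" would mean actually running the other net's weights forward on a constructed input, not looking up a table.

```python
# Toy sketch: behavioral interpretability by simulating the other agent
# in hypothetical scenarios. All names here are illustrative, not a real API.

from dataclasses import dataclass


@dataclass
class Agent:
    """Stand-in for a neural net whose code/weights we've been given."""
    name: str
    policy: dict  # scenario -> action; a real agent would be a forward pass

    def act(self, scenario: str) -> str:
        # In the real setting this would be a full simulation of the other
        # agent's network on the scenario, not a dictionary lookup.
        return self.policy.get(scenario, "no_action")


def probe_for_betrayal(other: Agent, scenarios: list[str]) -> dict[str, str]:
    """Run the other agent in each hypothetical scenario and record what it does."""
    return {s: other.act(s) for s in scenarios}


if __name__ == "__main__":
    # Hypothetical scenarios in which betrayal is possible.
    scenarios = [
        "you can defect with no chance of being caught",
        "defecting costs you a little but ruins me",
        "cooperating is strictly better for both of us",
    ]

    # A toy "opponent" whose dispositions we can only see through behavior.
    you = Agent(
        name="you",
        policy={
            "you can defect with no chance of being caught": "defect",
            "cooperating is strictly better for both of us": "cooperate",
        },
    )

    for scenario, action in probe_for_betrayal(you, scenarios).items():
        print(f"{scenario!r} -> {action}")
```

The point of the sketch is just that the interpretability signal is purely behavioral: we never inspect the simulated brain's internals, only what it outputs in scenarios we choose.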