How might we align AGI without relying on interpretability?
I’m currently pessimistic about the prospect. But it seems worth thinking about, because wouldn’t it be such an amazing work-around?
My first idea straddles the border between contrived and intriguing. Consider some AGI-capable ML architecture, and imagine its ℝⁿ parameter space being 3-colored as follows (a toy sketch in code follows the list):
Gray if the parameter vector+training process+other initial conditions leads to a nothingburger (a non-functional model)
Red if the parameter vector+… leads to a misaligned or deceptive AI
Blue if the learned network’s cognition is “safe” or “aligned” in some reasonable way
(This is a simplification, but let’s roll with it)
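To make the picture concrete, here is a minimal Python sketch of that coloring. Everything in it is hypothetical scaffolding I'm introducing for illustration: `Color`, `TrainingSetup`, and especially the oracle `color_of` are stand-ins, since the whole difficulty is that nobody knows how to compute such a function.

```python
from dataclasses import dataclass
from enum import Enum, auto

import numpy as np


class Color(Enum):
    GRAY = auto()  # nothingburger: training yields a non-functional model
    RED = auto()   # training yields a misaligned or deceptive AI
    BLUE = auto()  # the learned cognition is "safe"/"aligned" in some reasonable sense


@dataclass(frozen=True)
class TrainingSetup:
    """Everything besides the parameter vector: data, optimizer, seed, etc. (elided)."""
    seed: int


def color_of(theta: np.ndarray, setup: TrainingSetup) -> Color:
    """Hypothetical oracle: which color does this (initialization, setup) pair get?

    Nobody knows how to compute this; the hope is only that we might someday
    reason about it a priori, without interpreting the trained network.
    """
    raise NotImplementedError("This is the object we wish we could reason about.")
```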
And then if you could somehow reason about which parts of ℝⁿ weren’t red, you could ensure that no deception ever occurs. That is, you might have very little idea what cognition the learned network implements, but magically somehow you have strong a priori / theoretical reasoning which ensures that whatever the cognition is, it’s safe.
The contrived part is that you could just say “well, if we could wave a wand and produce an is-impact-aligned predicate, of course we could solve alignment.” True, true.
But the intriguing part is that it doesn’t seem totally impossible to me that we get some way of reasoning (at least statistically) about the networks and cognition produced by a given learning setup. See also: the power-seeking theorems, natural abstraction hypothesis, feature universality a la Olah’s circuits agenda...
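To gesture at what "reasoning at least statistically" might buy us, here is a toy sketch. It assumes, and this is a large assumption, that for a fixed training setup we had some trusted judge that could flag the run from a given seed as red after the fact; `is_red` and `red_probability_upper_bound` are hypothetical names for illustration, not anything that exists. Even then, this only bounds how often the setup lands in the red region; it doesn't certify any individual network.

```python
import math
import random
from typing import Callable


def red_probability_upper_bound(
    is_red: Callable[[int], bool],  # hypothetical judge: seed -> "was this run red?"
    n_runs: int,
    confidence: float = 0.95,
) -> float:
    """Upper-bound P(red) for a fixed training setup over random seeds.

    Assumes the judge is exact and seeds are drawn i.i.d.; both are
    simplifications for the sake of the sketch.
    """
    hits = sum(is_red(random.randrange(2**32)) for _ in range(n_runs))
    empirical = hits / n_runs
    # One-sided Hoeffding bound: with probability >= confidence,
    # the true P(red) is at most empirical + slack.
    slack = math.sqrt(math.log(1 / (1 - confidence)) / (2 * n_runs))
    return min(1.0, empirical + slack)


# Toy usage with a fake judge that never fires (standing in for the real,
# unavailable one): 1000 clean runs give P(red) <= ~0.039 at 95% confidence.
print(red_probability_upper_bound(lambda seed: False, n_runs=1000))
```

The point is only the shape of the argument: properties of the learning setup, quantified over runs, rather than interpretation of any one network's cognition.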