How might we align AGI without relying on interpretability?
I’m currently pessimistic about the prospect. But it seems worth thinking about, because wouldn’t it be such an amazing work-around?
My first idea straddles the border between contrived and intriguing. Consider some AGI-capable ML architecture, and imagine its ℝⁿ parameter space being 3-colored as follows (a toy sketch in code follows the list):
Gray if the parameter vector+training process+other initial conditions leads to a nothingburger (a non-functional model)
Red if the parameter vector+… leads to a misaligned or deceptive AI
Blue if the learned network’s cognition is “safe” or “aligned” in some reasonable way
(This is a simplification, but let’s roll with it)
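To make the picture concrete, here is a minimal Python sketch of that coloring. Everything in it is hypothetical scaffolding I'm introducing for illustration: `Color`, `TrainingSetup`, and especially the oracle `color_of` are stand-ins, since the whole difficulty is that nobody knows how to compute such a function.

```python
from dataclasses import dataclass
from enum import Enum, auto

import numpy as np


class Color(Enum):
    GRAY = auto()  # nothingburger: training yields a non-functional model
    RED = auto()   # training yields a misaligned or deceptive AI
    BLUE = auto()  # the learned cognition is "safe"/"aligned" in some reasonable sense


@dataclass(frozen=True)
class TrainingSetup:
    """Everything besides the parameter vector: data, optimizer, seed, etc. (elided)."""
    seed: int


def color_of(theta: np.ndarray, setup: TrainingSetup) -> Color:
    """Hypothetical oracle: which color does this (initialization, setup) pair get?

    Nobody knows how to compute this; the hope is only that we might someday
    reason about it a priori, without interpreting the trained network.
    """
    raise NotImplementedError("This is the object we wish we could reason about.")
```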
And then if you could somehow reason about which parts of ℝⁿ weren’t red, you could ensure that no deception ever occurs. That is, you might have very little idea what cognition the learned network implements, but magically somehow you have strong a priori / theoretical reasoning which ensures that whatever the cognition is, it’s safe.
The contrived part is that you could just say “well, if we could wave a wand and produce an is-impact-aligned predicate, of course we could solve alignment.” True, true.
But the intriguing part is that it doesn’t seem totally impossible to me that we get some way of reasoning (at least statistically) about the networks and cognition produced by a given learning setup. See also: the power-seeking theorems, natural abstraction hypothesis, feature universality a la Olah’s circuits agenda...
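To gesture at what "reasoning at least statistically" might buy us, here is a toy sketch. It assumes, and this is a large assumption, that for a fixed training setup we had some trusted judge that could flag the run from a given seed as red after the fact; `is_red` and `red_probability_upper_bound` are hypothetical names for illustration, not anything that exists. Even then, this only bounds how often the setup lands in the red region; it doesn't certify any individual network.

```python
import math
import random
from typing import Callable


def red_probability_upper_bound(
    is_red: Callable[[int], bool],  # hypothetical judge: seed -> "was this run red?"
    n_runs: int,
    confidence: float = 0.95,
) -> float:
    """Upper-bound P(red) for a fixed training setup over random seeds.

    Assumes the judge is exact and seeds are drawn i.i.d.; both are
    simplifications for the sake of the sketch.
    """
    hits = sum(is_red(random.randrange(2**32)) for _ in range(n_runs))
    empirical = hits / n_runs
    # One-sided Hoeffding bound: with probability >= confidence,
    # the true P(red) is at most empirical + slack.
    slack = math.sqrt(math.log(1 / (1 - confidence)) / (2 * n_runs))
    return min(1.0, empirical + slack)


# Toy usage with a fake judge that never fires (standing in for the real,
# unavailable one): 1000 clean runs give P(red) <= ~0.039 at 95% confidence.
print(red_probability_upper_bound(lambda seed: False, n_runs=1000))
```

The point is only the shape of the argument: properties of the learning setup, quantified over runs, rather than interpretation of any one network's cognition.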