To add some more concreteness: suppose we open up the model and find that it's basically just a giant k-nearest-neighbors lookup (it obviously can't literally be this, but it's the easiest analogy to describe). That would explain why current alignment techniques work and would dissolve some of the mystery of generalization. Now suppose we later create AGI and find that it does something very different internally, something more deeply entangled that we can't make sense of because it's too complicated. That would, imo, be strong evidence that we should expect our alignment techniques to break.
In other words, a load-bearing assumption is that current models are fundamentally simple/modular in some sense that makes interpretability feasible, and that observing this assumption break in future systems would be important evidence, hopefully arriving before those systems actually kill everyone.
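To make the analogy concrete, here is a minimal sketch (all data and names invented for illustration) of what "the model is basically a giant k-nearest-neighbors" would mean: every prediction reduces to retrieving the labels of the most similar training examples, so you can inspect exactly which examples drove any given output.

```python
# Toy illustration of the k-NN analogy: if a model's behavior reduces to
# retrieval over training data, its generalization is easy to audit.
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority label among the k training points closest to query."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    # Sorting by distance makes the "explanation" explicit: these are
    # the exact training examples responsible for the prediction.
    neighbors = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical training set: two clusters with made-up labels.
train = [((0.0, 0.0), "safe"), ((0.1, 0.2), "safe"),
         ((0.9, 1.0), "unsafe"), ((1.0, 0.8), "unsafe")]

print(knn_predict(train, (0.05, 0.1)))  # -> safe
```

A more entangled internal mechanism would have no analogous short list of "responsible" training examples to point at, which is the sense in which the analogy makes interpretability feasible.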