I feel like the biggest issue with aligning powerful AI systems, is that nearly all the features we’d like these systems to have, like being corrigible, not being deceptive, having values aligned with ours etc, are properties we are currently unable to state formally. They are clearly real properties, like humans can agree on examples of non-corrigibility, misalignment, dishonest, when shown examples of actions AIs could take. But we can’t put them in code or a program specification, and consequently can’t reason about them very precisely, test whether systems have them or not etc
One reason I’m very bullish on mechinterp is that it seems like the only natural pathway towards making progress on this. Transformers trained with RLHF do have “tendencies” and proto-values in a sense, figuring out how those proto-desires are represented internally, really understanding it, I believe will shed a lot of light on how values form in transformers, will necessarily entail getting a solid formal framework for reasoning aobut these processes, and will put the notions of alignment on much firmer ground. Same goes for the other features. Models already show deceptive tendencies. In the process of developing deep mechinterp understanding of that, I believe we’d gain better understanding of how deception in a neural net can be modeled formally, which would allow us to reason about it infinitely better.
(I mean, someone 300IQ might come along and just galaxy brain all this from first principles, but quite galaxy brained people have tried already.. The point is that if mechinterp was developed to a sophisticated enough level, in addition to all the good things listed already, it would shed a lot of conceptual clarity on many of the key notions, which we are currently stuck reasoning about on an informal level, and I think we will get there through incremental progress, without having to hope someone just figures it out by thinking really hard and having an einstein-tier insight).
I feel like the biggest issue with aligning powerful AI systems, is that nearly all the features we’d like these systems to have, like being corrigible, not being deceptive, having values aligned with ours etc, are properties we are currently unable to state formally. They are clearly real properties, like humans can agree on examples of non-corrigibility, misalignment, dishonest, when shown examples of actions AIs could take. But we can’t put them in code or a program specification, and consequently can’t reason about them very precisely, test whether systems have them or not etc
One reason I’m very bullish on mechinterp is that it seems like the only natural pathway towards making progress on this. Transformers trained with RLHF do have “tendencies” and proto-values in a sense, figuring out how those proto-desires are represented internally, really understanding it, I believe will shed a lot of light on how values form in transformers, will necessarily entail getting a solid formal framework for reasoning aobut these processes, and will put the notions of alignment on much firmer ground. Same goes for the other features. Models already show deceptive tendencies. In the process of developing deep mechinterp understanding of that, I believe we’d gain better understanding of how deception in a neural net can be modeled formally, which would allow us to reason about it infinitely better.
(I mean, someone 300IQ might come along and just galaxy brain all this from first principles, but quite galaxy brained people have tried already.. The point is that if mechinterp was developed to a sophisticated enough level, in addition to all the good things listed already, it would shed a lot of conceptual clarity on many of the key notions, which we are currently stuck reasoning about on an informal level, and I think we will get there through incremental progress, without having to hope someone just figures it out by thinking really hard and having an einstein-tier insight).