My greatest hopes for mechanistic interpretability do not seem to be represented here, so allow me to present my pet direction.
You invest many resources in mechanistically understanding ONE teacher network within a teacher-student training paradigm. This is valuable because, instead of presenting a simplistic proxy training signal, you can now send an abstract signal backed by some understanding of the world. Such a signal is harder to “cheat” and “hack”.
If we can fully interpret and design that teacher network, then our training signals can incorporate much of our world model and morality. True, this requires us to become philosophical and actually work out what that world model and morality are… but at least we then have a technical direction. In that case, a good deal of the technical side of the alignment problem is solved (at least for aligning AI to humans, if not humans to each other).
This argument says that all mechanistic interpretability effort could be focused on ONE network. I concede the method requires the teacher to have a decently generalizable world model… at which point, perhaps we are already in the danger zone.
Could you say more? Why would a teacher network be more capable of training a student network than literal humans? By what mechanism do you expect this teacher network to train other networks in a way that benefits from us understanding its internals?
Teacher-student training paradigms are not uncommon. Essentially, the teacher network is “better” than a human because it can generate far more feedback data and can react at the same speed as the larger student network. Humans can also be inconsistent, among other shortcomings.
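For concreteness, here is a minimal sketch of the standard knowledge-distillation version of this setup (the function names and temperature value are my own illustrative choices, not from any particular paper): the student is trained to match the teacher's full softened output distribution, which carries far more information per example than a single hard label.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; higher temperature softens the distribution.
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) over temperature-softened outputs.
    # The teacher's full distribution, not just its argmax, is the signal.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# The teacher's soft distribution also encodes how classes relate,
# not merely which one is correct.
teacher = [4.0, 1.0, 0.2]
aligned_student = [3.8, 1.1, 0.3]
confused_student = [0.2, 1.0, 4.0]
assert distillation_loss(aligned_student, teacher) < distillation_loss(confused_student, teacher)
```

The same mechanics are why a teacher network can out-scale human feedback: computing this loss is cheap, so it can be applied to every training example the student ever sees.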
What I was discussing is that currently, with many systems (especially RL systems), we provide a simple feedback signal that is machine-interpretable. For example: the “eggs” should be at coordinates x, y. But in reality we don’t want the eggs at coordinates x, y; we just want to make an omelet.
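As a toy illustration of why such a proxy is hackable (the state encoding here is entirely hypothetical), consider a reward defined purely over coordinates: it cannot distinguish outcomes that matter enormously to us.

```python
def proxy_reward(state, target=(3, 4)):
    # Machine-interpretable proxy: "the eggs should be at coordinates x, y".
    # Reward is negative Euclidean distance from the target position.
    x, y = state["egg_pos"]
    tx, ty = target
    return -((x - tx) ** 2 + (y - ty) ** 2) ** 0.5

# Two states the proxy cannot tell apart:
intact = {"egg_pos": (3, 4), "egg_intact": True}    # what we actually meant
smashed = {"egg_pos": (3, 4), "egg_intact": False}  # a reward-hacked outcome
assert proxy_reward(intact) == proxy_reward(smashed) == 0.0
```

A teacher network with a real world model would, in principle, score these two states very differently, because its feedback is defined over what we mean rather than over a coordinate check.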
So if we had a sufficiently complex teacher network, it could understand what we want in human terms and provide all the training signal needed to teach other student networks. In that situation, we may be able to get away with only ever fully mechanistically understanding the teacher network. If we know it is aligned, it can keep up and provide a sufficiently rich feedback signal to train any future students and make them aligned too.
If this teacher network has a model of reality that captures our morality and the complexity of the world, then we don’t fall into the trap of an AI doing something stupid like killing everyone in the world to cure cancer. The teacher network’s feedback is rich enough that such actions would never be scored as valuable, in simulation or otherwise.