Teacher-student training paradigms are fairly common. Essentially, the teacher network is “better” than a human as a source of feedback because it can generate far more feedback data and can respond at the same speed as the larger student network. Humans can also be inconsistent, among other problems.
What I was discussing is that many current systems (especially RL systems) are trained on a simple, machine-interpretable feedback signal. For example, the “eggs” should be at coordinates x, y. But in reality we don’t care whether the eggs are at coordinates x, y; we just want to make an omelet.
So, if we had a sufficiently complex teacher network, it could understand what we want in human terms and provide all the training signal we need to teach other student networks. In that situation, we may be able to get away with only ever fully mechanistically understanding the teacher network. If we know it is aligned, it can keep up with any future students and provide a feedback signal complex enough to train them and make them aligned.
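To make the eggs example concrete, here is a toy sketch of the difference between the two kinds of feedback. Everything in it is made up for illustration: the state layout `(egg_x, egg_y, cooked)`, the target coordinates, and especially `teacher_reward`, which is a one-line stand-in for what would really be a learned teacher network. It only shows how the same student choice comes out differently depending on which signal scores it.

```python
import math
from typing import Callable, Sequence, Tuple

# Hypothetical state layout: (egg_x, egg_y, cooked), where cooked is in [0, 1].
State = Tuple[float, float, float]


def coordinate_reward(state: State, target: Tuple[float, float] = (1.0, 2.0)) -> float:
    """Hand-coded, machine-interpretable signal: eggs should be at the target x, y."""
    x, y, _ = state
    return -math.hypot(x - target[0], y - target[1])


def teacher_reward(state: State) -> float:
    """Toy stand-in for a teacher network (assumption): scores the outcome we
    actually care about, a finished omelet, instead of the egg position."""
    _, _, cooked = state
    return cooked


def pick_best(candidates: Sequence[State], reward_fn: Callable[[State], float]) -> State:
    """The 'student' here simply picks whichever outcome the signal scores highest."""
    return max(candidates, key=reward_fn)


candidates = [
    (1.0, 2.0, 0.0),  # eggs exactly at the target coordinates, still raw
    (0.5, 1.5, 1.0),  # eggs slightly off target, omelet done
]

best_by_coordinates = pick_best(candidates, coordinate_reward)
best_by_teacher = pick_best(candidates, teacher_reward)
```

Under the coordinate signal the student prefers the raw eggs sitting in the right spot; under the teacher-style signal it prefers the finished omelet. A real teacher would of course have to learn that judgment rather than read it off a `cooked` field.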
If this teacher network has a model of reality that captures our morality and the complexity of the world, then we don’t fall into the trap of an AI doing stupid things like killing everyone in the world to cure cancer. The teacher network’s feedback would be sufficiently complex that it would never assign value to such actions, in simulations or otherwise.