The sharp left turn argument boils down to a handwavey analogy that evolution didn’t optimize humans to optimize for IGF (inclusive genetic fitness), which is actually wrong, as it clearly did, combined with another handwavey claim that capabilities will fall into a natural generalization attractor while there is no such attractor for alignment. That second component of the argument is also incorrect, because there is a natural known attractor for alignment: empowerment. The more plausible attractor argument is that selfish-empowerment is a stronger attractor than altruistic-empowerment.
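For concreteness, one standard formalization: empowerment in the Klyubin/Polani/Nehaniv sense is the channel capacity from an agent's n-step action sequence to its resulting future state,

$$\mathfrak{E}(s_t) \;=\; \max_{p(a_{t:t+n})} I\big(A_{t:t+n};\, S_{t+n} \mid s_t\big),$$

i.e. a measure of how much influence the agent has over its own future. "Selfish" vs "altruistic" empowerment would then correspond to maximizing one's own $\mathfrak{E}$ versus the $\mathfrak{E}$ of the other agents around you.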
Also, more generally: capabilities generalize through the world model, but any good utility function will also be defined through the world model, and thus can also benefit from its generalization.
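A minimal sketch of that dependency, with hypothetical module names rather than any particular architecture: if the utility function is a small head reading the world model's latent state instead of raw observations, improvements in the world model's representation flow through to the utility estimate for free.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Stand-in for a learned world model that encodes observations into a latent state."""
    def __init__(self, obs_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.encoder(obs)

class UtilityHead(nn.Module):
    """Utility is defined over the world model's latent concepts, so it inherits
    whatever generalization the world model gets from its much richer training signal."""
    def __init__(self, latent_dim: int):
        super().__init__()
        self.head = nn.Linear(latent_dim, 1)

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        return self.head(latent)

world_model = WorldModel(obs_dim=64, latent_dim=32)  # trained on prediction/compression
utility = UtilityHead(latent_dim=32)                 # trained on a much sparser value signal
obs = torch.randn(8, 64)
u = utility(world_model(obs))  # a better world model means a better-generalizing utility
```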
> any good utility function will also be defined through the world model, and thus can also benefit from its generalization.
Are you saying that as the world model gets more expressive/accurate/powerful, we somehow also get improved guarantees that the AI will become aligned with our values?
I’d agree with:
(i) As the world model improves, it becomes possible in principle to specify a utility function/goal for the AI which is closer to “our values”.
But I don’t see how that implies
(ii) As an AI’s (ANN-based) world model improves, we will in practice have any hope of understanding that world model and using it to direct the AI towards a goal that we can be remotely sure actually leads to good stuff, before that AI kills us.
Do you have some model/intuition of how (ii) might hold?
> Are you saying that as the world model gets more expressive/accurate/powerful, we somehow also get improved guarantees that the AI will become aligned with our values?
Not quite; if I were saying that, I would have said so. Instead I’d say that as the world model improves through training and improves its internal compression/grokking of the data, you can then also leverage this improved generalization to improve your utility function (which needs to reference concepts in the world model). You sort of have to do these updates anyway to avoid suffering an “ontological crisis”.
This same sort of dependency also arises for the model-free value estimators that any efficient model-based agent will have. Updates to the world model start to invalidate all the cached habitual action predictors—which is an issue for human brains as well, and we cope with it.
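A sketch of that coping mechanism under assumed interfaces (`model_based_value` is a hypothetical stand-in for whatever slower model-based evaluation, e.g. planning rollouts, the agent can run under the updated world model): once the encoder changes, the cached model-free estimates are stale and get refreshed by re-distilling from model-based targets.

```python
import torch
import torch.nn as nn

def refresh_cached_values(encoder: nn.Module,
                          value_head: nn.Module,
                          replay_obs: torch.Tensor,
                          model_based_value,
                          steps: int = 100) -> None:
    """Re-fit a cheap model-free value head after the world model has changed."""
    optimizer = torch.optim.Adam(value_head.parameters(), lr=1e-3)
    for _ in range(steps):
        with torch.no_grad():
            latents = encoder(replay_obs)          # latents under the *new* world model
            targets = model_based_value(latents)   # fresh targets; old cached values are now invalid
        loss = nn.functional.mse_loss(value_head(latents), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```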
(ii) isn’t automatic unless you’ve already learned an automatic procedure to identify/locate the target concepts in the world model that the utility function needs. This symbol grounding problem is really the core problem of alignment for ML/DL systems, in the sense that having a robust solution to it is mostly sufficient. The brain also had to solve this problem, and we can learn a great deal from its solution.
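One hedged illustration of what such a learned grounding procedure could look like in an ML/DL system (function and variable names are hypothetical): fit a probe on the world model's representation from labeled examples of the target concept, and let the utility function read the probe's output rather than raw observations.

```python
import torch
import torch.nn as nn

def ground_concept(world_model: nn.Module,
                   labeled_obs: torch.Tensor,
                   labels: torch.Tensor,
                   latent_dim: int,
                   steps: int = 200) -> nn.Module:
    """Fit a linear probe that locates one target concept inside the world model's latents.

    labeled_obs: observations annotated with whether the concept is present (labels in {0, 1}).
    The returned probe is the grounding that a utility function can then reference.
    """
    probe = nn.Linear(latent_dim, 1)
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(steps):
        with torch.no_grad():
            latents = world_model(labeled_obs)
        logits = probe(latents).squeeze(-1)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, labels.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return probe  # utility can now be defined over probe(world_model(obs))
```

Re-running this grounding step whenever the world model is retrained is what would keep the utility function pointed at the same referents, i.e. the ontological-crisis-avoiding update described above.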