any good utility function will also be defined through the world model, and thus can also benefit from its generalization.
Are you saying that as the world model gets more expressive/accurate/powerful, we somehow also get improved guarantees that the AI will become aligned with our values?
I’d agree with:
(i) As the world model improves, it becomes possible in principle to specify a utility function/goal for the AI which is closer to “our values”.
But I don’t see how that implies
(ii) As an AI’s (ANN-based) world model improves, we will in practice have any hope of understanding that world model and using it to direct the AI towards a goal that we can be remotely sure actually leads to good stuff, before that AI kills us.
Do you have some model/intuition of how (ii) might hold?
Are you saying that as the world model gets more expressive/accurate/powerful, we somehow also get improved guarantees that the AI will become aligned with our values?
Not quite; if I were saying that, I would have said so. Instead I’d say that as the world model improves through training, improving its internal compression/grokking of the data, you can then also leverage that improved generalization to improve your utility function (which needs to reference concepts in the world model). You sort of have to do these updates anyway to avoid an “ontological crisis”.
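To make that dependency concrete, here's a minimal sketch, assuming a PyTorch-style setup; `WorldModel`, `UtilityHead`, and `reground_utility` are illustrative names, not from any particular system. The utility head reads the world model's latent concepts (so it inherits the world model's generalization), and it gets refit whenever the world model's ontology shifts:

```python
# Illustrative sketch only: utility defined over world-model latents, plus the
# "re-grounding" update that has to accompany any change to the world model.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Stand-in for a learned world model: encodes observations into a latent/concept space."""
    def __init__(self, obs_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.encoder(obs)

class UtilityHead(nn.Module):
    """Utility is defined over world-model latents rather than raw observations."""
    def __init__(self, latent_dim: int):
        super().__init__()
        self.value = nn.Linear(latent_dim, 1)

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        return self.value(latent)

def reground_utility(old_wm: WorldModel, new_wm: WorldModel, old_utility: UtilityHead,
                     obs_batch: torch.Tensor, steps: int = 200, lr: float = 1e-3) -> UtilityHead:
    """When the world model is retrained and its latent space shifts, refit the utility head
    so the same situations keep (approximately) the same values under the new ontology.
    This is the update you keep having to make to avoid an ontological crisis."""
    latent_dim = new_wm(obs_batch[:1]).shape[-1]
    new_utility = UtilityHead(latent_dim)
    opt = torch.optim.Adam(new_utility.parameters(), lr=lr)
    with torch.no_grad():
        targets = old_utility(old_wm(obs_batch))  # values under the old ontology
    for _ in range(steps):
        loss = ((new_utility(new_wm(obs_batch)) - targets) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return new_utility
```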
This same sort of dependency also arises for the model-free value estimators that any efficient model-based agent will have: updates to the world model start to invalidate all of the cached habitual action predictors. That is an issue for human brains as well, and we cope with it.
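A correspondingly minimal sketch of the cached-estimator side (again with hypothetical names; the "habit" network stands in for a model-free predictor distilled from slower model-based evaluation):

```python
# Illustrative sketch: a cheap model-free estimator is distilled from model-based targets;
# any later update to the world model makes those cached targets stale, so the
# distillation has to be re-run (the same re-caching problem brains deal with).
import torch
import torch.nn as nn

def model_based_value(world_model: nn.Module, utility: nn.Module, obs: torch.Tensor) -> torch.Tensor:
    """Slow path: value computed through the world model. (A real agent would roll the
    world model forward over several steps here; we just evaluate the latent directly.)"""
    return utility(world_model(obs))

def distill_habit(habit_net: nn.Module, world_model: nn.Module, utility: nn.Module,
                  obs_batch: torch.Tensor, steps: int = 200, lr: float = 1e-3) -> nn.Module:
    """Fast path: fit the model-free 'habit' network to model-based targets.
    Must be repeated whenever world_model (or the utility defined over it) changes."""
    opt = torch.optim.Adam(habit_net.parameters(), lr=lr)
    with torch.no_grad():
        targets = model_based_value(world_model, utility, obs_batch)
    for _ in range(steps):
        loss = ((habit_net(obs_batch) - targets) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return habit_net
```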
(ii) isn’t automatic unless you’ve already learned an automatic procedure to identify/locate the target concepts in the world model that the utility function needs to reference. This symbol grounding problem is really the core problem of alignment for ML/DL systems, in the sense that a robust solution to it is mostly sufficient. The brain also had to solve this problem, and we can learn a great deal from its solution.
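One assumed, deliberately simplified picture of what that grounding step could look like: fit a small probe that locates the target concept in the world model's latent space from a handful of labeled examples, then define the utility signal through that probe. All names here are illustrative, and nothing in the sketch is claimed to be robust; it only shows where the symbol-grounding work has to happen:

```python
# Illustrative sketch of concept grounding via a linear probe on world-model latents.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fit_concept_probe(world_model: nn.Module, pos_obs: torch.Tensor, neg_obs: torch.Tensor,
                      steps: int = 300, lr: float = 1e-2) -> nn.Linear:
    """Learn a direction in latent space separating observations that do / don't instantiate
    the target concept, from a small labeled set. This is the 'identify/locate the concept'
    procedure; it has to be redone (or made robust) as the world model changes."""
    with torch.no_grad():
        x = torch.cat([world_model(pos_obs), world_model(neg_obs)])
    y = torch.cat([torch.ones(len(pos_obs), 1), torch.zeros(len(neg_obs), 1)])
    probe = nn.Linear(x.shape[-1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.binary_cross_entropy_with_logits(probe(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe

def grounded_utility(world_model: nn.Module, probe: nn.Linear, obs: torch.Tensor) -> torch.Tensor:
    """Utility expressed through the located concept rather than through raw observations."""
    return torch.sigmoid(probe(world_model(obs)))
```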