Are you saying that as the world model gets more expressive/accurate/powerful, we somehow also get improved guarantees that the AI will become aligned with our values?
Not quite; if I were saying that, I would have said so. Instead, I'd say that as the world model improves through training, and its internal compression/grokking of the data improves, you can also leverage that improved generalization to improve your utility function (which needs to reference concepts in the world model). You more or less have to do these updates anyway to avoid suffering an "ontological crisis". A toy sketch of what I mean by re-grounding the utility function after a world-model update is below.
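A minimal sketch of the idea, not a description of any real system: all the names (`WorldModel`, `Utility`, `rebind`) and the hand-set vectors are made up for illustration, and the "concepts" are just stand-ins for learned latent representations. The point is only that the utility function refers to concepts by name and gets re-bound when the ontology shifts, rather than silently pointing at stale vectors.

```python
import numpy as np

class WorldModel:
    def __init__(self, concept_vectors):
        # concept name -> latent vector (stand-ins for representations
        # learned by compressing/grokking the training data)
        self.concepts = concept_vectors

    def encode(self, observation):
        # Toy "encoder": treat the observation as already living in latent space.
        return np.asarray(observation, dtype=float)

class Utility:
    def __init__(self, world_model, target_names, weights):
        self.target_names = target_names
        self.weights = np.asarray(weights, dtype=float)
        self.rebind(world_model)  # ground the utility in the current ontology

    def rebind(self, world_model):
        # Re-locate the referenced concepts after a world-model update,
        # which is the step that avoids the "ontological crisis" failure mode.
        self.targets = np.stack([world_model.concepts[n] for n in self.target_names])

    def __call__(self, world_model, observation):
        z = world_model.encode(observation)
        # Score = weighted cosine similarity of the current state to target concepts.
        sims = self.targets @ z / (
            np.linalg.norm(self.targets, axis=1) * np.linalg.norm(z) + 1e-8
        )
        return float(self.weights @ sims)

# The world model improves (its concept vectors change), and the utility
# function is re-grounded instead of being left pointing at the old ontology.
wm = WorldModel({"human_flourishing": np.array([1.0, 0.0, 0.0])})
u = Utility(wm, ["human_flourishing"], weights=[1.0])
print(u(wm, [0.9, 0.1, 0.0]))

wm.concepts["human_flourishing"] = np.array([0.7, 0.7, 0.0])  # ontology shift
u.rebind(wm)
print(u(wm, [0.9, 0.1, 0.0]))
```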
The same sort of dependency also arises for the model-free value estimators that any efficient model-based agent will have: updates to the world model start to invalidate all the cached habitual action predictors. This is an issue for human brains as well, and we cope with it. A toy sketch of that cache-invalidation pattern follows.
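Again a minimal, assumed sketch rather than a claim about any real agent or about how brains actually do it: the version-stamping scheme and names (`ModelBasedAgent`, `value_cache`) are invented for illustration. It just shows why cached model-free estimates become stale when the world model they were computed under changes.

```python
class ModelBasedAgent:
    def __init__(self, world_model_version):
        self.world_model_version = world_model_version
        self.value_cache = {}  # state -> (value, world-model version it was computed under)

    def value(self, state, slow_model_based_eval):
        cached = self.value_cache.get(state)
        if cached is not None and cached[1] == self.world_model_version:
            return cached[0]  # fast "habitual" path: reuse the cached estimate
        v = slow_model_based_eval(state)  # slow path: plan/rollout with the world model
        self.value_cache[state] = (v, self.world_model_version)
        return v

    def update_world_model(self):
        # A world-model update invalidates the cached habitual estimates;
        # here they are version-stamped and recomputed lazily on next use.
        self.world_model_version += 1

# Usage: cache hit, then invalidation after a world-model update.
agent = ModelBasedAgent(world_model_version=0)
slow_eval = lambda s: float(len(s))  # stand-in for an expensive model-based rollout
print(agent.value("state_A", slow_eval))  # computed, then cached
print(agent.value("state_A", slow_eval))  # served from cache
agent.update_world_model()
print(agent.value("state_A", slow_eval))  # recomputed under the new model
```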
(ii) isn't automatic unless you've already learned an automatic procedure to identify/locate, in the world model, the target concepts that the utility function needs. This symbol grounding problem is really the core problem of alignment for ML/DL systems, in the sense that a robust solution to it is mostly sufficient. The brain also had to solve this problem, and we can learn a great deal from its solution.
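To make "locating a target concept in the world model" concrete, here is one toy illustration (a linear probe; this is an assumed example, not the brain's solution nor a proposal I'm committing to): fit a classifier on latent states labeled for concept presence, then treat the learned direction as the grounded symbol the utility function can reference.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for world-model latents: states where the concept is present
# differ along a hidden direction that the probe has to recover.
true_direction = rng.normal(size=16)
X = rng.normal(size=(200, 16))
y = (X @ true_direction > 0).astype(int)  # concept present / absent labels

probe = LogisticRegression(max_iter=1000).fit(X, y)
concept_direction = probe.coef_[0]  # the grounded "symbol" in latent space

def utility(latent_state):
    # Utility defined over the grounded concept rather than raw observations.
    return float(concept_direction @ latent_state)

print(utility(rng.normal(size=16)))
```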