One further issue is that if the AI does this inference within a single human model (as in CIRL), it may follow that model off a metaphorical cliff when trying to maximize the modeled reward.
Merely expanding the family of models isn't enough, because the best-predicting model is something like a microscopic, non-intentional model of the human: a "nearest unblocked model" problem. The solution should be similar in spirit: get the AI to score candidate models so that the sort of model we want it to use scores highly (or something more complicated in the places where human morality is undefined). This isn't just a prior; we want predictive quality to be only one of several (as yet ill-defined) criteria.
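As a rough sketch of what "scoring models on more than predictive quality" could mean (this is my illustration, not a proposal from the original argument), imagine each candidate human model getting a composite score: predictive fit plus other criteria. The `HumanModel` interface, the `intentionality` term, and the weights below are all hypothetical placeholders for criteria we don't yet know how to define.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class HumanModel:
    """Toy stand-in for a candidate model of the human (hypothetical interface)."""
    name: str
    log_likelihood: Callable[[Sequence], float]  # predictive fit to observed behaviour
    intentionality: float                        # placeholder: how agent-like the model treats the human, in [0, 1]


def model_score(model: HumanModel,
                observations: Sequence,
                w_predict: float = 1.0,
                w_intent: float = 5.0) -> float:
    """Score a model by predictive quality PLUS other criteria.

    Predictive fit alone would favour the microscopic, non-intentional model;
    the extra term is meant to keep intentional-stance models competitive.
    The weights and the intentionality term are illustrative only.
    """
    return w_predict * model.log_likelihood(observations) + w_intent * model.intentionality


def select_model(models: List[HumanModel], observations: Sequence) -> HumanModel:
    """Pick the highest-scoring model, not merely the best-predicting one."""
    return max(models, key=lambda m: model_score(m, observations))


# Toy demo: the intentional model wins despite a worse predictive fit.
obs: list = []  # placeholder observations
microscopic = HumanModel("microscopic", log_likelihood=lambda o: -1.0, intentionality=0.0)
intentional = HumanModel("intentional", log_likelihood=lambda o: -3.0, intentionality=0.9)
assert select_model([microscopic, intentional], obs).name == "intentional"
```

Of course, any fixed extra term like this just relocates the nearest-unblocked-model problem: the AI may find a model that games the intentionality criterion while still being effectively non-intentional, which is why the criteria themselves remain the hard, ill-defined part.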