I agree that modelling all human mistakes seems about as hard as modelling all of human values, so straightforward IRL is not a solution to the goal inference problem, only a reshuffling of complexity.
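To make the "reshuffling of complexity" point concrete, here is a minimal toy sketch (in Python, with invented numbers, not anything from the post) of how the reward that Boltzmann-rational IRL infers depends directly on the mistake model it assumes about the human. The same demonstrations imply very different reward gaps depending on the assumed rationality parameter beta, so whatever mistakes the model leaves out get folded into the inferred values.

```python
import numpy as np

# Toy two-action setting: a "human" chooses action A over action B 80% of the
# time. Under a Boltzmann-rationality model, P(choose A) = sigmoid(beta * (rA - rB)),
# so the maximum-likelihood reward gap is logit(p) / beta -- it depends entirely
# on the assumed rationality parameter beta (the "mistake model").

p_choose_A = 0.8                      # observed choice frequency (made-up data)
logit = np.log(p_choose_A / (1 - p_choose_A))

for beta in [0.5, 1.0, 5.0]:          # three different assumed mistake models
    inferred_gap = logit / beta       # MLE of rA - rB under that assumption
    print(f"assumed beta={beta:>3}: inferred reward gap rA - rB = {inferred_gap:.2f}")

# The same behaviour looks like a strong preference to an IRL system that
# assumes a noisy human (small beta) and a weak preference to one that assumes
# a near-optimal human (large beta). Any systematic bias the model omits
# (e.g. the human misunderstands action B) gets absorbed into the reward instead.
```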
However, I don’t think modelling human mistakes is fundamental to the goal inference problem in the way this post claims.
For example, you can imagine goal inference being solved along the lines of extrapolated volition: we give humans progressively more information about the outcomes of their actions and more time to deliberate, and let the AI try to generalize to the limit of a human with unlimited information and deliberation time (including time to deliberate about which information to attend to). It's unclear whether this limiting generalization would count as a sufficiently "reasonable representation" to solve the easy goal inference problem, but it's quite possible that it solves the full goal inference problem.
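As a purely illustrative sketch of the "limit of deliberation" idea (my own toy model, not anything proposed in the post), one could record a human's judgment of an outcome at increasing deliberation budgets and extrapolate to the asymptote; the functional form and the numbers below are assumptions chosen only to show the shape of the extrapolation.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data: a human's rating of some outcome after t minutes of
# deliberation with access to progressively more information. Numbers are invented.
deliberation_time = np.array([1, 2, 5, 10, 30, 60], dtype=float)
ratings = np.array([0.40, 0.44, 0.53, 0.62, 0.69, 0.70])

# Assume (purely for illustration) that the judgment approaches an asymptote
# v_inf as deliberation grows:  v(t) = v_inf - c * exp(-k * t)
def deliberation_curve(t, v_inf, c, k):
    return v_inf - c * np.exp(-k * t)

params, _ = curve_fit(deliberation_curve, deliberation_time, ratings,
                      p0=[0.7, 0.3, 0.1])
v_inf, c, k = params
print(f"extrapolated infinite-deliberation rating: {v_inf:.2f}")
```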
Another way we can avoid modelling all human mistakes is if we don't try to model all of human values, just the ones that are relevant to catastrophic or disempowering actions the AI could take. It seems plausible that there is a fairly simple description of some human cognitive limitations which, if addressed, would eliminate the vast majority of the risk, even if it can't help the AI decide whether the human would prefer to design a new city (to use Paul's example) more like New York or more like Jakarta. This would also count as a good-enough solution to the goal inference problem, and it doesn't require solving the "easy goal inference problem" in the full generality stated here.