Wouldn’t the reward function “maximize action for this configuration of atoms” fit the data really well (given unrealistic computational power), but produce unhelpful prescriptions for behavior outside the training set? I’m not seeing how IRL dodges the problem, other than the human manipulating the algorithm (effectively choosing a prior).
What I read Qiaochu as saying is that the IRL model doesn’t have an ontology of its own; the world it lives in is created by the ontology the programmer implicitly constructs for it through choices about training data. So this problem doesn’t come up: the IRL model isn’t interacting with the whole world, only with the parts of it the programmer thought relevant to solving the problem, and how well the model succeeds depends partly on how good a job the programmer did in picking what’s relevant.
This question feels confused to me but I’m having some difficulty precisely describing the nature of the confusion. When a human programmer sets up an IRL problem they get to choose what the domain of the reward function is. If the reward function is, for example, a function of the pixels of a video frame, IRL (hopefully) learns which video frames human drivers appear to prefer and which they don’t, based on which such preferences best reproduce driving data.
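To make “a function of the pixels” concrete, here’s a minimal sketch (the names, shapes, and the linear form are my own assumptions for illustration, not any particular IRL library or the setup being discussed):

```python
import numpy as np

# Hypothetical sketch: the programmer fixes the reward's domain (a video frame),
# and the IRL algorithm searches over the parameters w for the reward that best
# explains the driving demonstrations.

def frame_features(frame: np.ndarray) -> np.ndarray:
    # a crude, fixed summary of the pixels (downsample and rescale)
    return frame.reshape(-1)[::100].astype(float) / 255.0

def reward(frame: np.ndarray, w: np.ndarray) -> float:
    # linear-in-features reward over the chosen domain; IRL's job is to pick w
    # so that frames human drivers actually produce score higher than alternatives
    return float(w @ frame_features(frame))
```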
You might imagine that, with unrealistic amounts of computational power, IRL might attempt to understand what’s going on by modeling the underlying physics at the level of atoms, but that would be an astonishingly inefficient way to reproduce driving data even if it did work. IRL algorithms tend to have things like complexity penalties that make it possible to select, e.g., a “simplest” reward function out of the many reward functions that could reproduce the data (this is a prior, but a pretty reasonable and justifiable one as far as I can tell). Even with large amounts of computational power, I expect it would still not be worth using a substantially more complicated reward function than necessary.
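As a sketch of how a complexity penalty can do that selection, here’s a toy BIC-style scoring rule (the specific penalty, the candidate models, and all the numbers are invented for illustration; real IRL algorithms differ in the details):

```python
import numpy as np

def penalized_score(log_likelihood: float, n_params: int, n_data: int) -> float:
    # BIC-style rule: each extra parameter has to buy about 0.5*log(n_data)
    # nats of extra log-likelihood before it's worth including.
    return log_likelihood - 0.5 * n_params * np.log(n_data)

# Toy candidates: (name, log-likelihood on the driving data, parameter count).
candidates = [
    ("pixel-level reward", -1.00e5, 1_000),
    ("atom-level reward",  -0.99e5, 1_000_000_000),
]

n_data = 100_000
best = max(candidates, key=lambda c: penalized_score(c[1], c[2], n_data))
print(best[0])  # "pixel-level reward": the tiny fit gain doesn't pay for the extra parameters
```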
The problem is that if there’s a sufficiently large amount of sufficiently precise data, the physically-correct model’s high accuracy is going to swamp the complexity penalty. Reaching that point would take a ridiculously huge amount of data for atom-level physics, but there could be other abstraction levels which require less data but are still not what we want (e.g. gene-level reward functions, though that doesn’t fit the driving example very well).
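A quick back-of-the-envelope version of the swamping point (all numbers invented; the penalty is treated as a fixed description-length cost): the detailed model’s fit advantage grows linearly with the number of observations, so any nonzero per-observation advantage eventually overtakes any fixed penalty.

```python
# Hypothetical numbers: a huge extra description length for the atom-level
# reward, and a tiny average log-likelihood edge per observation.
extra_description_length = 1e12   # nats of extra complexity penalty
fit_advantage_per_datum = 1e-3    # nats of extra log-likelihood per observation

crossover_n = extra_description_length / fit_advantage_per_datum
print(f"atom-level model wins once n > {crossover_n:.0e} observations")  # ~1e15
```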
Also, reliance on limited data seems like the sort of thing which is A Bad Idea for friendly AGI purposes.
I don’t think that’s necessarily true?
The Bernstein-von Mises theorem. It is indeed not always true; the theorem has some conditions.
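For reference, my paraphrase of the usual parametric statement (the conditions below are the standard textbook ones, not something asserted in this thread):

```latex
% Rough conditions: the model is correctly specified (data drawn i.i.d. from
% P_{\theta_0} with \theta_0 an interior point), the parameter is identifiable,
% the likelihood is suitably smooth with nonsingular Fisher information
% I(\theta_0), and the prior has a continuous, positive density at \theta_0.
% Conclusion: the posterior is asymptotically Gaussian around the MLE,
\[
  \left\| \,\Pi\bigl(\cdot \mid X_1,\dots,X_n\bigr)
        - \mathcal{N}\!\Bigl(\hat{\theta}_n,\ \tfrac{1}{n}\, I(\theta_0)^{-1}\Bigr) \right\|_{\mathrm{TV}}
  \;\xrightarrow{\;P_{\theta_0}\;}\; 0 .
\]
```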
An intuitive example of where it would fail: suppose we are rolling a (possibly weighted) die, but we model it as drawing numbered balls from a box without replacement. If we roll a bunch of sixes, then the model thinks the box now contains fewer sixes, so the chance of a six is lower. If we modeled the weighted die correctly, then a bunch of sixes is evidence that it’s weighted toward six, so the chance of a six should be higher.
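Here’s a small numerical version of that example (the box’s initial contents and the Dirichlet(1,…,1) prior are my own choices for illustration):

```python
from fractions import Fraction

# Compare P(next roll is a six) after observing k sixes in a row under:
#   (a) the misspecified model: drawing without replacement from a box that
#       starts with 10 balls of each face (60 balls total), and
#   (b) a weighted-die model with a uniform Dirichlet(1,...,1) prior over the
#       six face probabilities.

def p_six_box(k: int, per_face: int = 10, faces: int = 6) -> Fraction:
    # without replacement: each observed six removes one six from the box
    return Fraction(per_face - k, per_face * faces - k)

def p_six_weighted_die(k: int, faces: int = 6) -> Fraction:
    # Dirichlet(1,...,1) prior, k observations, all of them sixes:
    # posterior predictive P(six) = (1 + k) / (faces + k)
    return Fraction(1 + k, faces + k)

for k in range(7):
    print(k, float(p_six_box(k)), float(p_six_weighted_die(k)))
# The box model gets *less* confident in sixes as the streak grows,
# while the weighted-die model gets more confident.
```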
Takeaway: Bernstein-von Mises typically fails in cases where we’re restricting ourselves to a badly inaccurate model. You can look at the exact conditions yourself; as a general rule, we want those conditions to hold. I don’t think it’s a significant issue for my argument.
We could set up the IRL algorithm so that atom-level simulation is outside the space of models it considers. That would break my argument. But a limitation on the model space like that raises other issues, especially for FAI.