Intuitively, the reason that our biases are biases, and not a different reward function, is that:
I would be happy to get rid of my biases, in that I would accept a well-designed self-modification that removed my biases. (“well-designed” is hiding a lot of complexity, but the point is just that such a self-modification exists.)
The bias applies across a variety of different scenarios with very different reward functions.
The first point suggests requiring that the human's values be reflectively stable. In particular, if the human would choose action a in state s under the (M, R) pair, then they should also say that they would choose action a in state s even after you explain what the consequences of action a will be. This is not a good solution (when people's speech and people's behavior disagree, it's certainly possible that the behavior actually reflects their values and not the speech), but something along these lines seems important.
I’m more interested in the second point though. Let’s consider the setting where you have n different tasks for which you observe the human policy. After running IRL, you have a single rationality model M and multiple rewards R_1 … R_n. Intuitively, the rationality model M is better if the expected complexity of the inferred reward for a new unseen task is lower. That is, if you sample a new task T_{n+1} from the distribution of tasks and run IRL to estimate R_{n+1} using the existing learned rationality model M, you can expect that R_{n+1} will be simple.
What I’m trying to get at here is that the correct rationality model has a lot more explanatory power. Kolmogorov complexity doesn’t really capture that.
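To make the criterion slightly more concrete, here is a minimal sketch, entirely my own framing rather than anything from the post: tasks are tiny bandit problems, the rationality model M is a Boltzmann (softmax) choice rule with inverse temperature beta, "IRL" is penalized maximum likelihood, and the L1 norm of the inferred reward stands in (crudely) for its complexity. A candidate M is then scored by the average complexity of the rewards it forces you to infer on held-out tasks, lower being better.

```python
# Hypothetical sketch (not from the post): score a candidate rationality model
# by the expected complexity of the reward it forces you to infer on new,
# unseen tasks. Tasks are tiny bandit problems, M is a Boltzmann choice rule
# with inverse temperature beta, and L1 norm is a crude stand-in for complexity.
import numpy as np
from scipy.optimize import minimize

def boltzmann_policy(reward, beta):
    """Rationality model M: softmax over arm rewards with inverse temperature beta."""
    z = beta * reward
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def infer_reward(choices, n_arms, beta, l1_weight=0.1):
    """'IRL' for one task: penalized max-likelihood reward under the fixed model M."""
    counts = np.bincount(choices, minlength=n_arms)

    def objective(reward):
        p = boltzmann_policy(reward, beta)
        nll = -(counts * np.log(p + 1e-12)).sum()
        return nll + l1_weight * np.abs(reward).sum()   # L1 as a complexity proxy

    return minimize(objective, x0=np.zeros(n_arms), method="Powell").x

def score_rationality_model(beta, heldout_tasks):
    """Lower is better: average complexity of rewards inferred on held-out tasks."""
    return float(np.mean([np.abs(infer_reward(choices, n_arms, beta)).sum()
                          for n_arms, choices in heldout_tasks]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Simulated "human": Boltzmann-rational with beta = 2 on sparse per-task rewards.
    tasks = []
    for _ in range(20):
        n_arms = 5
        true_reward = np.zeros(n_arms)
        true_reward[rng.integers(n_arms)] = 1.0        # one good arm per task
        p = boltzmann_policy(true_reward, beta=2.0)
        tasks.append((n_arms, rng.choice(n_arms, size=50, p=p)))

    for beta in (0.5, 2.0, 8.0):
        print(f"beta={beta}: avg reward complexity {score_rationality_model(beta, tasks):.2f}")
```

This proxy has obvious problems (for instance, a model that posits extreme rationality can explain the same choices with tiny rewards, so complexity would have to be traded off against fit), but it hopefully conveys the shape of the criterion: M is judged by how simple it makes the rewards on tasks it hasn't seen.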
Under this definition, it seems likely that you could only get (M(0), R(0)) and (M(4), R(4)), at least out of the compatible pairs you suggested. To break this last tie, perhaps you could add in an assumption that humans are closer to rational than anti-rational on simple tasks.
I do agree that in the fully general case where we observe the full policy for all of human behavior and want to determine all of human values, things get murkier. Some possible answers in this scenario:
We put a strong prior on humans making plans hierarchically. This could bring us back to the case where we have multiple tasks.
Assume humans are optimal given constraints on their resources (that is, bounded rationality). Then, we only need to infer a reward function and not a rationality model. It is far from obvious that this is anywhere close to accurate as a model of humans, but it seems plausible enough to warrant investigation (see the sketch below).
Both of these answers feel very unsatisfying to me though—they feel like hacks that don’t model reality perfectly.
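To make the second answer slightly more concrete anyway, here is a minimal sketch, again my own and with every modelling choice (chain MDP, depth-H value iteration, fixed softmax sharpness) assumed purely for illustration: once the rationality model is fixed to "plan H steps ahead, then act softmax-rationally on those Q-values", IRL reduces to fitting just a reward function by maximum likelihood.

```python
# Hypothetical sketch (mine, not from the post): fix the rationality model to
# "plan H steps ahead with depth-limited value iteration, then act softmax-
# rationally on those Q-values"; IRL then only has to fit a reward function.
import numpy as np
from scipy.optimize import minimize

N_STATES, H, BETA = 6, 3, 5.0       # chain MDP size, planning depth, softmax sharpness
ACTIONS = (-1, +1)                  # move left / move right along the chain

def step(s, a):
    return int(np.clip(s + a, 0, N_STATES - 1))

def q_values(reward):
    """Bounded rationality: Q-values from only H steps of value iteration."""
    v = np.zeros(N_STATES)
    for _ in range(H):
        q = np.array([[reward[step(s, a)] + v[step(s, a)] for a in ACTIONS]
                      for s in range(N_STATES)])
        v = q.max(axis=1)
    return q

def policy(reward):
    """The fixed rationality model: softmax over depth-limited Q-values."""
    z = BETA * q_values(reward)
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def neg_log_likelihood(reward, demos):
    p = policy(reward)
    return -sum(np.log(p[s, a]) for s, a in demos)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_reward = np.array([0., 0., 0., 0., 0., 1.])    # goal at the right end
    p_true = policy(true_reward)
    demos = [(int(s), rng.choice(2, p=p_true[s]))
             for s in rng.integers(0, N_STATES, size=200)]
    fit = minimize(lambda r: neg_log_likelihood(r, demos),
                   x0=np.zeros(N_STATES), method="Nelder-Mead",
                   options={"maxiter": 5000})
    print(np.round(fit.x - fit.x.min(), 2))             # inferred reward, up to a shift
```

The point of the sketch is just that with the rationality model pinned down, the only free object is the reward, so the degeneracy between rationality models and rewards disappears (at the cost of baking in a possibly wrong model of the human).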
Side note: How do I set my username? I logged in with Facebook and it never asked me for my name (Rohin Shah) and now I’m just “user 264”.