Three anchorings: number, attitude, and taste
I’ve shown that one cannot simultaneously deduce the preferences and rationality of an irrational agent—neither in theory, nor in practice.
To get around that problem, one needs to add extra assumptions—assumptions that cannot be deduced from observations. The approach that seems the most promising to me is to use the internal models that humans have—models of themselves and of other humans. Note that this approach violates algorithmic equivalence, since some of the internal structures of the human algorithm are relevant (which means it also violates extensionality and some versions of functionalism).
Humans very often agree...
One thing that gives me some hope for this approach is that humans very often agree with each other about when other humans, or they themselves, are being irrational or reasonable. The agreement isn’t perfect by any means—and we spend a lot of time debating the uncertain cases—but in many, many areas, most humans agree with each other, which implies that most humans use similar internal models.
My favourite example of this is the anchoring bias, where, for example, a human is asked to consider the last two digits of their social security number, then asked whether they would pay that amount for some chocolates, then asked what price they would actually pay for those chocolates. The bias comes from the fact that the price they name is influenced by the irrelevant two digits they have just considered.
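To make the effect concrete, here is a minimal simulation sketch of the experiment just described. The anchor weight, the noise level and the “true” valuation are my own illustrative assumptions, not figures from the actual study; the point is only that an irrelevant number can show up as a measurable effect on the stated price.

```python
# A minimal, purely illustrative simulation of the anchoring setup above.
# The anchor weight (0.3), the noise level and the "true" valuation are
# assumptions chosen for illustration, not figures from the actual experiment.
import random

def stated_price(true_value, anchor, anchor_weight=0.3, noise=5.0):
    """Toy model: the stated price is pulled part-way toward the irrelevant anchor."""
    return (1 - anchor_weight) * true_value + anchor_weight * anchor + random.gauss(0, noise)

random.seed(0)
true_value = 20.0                                        # hypothetical "unbiased" valuation
anchors = [random.randint(0, 99) for _ in range(1000)]   # last two digits of the SSN
prices = [stated_price(true_value, a) for a in anchors]

# The anchor has nothing to do with the chocolates, yet it predicts the price:
mean_a = sum(anchors) / len(anchors)
mean_p = sum(prices) / len(prices)
cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(anchors, prices)) / len(anchors)
var_a = sum((a - mean_a) ** 2 for a in anchors) / len(anchors)
print("estimated effect of the anchor on the stated price:", cov / var_a)  # roughly 0.3
```

Run as-is, the estimate recovers something close to the assumed anchor weight, which is exactly the kind of dependence on an irrelevant detail that the experiment exposes.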
What’s interesting is that almost everyone agrees that this is a bias; no one argues that people genuinely value pricing things close to numbers they’ve recently heard.
More preference, less bias
Now let’s imagine a situation where there is no quoting of social security numbers, but there are two potential chocolate vendors: one of standard politeness, the other very rude. Without running the experiment, I’m confident that people would be willing to pay more in the first case than in the second.
This is very similar to the anchoring bias: the two situations differ by one detail, and the price is different.
Is this a bias? Here I expect more disagreement. Yes, technically, the rudeness of the vendor should be independent of the quality of the chocolates, but there are arguments that, in social situations, one should take such factors into account.
All on taste
Now let’s consider a third “anchoring” situation, where the chocolates are sold in the same way, the only difference being that the first batch of chocolates is delicious, the second is disgusting. Again, I predict the delicious chocolates would sell for more.
Is this a bias? I’d expect that there would be almost universal agreement that it is not; indeed, “taste” is a loose synonym of “preference”.
So that’s one of the setups I’m considering when elucidating human preferences: three situations in which the chocolates differ by one detail, and are priced differently as a consequence. The first is clearly a bias, the second is debatable, and the third is clearly a difference in preference. Once AIs can figure out distinctions like this, we can start doing inverse reinforcement learning of human preferences; the sketch below illustrates the kind of labelled data that would involve.
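As a very rough sketch of what “figuring out distinctions like this” might look like in practice, here is some illustrative Python. The data structure, the labels and the filtering rule are my own hypothetical stand-ins, not an existing dataset or IRL algorithm; the idea is just that human judgements of bias versus preference supply the extra assumptions that observation alone cannot.

```python
# A hedged sketch of the three-scenario setup above. The dataclass, labels and
# the filtering rule are illustrative stand-ins of my own, not an existing
# dataset or IRL library; the point is that human "bias vs preference" labels
# supply assumptions that cannot be read off behaviour alone.
from dataclasses import dataclass

@dataclass
class ScenarioPair:
    differing_detail: str   # the single detail that changes between the two situations
    price_changed: bool     # whether the stated willingness to pay changed
    human_label: str        # "bias", "debatable", or "preference"

scenario_pairs = [
    ScenarioPair("last two digits of SSN mentioned beforehand", True, "bias"),
    ScenarioPair("vendor is rude rather than polite", True, "debatable"),
    ScenarioPair("chocolates are disgusting rather than delicious", True, "preference"),
]

def reward_relevant(pair: ScenarioPair) -> bool:
    """Only price differences that humans label as genuine preferences feed into
    the reward that inverse reinforcement learning tries to recover. The
    "debatable" cases are exactly where human models disagree, so they are
    left out here and would need further human judgement."""
    return pair.human_label == "preference"

for pair in scenario_pairs:
    verdict = "use for IRL" if reward_relevant(pair) else "do not treat as a preference"
    print(f"{pair.differing_detail}: {verdict}")
```

In this framing, only the differences that humans are willing to call preferences would be allowed to shape the reward function, while the debatable cases mark where the shared internal models run out and more human input is needed.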