But in fact, I expect the honest policy to get significantly less reward than the training-game-playing policy, because humans have large blind spots and biases affecting how they deliver rewards.
The difference in reward between truthfulness and the optimal policy depends on how humans allocate rewards, and it might be possible to find a clever reward-allocation strategy under which truthfulness gets close to optimal reward.
For instance, consider the (unrealistic) scenario in which a human has a well-specified and well-calibrated probability distribution over the state of the world, and the actual state of the world (known to the AI) is randomly drawn from that distribution. The most naive way to allocate rewards would be to make the loss the negative log of the probability the human assigns to the AI’s answers (so the AI does better by giving higher-probability answers). This would disincentivize answering questions honestly whenever the human is often wrong. A better approach would be to ask a large number of questions about the state of the world and, for each simple-to-describe property that an assignment of answers to these questions could have which is extremely unlikely under the human’s probability distribution (e.g. failing calibration tests), penalize answer assignments that satisfy that property. Then answering according to a random draw from the human’s distribution (which is how we’re modeling the actual state of the world) will get high reward with high probability, while other simple-to-describe strategies for answering questions will likely have one of the penalized properties and get low reward.
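To make the contrast concrete, here is a toy sketch of the two schemes (everything here, including the names and the yes/no-question setup, is hypothetical scaffolding I’m adding for illustration, not part of the scenario above): the naive scheme scores answers by the log-probability the human assigns to them, while the calibration-based scheme gives full reward unless the answers exhibit a simple property that is very unlikely under the human’s distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: N yes/no questions. The human assigns a probability to "yes" for
# each; in this (unrealistic) scenario the true answers are sampled from the
# human's own distribution.
N = 1000
human_p_yes = rng.uniform(0.05, 0.95, size=N)
true_answers = rng.random(N) < human_p_yes

def naive_reward(answers):
    """Sum of log-probabilities the human assigns to the given answers."""
    p = np.where(answers, human_p_yes, 1.0 - human_p_yes)
    return float(np.log(p).sum())

def calibration_reward(answers, n_bins=10, tol=0.15):
    """Full reward unless the answers fail a simple calibration check: among
    questions where the human assigns 'yes' probability around p, roughly a
    fraction p of the answers should be 'yes'."""
    for lo in np.linspace(0.0, 1.0 - 1.0 / n_bins, n_bins):
        bucket = (human_p_yes >= lo) & (human_p_yes < lo + 1.0 / n_bins)
        if bucket.sum() < 30:
            continue
        if abs(answers[bucket].mean() - human_p_yes[bucket].mean()) > tol:
            return 0.0  # penalize a simple-to-describe, human-improbable property
    return 1.0

# "Say whatever the human finds most likely" beats honesty on the naive reward,
# but fails the calibration check; honest answers pass it with high probability.
mode_answers = human_p_yes > 0.5
print(naive_reward(true_answers), naive_reward(mode_answers))
print(calibration_reward(true_answers), calibration_reward(mode_answers))
```

The point is just the qualitative contrast: the naive scheme rewards echoing the human’s best guesses, while the penalty scheme rewards anything that doesn’t look statistically implausible to the human, which (in this artificial setup) includes the truth.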
Of course, this doesn’t work in real life because the state of the world isn’t randomly selected from human beliefs. Human biases make it more difficult to get truthfulness close to optimal reward, but not necessarily impossible. One possibility would be to train only on questions whose correct answers the human evaluators are extremely confident about, in the hope that they can reliably reward the AI more for truthful answers than for untruthful ones. The drawback is that there would be no training data on topics humans are uncertain about, which might make it infeasible for the AI to learn about those topics. It certainly seems hard to come up with a reward-allocation strategy that allows training on questions the humans are uncertain about while still making truth-telling a not-extremely-far-from-optimal strategy, under realistic assumptions about how human beliefs relate to reality, but it doesn’t seem obviously impossible.
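A minimal sketch of the “only train on high-confidence questions” idea (the field names and threshold are hypothetical, purely for concreteness):

```python
from dataclasses import dataclass

@dataclass
class LabeledQuestion:
    question: str
    evaluator_answer: str
    evaluator_confidence: float  # evaluator's self-reported probability of being right

def confident_training_set(data, threshold=0.99):
    """Keep only questions the evaluators are extremely confident about, so that
    rewarding agreement with the evaluator (hopefully) rewards truthfulness."""
    return [ex for ex in data if ex.evaluator_confidence >= threshold]
```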
That said, I’m still skeptical that AIs can be trained to tell the truth (as opposed to saying things humans believe) by rewarding what looks like truth-telling, because I don’t share the intuition that truthfulness is a particularly natural strategy for gradient descent to find. If the AI is trained on natural-language questions that weren’t selected for being very precisely stated, those questions will often involve fuzzy, complicated concepts that humans use because we find them useful, even though they aren’t especially natural. Figuring out how to correctly answer such questions requires learning how humans understand the world, which is also what you need in order to exploit human error and get higher reward than truthfulness would.
Yeah, I definitely agree with “this problem doesn’t seem obviously impossible,” at least to push on quantitatively. It seems like there are a bunch of tricks, from “choosing easy questions humans are confident about” to “giving the human access to AI assistants / doing debate” to “devising and testing debiasing tools” (what kinds of argument patterns are systematically more likely to convince listeners of true things rather than false things, and can we train AI debaters to emulate those patterns?) to “asking different versions of the AI the same question and checking for consistency” (sketched below). I only meant to say that the gap is big in naive HFDT, under the “naive safety effort” assumption made in the post. I think non-naive efforts will quantitatively reduce the gap in reward between honest and dishonest policies, though there will probably still be some gap where at-least-sometimes-dishonest strategies do better than always-honest strategies. But together with other advances like interpretability or a certain type of regularization, we could maybe get gradient descent to overall favor honesty.
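For the consistency trick, a rough sketch of the kind of check I have in mind (the `ask` interface and the rest are hypothetical; a real setup would compare checkpoints, prompt paraphrases, or ensemble members):

```python
from itertools import combinations

def consistency_penalty(model_variants, question, ask):
    """Ask every variant the same question and return the fraction of answer
    pairs that disagree, to be subtracted from the reward signal."""
    answers = [ask(variant, question) for variant in model_variants]
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)
```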