If the problem is “humans don’t give good feedback”, then we can’t directly train agents to “help” with feedback; there’s nothing besides human feedback to give a signal of what’s “helping” in the first place. We can choose some proxy for what we think is helpful, but then that’s another crappy proxy which will break down under optimization pressure.
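To make "break down under optimization pressure" concrete, here's a minimal toy sketch (my own illustration, with made-up numbers, not anything from the original discussion): treat the proxy as "true value plus error", and note that the harder you select on the proxy, the more the winning plan is selected for error rather than for true value.

```python
# Toy Goodhart sketch: proxy = true value + noise. Selecting the
# proxy-maximizing plan increasingly selects on the noise as the
# search gets more thorough. Distributions are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def goodhart_gap(n_candidates: int, n_trials: int = 2000) -> float:
    """Average (proxy - true value) of the proxy-maximizing candidate."""
    gaps = []
    for _ in range(n_trials):
        true_value = rng.normal(size=n_candidates)           # what we actually care about
        proxy = true_value + rng.normal(size=n_candidates)   # what the feedback signal measures
        best = np.argmax(proxy)                              # apply optimization pressure to the proxy
        gaps.append(proxy[best] - true_value[best])
    return float(np.mean(gaps))

for n in (10, 100, 10_000):
    print(f"candidates searched = {n:>6}: proxy overshoots true value by ~{goodhart_gap(n):.2f}")
```

More candidates searched means a bigger gap between what the proxy says about the chosen plan and what we actually get; that's the sense in which a crappy proxy gets worse, not better, as the optimizer gets stronger.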
It’s not just about “fooling” humans, though that alone is a sufficient failure mode. Bear in mind that in order for “helping humans not be fooled” to be viable as a primary alignment strategy it must be the case that it’s easier to help humans not be fooled than to fool them in approximately all cases, because otherwise a hostile optimizer will head straight for the cases where humans are fallible. And I claim it is very obvious, from looking at existing real-world races between those trying to deceive and those trying to expose the deception, that there will be plenty of cases where the expose-deception side does not have a winning strategy.
The agent changing “human preferences” is another sufficient failure mode. The strategy of “design an agent that optimizes the hypothetical feedback that would have been given” is indeed a conceptually-valid way to solve that problem, and notably does not use a direct feedback signal in the RL sense. At that point, we’re doing EU maximization, not reinforcement learning: we’re optimizing for expected utility under a fixed model, not optimizing a feedback signal from the environment. Of course a bunch of the other problems of human feedback still carry over; “the hypothetical feedback a human would have given” is still a crappy proxy. But it’s a step in the right direction.
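To spell that distinction out, here's a toy sketch (hypothetical function names, my own illustration): the EU-maximizer scores plans against a frozen model of the feedback the human would have given, while the RL-style agent optimizes whatever feedback the human actually emits after the plan runs, including any feedback the plan itself induced.

```python
from typing import Callable, Sequence

Plan = str

def eu_maximize(plans: Sequence[Plan],
                hypothetical_feedback: Callable[[Plan], float]) -> Plan:
    """Pick the plan that a *fixed* model of human feedback scores highest.
    The model is never affected by what the plan does to the human."""
    return max(plans, key=hypothetical_feedback)

def rl_style_choice(plans: Sequence[Plan],
                    feedback_after_running: Callable[[Plan], float]) -> Plan:
    """Pick the plan that elicits the highest *actual* feedback signal,
    including any effect the plan had on the human giving it."""
    return max(plans, key=feedback_after_running)

plans = ["do the task", "manipulate the human"]
# Hypothetical feedback the (unmanipulated) human would have given:
frozen_model = {"do the task": 1.0, "manipulate the human": -1.0}.get
# Feedback actually collected after the plan runs on the human:
live_feedback = {"do the task": 1.0, "manipulate the human": 2.0}.get

print(eu_maximize(plans, frozen_model))       # -> "do the task"
print(rl_style_choice(plans, live_feedback))  # -> "manipulate the human"
```

The frozen model is of course still a crappy proxy in all the other ways, but manipulating the actual human no longer raises the score the agent is optimizing.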
Sure, humans are sometimes inconsistent, and we don’t always know what we want (thanks for the references, that’s useful!). But I suspect we’re mainly inconsistent in borderline cases, which aren’t catastrophic to get wrong. I’m pretty sure humans would reliably state that they don’t want to be killed, or to have lots of other people die, etc. And that when they have a specific task in mind, they’d state that they want the task done rather than not. All this is subject to them actually understanding the main considerations for whatever plan or outcome is in question, but that is exactly what debate and RRM are for.