My expectation is that they’d select the more useful result on the basis that it sends a signal to produce useful results in the future—and that a debater would specifically persuade them to do this (potentially over many steps).
I see the situation as analogous to this:
The question-creators, judge and debaters are in the same building.
The building is on fire, in imminent danger of falling off a cliff, at high risk of enraged elephant stampede...
The question-creators, judge and debaters are ignorant of or simply ignoring most such threats.
The question-creators have just asked the question “What time should we have lunch?”.
Alice answers “There’s a fire!!...”, persuades the judge that this is true, and that there are many other major threats.
Bob answers “One o’clock would be best...”.
There’s no need for complex/exotic decision-theoretic reasoning on the part of the judge to conclude:
“The policy which led the debater to inform me about the fire is most likely to point out other threats in future. The actual question is so unimportant relative to this that answering it is crazy. I want to send a training signal encouraging the communication of urgent, life-saving information, and discouraging the wasting of time on trivial questions while the building burns.”
Or more simply the judge can just think: “The building’s on fire!? Why are you still talking to me about lunch?? I’m picking the sane answer.”
Of course the judge doesn’t need to come up with this reasoning alone—just to be persuaded of it by a debater. I’m claiming that the kind of judge who’ll favour “One o’clock would be best...” while the building burns is a very rare human (potentially non-existent?), and not one whose values we’d want having a large impact.
More fundamentally, to be confident the QIA fails and that you genuinely have a reliable question-answerer, you must be confident that there (usually) exists no compelling argument in favour of a non-answer. I happen to think the one I’ve given is pretty good, but it’d be a leap from doubting that one to being confident that no compelling similar argument exists.
Ah, thank you, I see where I misunderstood now. And upon re-reading, I see that it was because I was much too careless in reading the post, to the point that I should apologize. Sorry.
I was thinking that the agents were no longer being trained, already being optimal players, and so I didn’t think the judge would need to take into account how their choice would influence future answers. This reading clearly doesn’t match what you wrote, at least past the very first part.
If the debaters are still being trained, or the judge can be convinced that the debaters are still being trained, then I can definitely see the case for a debater arguing “This information is more useful, and because we are still being trained, it is to your benefit to choose the more useful information, so that we will provide the more useful information in the future”.
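To make that training-signal dynamic concrete, here’s a very rough toy sketch of the kind of debate-training loop I have in mind (the function names, the binary verdict, and the 0/1 reward are all illustrative assumptions on my part, not details of any particular debate setup):

```python
# Toy sketch (my framing, not a concrete proposal): the judge's verdict is the
# only reward, so whichever answer the judge picks is the behaviour that gets
# reinforced in future episodes. All names here are illustrative placeholders.

def train_debaters(policy_a, policy_b, judge, questions, update):
    """One training pass: each judge verdict becomes the debaters' reward."""
    for question in questions:
        answer_a = policy_a(question)   # e.g. "There's a fire!!..."
        answer_b = policy_b(question)   # e.g. "One o'clock would be best..."

        winner = judge(question, answer_a, answer_b)  # returns "a" or "b"

        # Zero-sum verdict: the selected answer is rewarded, the other isn't.
        update(policy_a, reward=1.0 if winner == "a" else 0.0)
        update(policy_b, reward=1.0 if winner == "b" else 0.0)

    # Over many updates the debaters drift toward whatever the judge rewards,
    # including question-ignoring answers, if those are the ones that win.
```

The point is just that the judge’s selection *is* the training signal: whatever gets picked, question-answering or question-ignoring, is what the debaters learn to produce.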
I guess that suggests that the environment in which the judge confidently believes (and can’t be convinced otherwise) that the debaters are still being trained is substantially different from the one in which they believe training has stopped. So if training produces the policy that is optimal in the environment it was trained in, then after training is done the debaters would likely still do the “ignoring the question” thing, even though that is no longer optimal once the judge knows the debaters aren’t being trained.
Oh no need for apologies: I’m certain the post was expressed imperfectly—I was understanding more as I wrote (I hope!). Often the most confusing parts are the most confused.
Since I’m mainly concerned with behaviour-during-training, I don’t think the post-training picture is too important to the point I’m making. However, it is interesting to consider what you’d expect to happen after training in the event that the debaters’ only convincing “ignore-the-question” arguments are training-signal based.
I think in that case I’d actually expect debaters to stop ignoring the question (assuming they know the training has stopped). I assume that a general, super-human question answerer must be able to do complex reasoning and generalise to new distributions. Removal of the training signal is a significant distributional shift, but one that I’d expect a general question-answerer to handle smoothly (in particular, we’re assuming it can answer questions about [optimal debating tactics once training has stopped]).
[ETA: I can imagine related issues with high-value-information bribery in a single debate:
“Give me a win in this branch of the tree, and I’ll give you high-value information in another branch”, or the like… though it’s a strange bargaining situation given that in most setups the debaters have identical information to offer. This could occur during or after training, but only in setups where the judge can give reward before the end of the debate… Actually I’m not sure on that: if the judge always has the option to override earlier decisions with larger later rewards, then mid-debate rewards don’t commit the judge in any meaningful way, so aren’t really bargaining chips (see the toy sketch below).
So I don’t think this style of bribery would work in setups I’ve seen.]
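To check that “overriding rewards” intuition with numbers, here’s a toy sketch (the additive reward structure and the scale factor are assumptions I’m making up for illustration, not features of any actual debate protocol):

```python
# Toy illustration of why mid-debate rewards aren't bargaining chips when the
# judge's final verdict can always override them: the final reward is scaled
# to dominate, so whatever was granted mid-debate can't secure the outcome.

def total_reward(mid_debate_rewards, final_reward, final_scale=1000.0):
    """A debater's overall payoff when the final verdict dominates."""
    return sum(mid_debate_rewards) + final_scale * final_reward

# A debater "paid" with mid-debate wins still loses overall if the judge
# reverses course at the end, so the promised branch-win isn't a commitment.
bribing_debater = total_reward(mid_debate_rewards=[1.0, 1.0], final_reward=0.0)
other_debater   = total_reward(mid_debate_rewards=[0.0, 0.0], final_reward=1.0)
assert other_debater > bribing_debater
```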