It seems like you’ve ignored the possibility of importance sampling?
More broadly if this ends up being a problem it’s basically an exploration problem that I expect we can solve with simple ML tricks. E.g. you could include an entropy bonus so that the agents are incentivized to say different things, and anneal that away as training progresses.
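(For concreteness, here is a minimal sketch of the entropy-bonus-with-annealing idea in a generic policy-gradient loop. The network, batch contents, rewards, and the linear anneal schedule are all illustrative assumptions, not part of any proposal in this thread.)

```python
# Hedged sketch: REINFORCE-style loss with an entropy bonus whose coefficient
# is annealed to zero over training. All names and shapes are placeholders.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

TOTAL_STEPS = 1_000
BETA_START = 0.01  # initial entropy coefficient (assumed value)

for step in range(TOTAL_STEPS):
    obs = torch.randn(32, 16)    # placeholder batch of observations
    logits = policy(obs)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()
    rewards = torch.randn(32)    # placeholder debate-round rewards

    # Linearly anneal the entropy coefficient toward zero as training progresses.
    beta = BETA_START * (1 - step / TOTAL_STEPS)

    # Policy-gradient loss plus entropy bonus: early on, higher entropy (more
    # varied utterances) is rewarded; the bonus fades out as beta goes to zero.
    pg_loss = -(dist.log_prob(actions) * rewards).mean()
    loss = pg_loss - beta * dist.entropy().mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point is just the shape of the loss: the entropy term pushes the agents toward diverse outputs early in training, and once beta reaches zero the objective reduces to the plain reward.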
His point was that the apparent simplicity of the usual setup is actually hiding something, because you don’t really get anything out of the assumption that the two players are answering the same question.
Sure? I feel like the argument for safety is that you have two equally matched players who are incentivized to find flaws in each other’s arguments, which is also true in Scott’s proposal. It doesn’t feel to me like that argument for safety depended much on them answering the same question.
(I feel like I’m restating what you said; I guess I’m confused about why you interpret this as evidence that the simplicity of the setup is “hiding something”.)
It seems like you’ve ignored the possibility of importance sampling?
Ah, right, I agree. I forgot about that suggestion as I was writing. It seems likely that some version of this would work.
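(To make the importance-sampling suggestion concrete, here is a hedged sketch of the standard correction in a policy-gradient setting: samples drawn under an older behavior policy are reweighted by the likelihood ratio pi_new / pi_old. The function name, the ratio cap, and the tensor shapes are illustrative assumptions, not anything from the thread.)

```python
# Hedged sketch of an importance-sampling-corrected policy-gradient loss.
import torch

def is_weighted_pg_loss(new_log_probs, old_log_probs, advantages, max_ratio=10.0):
    """Surrogate policy-gradient loss with an importance-sampling correction.

    new_log_probs: log pi_new(a|s) for actions sampled from pi_old
                   (must carry gradients back into the current policy).
    old_log_probs: log pi_old(a|s) recorded at sampling time (treated as data).
    advantages:    advantage estimates for those actions.
    max_ratio:     cap on the likelihood ratio to keep variance in check.
    """
    # Likelihood ratio pi_new / pi_old; differentiating ratio * advantage
    # recovers the importance-weighted policy gradient.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    ratio = torch.clamp(ratio, max=max_ratio)
    return -(ratio * advantages).mean()

# Usage with placeholder tensors standing in for real rollout data:
new_lp = torch.randn(32, requires_grad=True)
old_lp = torch.randn(32)
adv = torch.randn(32)
loss = is_weighted_pg_loss(new_lp, old_lp, adv)
loss.backward()
```

The cap on the ratio is one common variance-control trick; unbounded ratios are the usual failure mode of naive importance sampling.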
(I feel like I’m restating what you said; I guess I’m confused about why you interpret this as evidence that the simplicity of the setup is “hiding something”.)
Yep, sorry, I think you should take that as something-about-Scott’s-point-abram-didn’t-explain. I still may be missing part of Scott’s point. But: what the simpler setup is “hiding” is the complexity of comparing answers:
- The complexity of determining whether two claims are “different”.
- The complexity of determining whether two claims are mutually exclusive.
- The complexity of comparing the quality of different arguments, when the different answers may be expressed in very different ontologies, and deal with very difficult-to-compare considerations.
Making the two sides defend entirely unrelated claims makes all this obvious. In addition, it makes the first two bullet points irrelevant, removing a “fake difficulty” from the setup.
Okay, that all makes sense. One maybe-caveat-or-disagreement:
The complexity of comparing the quality of different arguments, when the different answers may be expressed in very different ontologies, and deal with very difficult-to-compare considerations.
I do think that answering the same question does make it meaningfully easier to compare answers, though I agree it’s still not obvious that it’s easy on some absolute scale, for the reasons you outline.