I think the way around this is to make multiple roll-outs per model per problem: get n different CoTs from Shoggoth, then for each of those get k different summarizations from Face, giving n*k final answers. Optimal values for n and k probably depend on how expensive the roll-outs are. This population of answers lets you extract a useful feedback signal about Shoggoth’s contribution versus Face’s contribution.
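Concretely, here is a minimal sketch of one way to split that signal (my illustration, not something specified above): the mean reward over the k summaries of a given CoT marginalizes out Face's contribution, so it can serve as a score for Shoggoth's CoT, while each summary's deviation from that per-CoT mean scores Face with the CoT held fixed. The names `sample_cot`, `summarize`, and `reward` are hypothetical stand-ins for Shoggoth, Face, and whatever scalar feedback is available:

```python
from statistics import mean

def rollout_scores(problem, sample_cot, summarize, reward, n=4, k=4):
    """scores[i][j] = reward of the j-th Face summary of the i-th Shoggoth CoT."""
    scores = []
    for _ in range(n):
        cot = sample_cot(problem)                      # one Shoggoth roll-out
        scores.append([reward(summarize(cot)) for _ in range(k)])
    return scores

def split_credit(scores):
    # Shoggoth's score for CoT i: the mean over its k summaries,
    # which marginalizes out Face's contribution.
    shoggoth = [mean(row) for row in scores]
    # Face's score for answer (i, j): its reward relative to the other
    # answers built on the same CoT, holding Shoggoth's contribution fixed.
    face = [[r - mean(row) for r in row] for row in scores]
    return shoggoth, face
```

These per-CoT and per-summary scores could then be fed back as advantage-style weights in whatever RL update each model gets, though that part is equally speculative.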
So, how do you extract the desired credit-assignment info out of this?
(Also, how do you then use that info?)
If I have 10 students but I can only test them in pairs, I can get some information on how good individual students are by testing multiple different pairings.
If I have just Alice and Bob, pairing them up multiple times doesn’t help so much.
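To make the pairing intuition concrete (a toy example of my own, not from the thread): if each test score is modelled as roughly the sum of the two students’ individual abilities, enough distinct pairings give an overdetermined linear system that pins down the individuals, whereas repeated Alice-and-Bob tests all measure the same sum.

```python
import numpy as np

students = ["Alice", "Bob", "Carol", "Dave"]
pairs = [(0, 1), (0, 2), (1, 3), (2, 3), (0, 3), (1, 2)]
pair_scores = np.array([5.0, 7.0, 4.0, 6.0, 6.0, 6.0])  # made-up scores

# Design matrix: one row per test, with 1s in the columns of the two students tested.
A = np.zeros((len(pairs), len(students)))
for row, (i, j) in enumerate(pairs):
    A[row, i] = A[row, j] = 1.0

# Least-squares estimate of each student's individual contribution.
abilities, *_ = np.linalg.lstsq(A, pair_scores, rcond=None)
print(dict(zip(students, abilities.round(2))))

# With only Alice and Bob, every row would be [1, 1]: the design matrix has rank 1,
# so their individual abilities can't be separated no matter how many tests you run.
```

The Shoggoth/Face case is analogous: varying one component while holding the other fixed is what makes the individual contributions identifiable.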
What do you mean? I don’t get what it is that you’re saying is convincing.
Perhaps I should clarify my belief.
The o1 report says the following (emphasis mine):
Is this damning in the sense of showing that o1 is a significant danger, which needs to be shut down for public safety?
No, I don’t think so.
Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety?
This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don’t think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
Partially. OpenAI should be commended for looking for deception in this way, and for publishing what they have about the deception they uncovered.
However, I don’t buy the distinction they draw in the o1 report about not finding instances of “purposefully trying to deceive the user for reasons other than satisfying the user request”. Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was “trying” to do, and whether it “understands” that fake URLs don’t satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; perhaps something like “provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong” (probably due to the RL incentive).
More importantly, OpenAI’s overall behavior does not show concern about this deception. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.