Just to check I’m understanding correctly, in step 2, do you imagine the complex labelling process deferring to the simple process iff the simple process is correct (according to the complex process)? Assuming that we require precise agreement, something of that kind seems necessary to me.
I.e. the labelling process would be doing something like this:
# Return a pair of (simple, complex) labels for a given input.
def generate_labels(input):
    simple_label = GenerateSimpleLabel(input)
    if is_correct(simple_label, input):
        return simple_label, simple_label
    return simple_label, GenerateComplexLabel(input)
Does that make sense?
A couple of typos:
"...we are only worried if the model [understands? knows?] the dynamics..."
"...don't collect training data in situations without [where?] strong adversaries are trying..."
I think “deferring” was a bad word for me to use. I mostly imagine the complex labeling process will just independently label data, and then only include datapoints when there is agreement. That is, you’d just always return the (simple, complex) pair, and is-correct basically just tests whether they are equal.
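Concretely, that variant might look like this (a minimal sketch, reusing the hypothetical GenerateSimpleLabel and GenerateComplexLabel helpers from the snippet above):

# Label each input independently with both processes, then keep only the
# datapoints where they agree; is_correct reduces to an equality test.
def labeled_dataset(inputs):
    dataset = []
    for x in inputs:
        simple = GenerateSimpleLabel(x)
        complex_ = GenerateComplexLabel(x)
        if simple == complex_:
            dataset.append((x, simple))
    return dataset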
I said “defer” because one of the inputs that the complex labeling process uses may be “what a human who was in the room said,” and this may sometimes be a really important source of evidence. But that really depends on how you set things up; if you have enough other signals then you would basically always just ignore that one.
(That said, I think amplification is probably the most important difference between the simple and complex labeling processes, because that’s the only scalable way to inject meaningful amounts of extra complexity into the complex labeling process: since the ML system can’t predict itself very well, amplification forces it to basically try to win a multiplayer game with copies of itself, and we hope that’s more complicated. And if that’s the case then the simple labeling process may as well use all of the data sources, and the difference is just how complex a judgment we are making using those inputs.)
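To make the amplification picture concrete, here is a hedged sketch; HumanJudge, model.decompose, and model.answer are all hypothetical names introduced for illustration, not anything from the post:

def amplified_label(question, model, depth=1):
    # Base case: just ask the model directly (hypothetical model.answer).
    if depth == 0:
        return model.answer(question)
    # Copies of the model answer subquestions; a judge combines the results.
    subquestions = model.decompose(question)  # hypothetical helper
    subanswers = [amplified_label(q, model, depth - 1) for q in subquestions]
    return HumanJudge(question, subanswers)  # hypothetical judge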
“...and is-correct basically just tests whether they are equal.”
So here “equal” would presumably be “essentially equal in the judgement of the complex process”, rather than verbatim equality of labels (the latter seems silly to me; if it’s not silly I must be missing something).
I think they need to be exactly equal. I think this is most likely accomplished by making something like pairwise judgments and only passing judgment when the comparison is a slam dunk (as discussed in section 3). Otherwise the instrumental policy will outperform the intended policy (since it will do the right thing when the simple labels are wrong).
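In code, the slam-dunk filter might look something like this (a minimal sketch; ComplexCompare and the 0.99 threshold are illustrative assumptions, not from the post):

# ComplexCompare is a hypothetical helper returning the complex process's
# probability that label_a is better than label_b for input x.
SLAM_DUNK = 0.99  # illustrative confidence threshold

def pairwise_judgment(x, label_a, label_b):
    p = ComplexCompare(x, label_a, label_b)
    if p >= SLAM_DUNK:
        return label_a
    if p <= 1 - SLAM_DUNK:
        return label_b
    return None  # too close to call: drop this datapoint from training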
Very interesting, thanks.