There are lots of ways to allow H to interface with an implicitly represented Z, but the one Paul describes in “Learning the Prior” is to train some model Mz(⋅,z) which represents Z implicitly by responding to human queries about Z (see also “Approval-maximizing representations” which describes how a model like Mz could represent Z implicitly as a tree).
Once H can interface with Z, checking whether some answer is correct given Z is no more difficult than producing an answer given Z, since H can just produce their own answer and then compare it to the model's answer under some distance metric (e.g. one derived from an autoregressive language model). But it could be much easier if there are ways for H to directly evaluate how likely they would be to produce that answer.
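To make that concrete, here's a minimal sketch of the setup, with everything hypothetical rather than taken from the post: `query_z` stands in for Mz(⋅, z) as H's only access to Z, `human_answer` is a placeholder for H's answering process, and `distance` is whatever metric you use to compare answers (for instance, a similarity score computed with an autoregressive language model).

```python
# Hypothetical sketch: H interacts with an implicitly represented Z only through a
# trained query model (query_z), and verifies a candidate answer by regenerating
# its own answer and comparing the two under a distance metric.

from typing import Callable


def human_answer(question: str, query_z: Callable[[str], str]) -> str:
    """H produces an answer to `question`, consulting Z only via `query_z`
    (standing in for Mz(., z)); the body here is just a placeholder."""
    relevant_fact = query_z(f"What does Z say that bears on: {question}?")
    return f"Answer informed by: {relevant_fact}"


def check_answer(question: str,
                 candidate: str,
                 query_z: Callable[[str], str],
                 distance: Callable[[str, str], float],
                 threshold: float = 0.1) -> bool:
    """Verification is no harder than generation: H regenerates its own answer
    and accepts `candidate` if the two are close under `distance`."""
    own = human_answer(question, query_z)
    return distance(candidate, own) <= threshold
```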
Right, but in the post the implicitly represented Z is used by an amplification or debate system, because it contains more information than a human can quickly read and use (so are you assuming it’s simple to verify the results of amplification/debate systems?)
Ah, sorry, no—I was assuming you were just using whatever procedure you used previously to allow the human to interface with Z in that situation as well. I’ll edit the post to be more clear there.
Okay, that makes more sense now. My understanding is that for a question X, an answer Y from the ML system, and an amplification system A, the verification in your quote asks A to answer “Would A(Z) output answer Y to question X?”, as opposed to asking A to answer X and then checking whether its output equals Y. This can be at most as hard as running the original system, and could be much more efficient.
Yep; that’s what I was imagining. It’s worth noting, though, that it can be less safe to do that, since you’re letting A(Z) see Y, which could bias it in some way that you don’t want; I talk about that danger a bit in the context of approval-based amplification here and here.
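To spell out the two options being contrasted, here's a minimal sketch under the same caveat as above: `amplify` is a hypothetical interface to A(Z), not anything defined in the post, and the string prompt in the direct-verification version is purely illustrative.

```python
# Sketch of the two verification strategies discussed above (hypothetical interface):
# (1) regenerate-and-compare: run A(Z) on X from scratch and check the output equals Y;
# (2) direct verification: ask A(Z) the yes/no question "Would A(Z) output Y to X?",
#     which is at most as hard as (1) and potentially much cheaper, but it lets
#     A(Z) see Y, which could bias its judgment.

from typing import Callable


def verify_by_regeneration(x: str, y: str, amplify: Callable[[str], str]) -> bool:
    """Answer X from scratch with A(Z), then compare the result against Y."""
    return amplify(x) == y


def verify_directly(x: str, y: str, amplify: Callable[[str], str]) -> bool:
    """Ask A(Z) to judge the (X, Y) pair directly; note Y is visible to A(Z) here."""
    verdict = amplify(f"Would A(Z) output the answer {y!r} to the question {x!r}?")
    return verdict.strip().lower().startswith("yes")
```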