Thinking more about ELK. Work in progress, so I expect I will eventually figure out what’s up with this.
Right now it seems to me that Safety via Debate would elicit compact/non-obfuscated knowledge.
So the basic scenario is that in addition to SmartVault, you’d have Barrister_Approve and Barrister_Disapprove, who are trying to share evidence/reasoning which makes the human approve or disapprove of SmartVault scenarios.
The biggest weakness of this that I know of is Obfuscated Arguments—that is, it won’t elicit obfuscated knowledge.
It seems like in the ELK example scenario they’re trying to elicit knowledge that is not obfuscated.
The nice thing about this is that Barrister_Approve and Barrister_Disapprove both have pretty straightforward incentives.
Paul was an author of the debate paper so I don’t think he missed this—more like I’m failing to figure out what’s up with the SmartVault scenario, and the current set of counterexamples.
Current possibilities:
ELK is actually a problem about eliciting obfuscated information, and the current examples about eliciting non-obfuscated information are just to make a simpler thought experiment
Even if the latent knowledge was not obfuscated, the opposing barrister could make an obfuscated argument against it.
This seems easily treatable by the human just disbelieving any argument that is obfuscated-to-them.
I think we would be trying to elicit obfuscated knowledge in ELK. In our examples, you can imagine that the predictor’s Bayes net works “just because”, so an argument that is convincing to a human for why the diamond in the room has to be arguing that the Bayes net is a good explanation of reality + arguing that it implies the diamond is in the room, which is the sort of “obfuscated” knowledge that debate can’t really handle.
Re-reading the ELK proposal—it seems like the latent knowledge you want to elicit is not-obfuscated.
Like, the situation to solve is that there is a piece of non-obfuscated information, which, if the human knew it, would change their mind about approval.
How do you expect solutions to elicit latent obfuscated knowledge (like ‘the only true explanation is incomprehendible by the human’ situations)?
I don’t think I understand your distinction between obfuscated and non-obfuscated knowledge. I generally think of non-obfuscated knowledge as NP or PSPACE. The human judgement of a situation might only theoretically require a poly sized fragment of a exp sized computation, but there’s no poly sized proof that this poly sized fragment is the correct fragment, and there are different poly sized fragments for which the human will evaluate differently, so I think of ELK as trying to elicit obfuscated knowledge.
So if there are different poly fragments that the human would evaluate differently, is ELK just “giving them a fragment such that they come to the correct conclusion” even if the fragment might not be the right piece.
E.g. in the SmartVault case, if the screen was put in the way of the camera and the diamond was secretly stolen, we would still be successful even if we didn’t elicit that fact, but instead elicited some poly fragment that got the human to answer disapprove?
Like the thing that seems weird to me here is that you can’t simultaneously require that the elicited knowledge be ‘relevant’ and ‘comprehensible’ and also cover these sorts of obfuscated debate like scenarios.
Does it seem right to you that ELK is about eliciting latent knowledge that causes an update in the correct direction, regardless of whether that knowledge is actually relevant?
I feel mostly confused by the way that things are being framed. ELK is about the human asking for various poly-sized fragments and the model reporting what those actually were instead of inventing something else. The model should accurately report all poly-sized fragments the human knows how to ask for.
Like the thing that seems weird to me here is that you can’t simultaneously require that the elicited knowledge be ‘relevant’ and ‘comprehensible’ and also cover these sorts of obfuscated debate like scenarios.
I don’t know what you mean by “relevant” or “comprehensible” here.
Does it seem right to you that ELK is about eliciting latent knowledge that causes an update in the correct direction, regardless of whether that knowledge is actually relevant?
I feel mostly confused by the way that things are being framed. ELK is about the human asking for various poly-sized fragments and the model reporting what those actually were instead of inventing something else. The model should accurately report all poly-sized fragments the human knows how to ask for.
I think this is what I was missing. I was incorrectly thinking of the system as generating poly-sized fragments.
Thinking more about ELK. Work in progress, so I expect I will eventually figure out what’s up with this.
Right now it seems to me that Safety via Debate would elicit compact/non-obfuscated knowledge.
So the basic scenario is that in addition to SmartVault, you’d have Barrister_Approve and Barrister_Disapprove, who are trying to share evidence/reasoning which makes the human approve or disapprove of SmartVault scenarios.
The biggest weakness of this that I know of is Obfuscated Arguments—that is, it won’t elicit obfuscated knowledge.
It seems like in the ELK example scenario they’re trying to elicit knowledge that is not obfuscated.
The nice thing about this is that Barrister_Approve and Barrister_Disapprove both have pretty straightforward incentives.
Paul was an author of the debate paper so I don’t think he missed this—more like I’m failing to figure out what’s up with the SmartVault scenario, and the current set of counterexamples.
Current possibilities:
ELK is actually a problem about eliciting obfuscated information, and the current examples about eliciting non-obfuscated information are just to make a simpler thought experiment
Even if the latent knowledge was not obfuscated, the opposing barrister could make an obfuscated argument against it.
This seems easily treatable by the human just disbelieving any argument that is obfuscated-to-them.
I think we would be trying to elicit obfuscated knowledge in ELK. In our examples, you can imagine that the predictor’s Bayes net works “just because”, so an argument that is convincing to a human for why the diamond in the room has to be arguing that the Bayes net is a good explanation of reality + arguing that it implies the diamond is in the room, which is the sort of “obfuscated” knowledge that debate can’t really handle.
Okay now I have to admit I am confused.
Re-reading the ELK proposal—it seems like the latent knowledge you want to elicit is not-obfuscated.
Like, the situation to solve is that there is a piece of non-obfuscated information, which, if the human knew it, would change their mind about approval.
How do you expect solutions to elicit latent obfuscated knowledge (like ‘the only true explanation is incomprehendible by the human’ situations)?
I don’t think I understand your distinction between obfuscated and non-obfuscated knowledge. I generally think of non-obfuscated knowledge as NP or PSPACE. The human judgement of a situation might only theoretically require a poly sized fragment of a exp sized computation, but there’s no poly sized proof that this poly sized fragment is the correct fragment, and there are different poly sized fragments for which the human will evaluate differently, so I think of ELK as trying to elicit obfuscated knowledge.
So if there are different poly fragments that the human would evaluate differently, is ELK just “giving them a fragment such that they come to the correct conclusion” even if the fragment might not be the right piece.
E.g. in the SmartVault case, if the screen was put in the way of the camera and the diamond was secretly stolen, we would still be successful even if we didn’t elicit that fact, but instead elicited some poly fragment that got the human to answer disapprove?
Like the thing that seems weird to me here is that you can’t simultaneously require that the elicited knowledge be ‘relevant’ and ‘comprehensible’ and also cover these sorts of obfuscated debate like scenarios.
Does it seem right to you that ELK is about eliciting latent knowledge that causes an update in the correct direction, regardless of whether that knowledge is actually relevant?
I feel mostly confused by the way that things are being framed. ELK is about the human asking for various poly-sized fragments and the model reporting what those actually were instead of inventing something else. The model should accurately report all poly-sized fragments the human knows how to ask for.
I don’t know what you mean by “relevant” or “comprehensible” here.
This doesn’t seem right to me.
Thanks for taking the time to explain this!
I think this is what I was missing. I was incorrectly thinking of the system as generating poly-sized fragments.
Cool, this makes sense to me.
My research agenda is basically about making a not-obfuscated model, so maybe I should just write that up as an ELK proposal then.