What seems to be necessary is that the LCDT agent thinks its decisions have no influence on the impact of other agents’ decisions, not simply on the decisions themselves (this relates to Steve’s second point). For example, let’s say you’re deciding whether to press button A or button B, and I rewire them so that B now has A’s consequences, and A has B’s. I now assume that my action hasn’t influenced your decision, but it has influenced the consequences of your decision.
The causal graph here has both of us influencing a [buttons] node: I rewire them and you choose which to press. I’ve cut my link to you, but not to [buttons]. More generally, I can deceive you arbitrarily simply by anticipating your action and applying a post-action-adaptor to it (like re-wiring the buttons).
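To make that concrete, here’s a minimal Python sketch of the graph I’m describing (all the names are made up for illustration; this isn’t meant as a precise formalisation of LCDT):

```python
# Toy causal graph for the buttons example (illustrative names only).
# My decision node: whether to rewire the buttons.
# Your decision node: which button you press (an agent node, so LCDT cuts my edge to it).
# The [buttons] node: a non-agent node mapping (rewire, press) to a consequence.

AGENT_NODES = {"your_press"}
true_children_of_my_decision = {"your_press", "buttons"}

# LCDT only cuts edges from my decision into other agents' decision nodes:
lcdt_children_of_my_decision = true_children_of_my_decision - AGENT_NODES
assert lcdt_children_of_my_decision == {"buttons"}  # the link to [buttons] survives

def buttons(rewired: bool, press: str) -> str:
    """The [buttons] node: rewiring swaps which consequence each button produces."""
    if rewired:
        return {"A": "consequence_of_B", "B": "consequence_of_A"}[press]
    return {"A": "consequence_of_A", "B": "consequence_of_B"}[press]

# Under the cut graph I treat your press as fixed (my action can't change it)...
predicted_press = "A"
# ...yet my action still changes what that press does, via the intact [buttons] edge:
print(buttons(rewired=False, press=predicted_press))  # consequence_of_A
print(buttons(rewired=True,  press=predicted_press))  # consequence_of_B
```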
Perhaps the idea here is that I’d have no incentive to hide my interference with the buttons (since I assume it won’t change which you press). That seems to work for many cases, and so will be detectable/fixable in training—but after you apply a feedback loop of this sort you’ll be left with the action-adaptor-based deceptions which you don’t notice.
Thanks for the comment! That’s… a situation I hadn’t thought about. My first reaction is that it falls in the category of “myopic defection” instead of deception. In the concrete example, as you say, you would reveal it to any overseer/observer because you don’t think anything you do would impact them (similarly to what happens in the (N,M)-deception problem).
Your point about what happens if you detect defection in training is actually a more general issue that applies to all the examples: what do you do when you catch defection/bad incentives? LCDT isn’t trying to solve this problem, just to find a decision theory where you would always reveal such incentives in training. I personally don’t know how to solve this incentive question, and I agree it’s a concrete and crucial detail in aligning an AI.
It seems to me that to get the ”...and extract many valuable insights about its behavior”, there needs to be an assumption that Argmax-HCH is being simulated in a helpful/clear/transparent way. It’s not clear to me why this is expected: wouldn’t the same pressures that tend to make a “trained model that we searched for acting like HCH” opaque also make the simulation of Argmax-HCH opaque? Specifically, the LCDT agent only needs to run it, not understand it.
There are two ways to think about it.
If we’re talking about a literal LCDT agent (which is what I have in mind), then it would have a learned causal model of HCH good enough to predict what the final output is. That sounds more interpretable to me than just having an opaque implementation of HCH (but it’s not already interpreted for us).
If we’re talking about systems which act like an LCDT agent but are not literally programmed to do so, I’m not so sure. I expect that they need a somewhat flexible representation of what they’re trying to model, but maybe I’m missing a clever trick.
My first reaction is that it falls in the category of “myopic defection” instead of deception.
Ok, yes—it does seem at least to be a somewhat different issue. I need to think about it more.
In the concrete example, as you say, you would reveal it to any overseer/observer because you don’t think anything you do would impact them
Yes, though I think the better way to put this is that I wouldn’t spend effort hiding it. It’s not clear I’d actively choose to reveal it, since there’s no incentive in either direction once I think I have no influence on your decision. (I do think this is ok, since it’s the active efforts to deceive we’re most worried about)
If we’re talking about a literal LCDT agent (which is what I have in mind), then it would have a learned causal model of HCH good enough to predict what the final output is.
Sure, but the case I’m thinking about is where the LCDT agent itself is little more than a wrapper around an opaque implementation of HCH. I.e. the LCDT agent’s causal model is essentially: [data] --> [Argmax HCH function] --> [action].
I assume this isn’t what you’re thinking of, but it’s not clear to me what constraints we’d apply to get the kind of thing you are thinking of. E.g. if our causal model is allowed to represent an individual human as a black-box, then why not HCH as a black-box? If we’re not allowing a human as a black-box, then how far must things be broken down into lower-level gears (at fine enough granularity I’m not sure a causal model is much clearer than a NN)?
Quite possibly there are sensible constraints we could apply to get an interpretable model. It’s just not currently clear to me what kind of thing you’re imagining—and I assume they’d come at some performance penalty.
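For what it’s worth, here’s roughly the contrast I have in mind, as two toy causal models (node names are entirely hypothetical; this is just to make the granularity question concrete):

```python
# Each model is a dict mapping a node to its parents (illustrative only).

# The worrying case: the whole of Argmax-HCH sits inside one black-box node,
# so "the agent has a causal model" tells us very little.
coarse_model = {
    "data": [],
    "argmax_hch": ["data"],   # one opaque node wrapping everything
    "action": ["argmax_hch"],
}

# The hoped-for case: the human / HCH machinery shows up as explicit sub-nodes.
fine_model = {
    "data": [],
    "question": ["data"],
    "human_interpretation": ["question"],
    "subquestions": ["human_interpretation"],
    "subtree_answers": ["subquestions"],      # recursive HCH calls, still abstracted
    "human_aggregation": ["human_interpretation", "subtree_answers"],
    "action": ["human_aggregation"],
}

# The open question: what constraint rules out coarse_model without forcing us all
# the way down to "molecular-level gears"? Node count alone is clearly too crude:
print(len(coarse_model), len(fine_model))  # 3 7
```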
Yes, though I think the better way to put this is that I wouldn’t spend effort hiding it. It’s not clear I’d actively choose to reveal it, since there’s no incentive in either direction once I think I have no influence on your decision. (I do think this is ok, since it’s the active efforts to deceive we’re most worried about)
Agreed
Sure, but the case I’m thinking about is where the LCDT agent itself is little more than a wrapper around an opaque implementation of HCH. I.e. the LCDT agent’s causal model is essentially: [data] --> [Argmax HCH function] --> [action].
I assume this isn’t what you’re thinking of, but it’s not clear to me what constraints we’d apply to get the kind of thing you are thinking of. E.g. if our causal model is allowed to represent an individual human as a black-box, then why not HCH as a black-box? If we’re not allowing a human as a black-box, then how far must things be broken down into lower-level gears (at fine enough granularity I’m not sure a causal model is much clearer than a NN)?
Quite possibly there are sensible constraints we could apply to get an interpretable model. It’s just not currently clear to me what kind of thing you’re imagining—and I assume they’d come at some performance penalty.
I need to think more about it, but my personal mental image is that to be competitive, the LCDT agent must split the human into something lower-level than just a single distribution (even more so for HCH, which is more complicated). As for why such a lower-level causal model would be more interpretable than a NN (a toy sketch follows the list):
First, we know which part of the causal model corresponds to the human, which is not the case in the NN
The human will be modeled only by variables on this part of the causal graph, whereas it could be completely distributed over a NN
I don’t know how to formulate it, but a causal model seems to give way more information than a NN, because it encodes the causal relationships, whereas a NN could compute causal relationships in a completely weird and counterintuitive way.
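Here’s a quick toy sketch of what I mean by the first two points (variable names are invented purely for illustration):

```python
# A causal model where the nodes that make up the human model are explicitly tagged.
causal_model = {
    "observation":    {"parents": [],                               "human": False},
    "human_beliefs":  {"parents": ["observation"],                  "human": True},
    "human_goals":    {"parents": [],                               "human": True},
    "human_decision": {"parents": ["human_beliefs", "human_goals"], "human": True},
    "world_outcome":  {"parents": ["human_decision"],               "human": False},
}

def human_submodel(model):
    """Slice out just the variables (and internal edges) that model the human."""
    return {
        node: [p for p in spec["parents"] if model[p]["human"]]
        for node, spec in model.items()
        if spec["human"]
    }

print(human_submodel(causal_model))
# {'human_beliefs': [], 'human_goals': [], 'human_decision': ['human_beliefs', 'human_goals']}
# The human is localised in a few tagged nodes with explicit causal edges between them,
# whereas in a NN the same information could be smeared across the weights with no
# such labels or explicit edges.
```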
First, we know which part of the causal model corresponds to the human, which is not the case in the NN
This doesn’t follow only from [we know X is an LCDT agent that’s modeling a human] though, right? We could imagine some predicate/constraint/invariant that detects/enforces/maintains LCDTness without necessarily being transparent to humans. I’ll grant you it seems likely so long as we have the right kind of LCDT agent—but it’s not clear to me that LCDTness itself is contributing much here.
The human will be modeled only by variables on this part of the causal graph, whereas it could be completely distributed over a NN
At first sight this seems at least mostly right—but I do need to think about it more. E.g. it seems plausible that most of the work of modeling a particular human H fairly accurately is in modeling [humans-in-general] and then feeding H’s properties into that. The [humans-in-general] part may still be distributed. I agree that this is helpful. However, I do think it’s important not to assume things are so nicely spatially organised as they would be once you got down to a molecular level model.
a causal model seems to give way more information than a NN, because it encodes the causal relationships, whereas a NN could compute causal relationships in a completely weird and counterintuitive way
My intuitions are in the same direction as yours (I’m playing devil’s advocate a bit here—shockingly :)). I just don’t have principled reasons to think it actually ends up more informative.
I imagine learned causal models can be counter-intuitive too, and I think I’d expect this by default. I agree that it seems much cleaner so long as it’s using a nice ontology with nice abstractions… - but is that likely? Would you guess it’s easier to get the causal model to do things in a ‘nice’, ‘natural’ way than it would be for an NN? Quite possibly it would be.