I agree that once you have a fully deceptively aligned model that’s plotting against you and trying to figure out how to fool your transparency tools, there’s not much hope left for you to be able to detect and remedy that issue. Importantly, however, that’s not my model for how transparency could help you catch deception. Rather, the idea is that by using your transparency tools + overseer to guide your training process, you can prevent your training process from ever entering the regime where the model is trying to trick the transparency tools. This is especially important in the context of gradient hacking (as I mention in that post), since you have to inspect the entire training process if you want to know whether any gradient hacking occurred.
Put another way: once you’re playing the game where I can hand you any model and then you have to figure out whether it’s deceptive or not, you’ve already lost. Instead, you want to be in the regime where your training process is constructed so as to steer clear of situations in which your model might become deceptive in the first place. I see enforcing something like corrigibility or myopia as a way of doing this: before your model can become deceptive, it has to become non-myopic, which means if you can detect your model starting to become non-myopic, then you can prevent it from ever becoming deceptive.
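As a very rough sketch of the kind of oversight-guided training loop described above (this is not anyone's actual proposal; ToyModel, ToyOverseer, and the scalar "horizon" proxy are invented placeholders for a learned model, an amplified overseer with transparency tools, and whatever signal would actually indicate non-myopia):

```python
import copy
import random

class ToyModel:
    """Toy stand-in for a learned model. 'horizon' is a fake scalar proxy for
    how far ahead the model plans; a myopic model keeps it near zero."""
    def __init__(self):
        self.params = [0.0] * 4
        self.horizon = 0.0

    def snapshot(self):
        return copy.deepcopy(self)

class ToyOverseer:
    """Toy stand-in for an (amplified) overseer equipped with transparency tools."""
    def looks_myopic(self, model, threshold=1.0):
        # A real overseer would inspect the model's internals, not one scalar.
        return abs(model.horizon) < threshold

def train_with_oversight(model, overseer, steps=200):
    last_good = model.snapshot()
    for step in range(steps):
        # Ordinary training update (a random walk stands in for SGD here).
        model.params = [p + random.gauss(0, 0.1) for p in model.params]
        model.horizon += random.gauss(0, 0.05)

        # Oversight after every update: intervene as soon as the model starts
        # to look non-myopic, i.e. *before* it could ever become deceptive,
        # rather than trying to catch deception after the fact.
        if not overseer.looks_myopic(model):
            print(f"non-myopia flagged at step {step}; rolling back and halting")
            return last_good
        last_good = model.snapshot()
    return model

if __name__ == "__main__":
    random.seed(0)
    trained = train_with_oversight(ToyModel(), ToyOverseer())
```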
From “Relaxed adversarial training for inner alignment”:
[For the recursion problem,] the recursion doesn’t actually have to work for all models M in model space; rather, we only need our acceptability guarantee to hold along any path through model space that could actually be traversed by our training procedure.
Thus, it’s fine if our procedure fails when given a sufficiently unacceptable model, as long as our procedure can prevent such models from ever arising. For example, if we always used an old version of M that we were confident in for implementing Amp(M), then we would just need it to be the case that M couldn’t get to the point where it would be able to deceive Amp(M_old) in between oversight.
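To make the Amp(M_old) idea in the quoted passage concrete, here is a minimal toy sketch. The names amp, passes_inspection, and the "drift" number are all invented placeholders; the only point being illustrated is that the overseer is built from an older, already-trusted checkpoint, so it only ever has to handle the bounded amount of change M can accumulate between oversight steps.

```python
import copy

def amp(trusted_model):
    """Placeholder for amplification, Amp(M_old). In reality this would be a
    human-plus-model overseer built from the trusted checkpoint."""
    return trusted_model

def passes_inspection(overseer, model):
    """Placeholder acceptability check that Amp(M_old) runs on the current M.
    A real check would use transparency tools on M's internals."""
    return abs(model["drift"] - overseer["drift"]) < 1.0

def train_with_old_overseer(n_steps=50, oversight_every=5):
    model = {"drift": 0.0}                  # the model being trained, M
    trusted = copy.deepcopy(model)          # M_old: last trusted checkpoint
    for step in range(n_steps):
        model["drift"] += 0.1               # stand-in for one training update
        if step % oversight_every == 0:
            overseer = amp(trusted)         # oversee with Amp(M_old), not Amp(M)
            if not passes_inspection(overseer, model):
                return trusted              # M moved too far from M_old; revert
            trusted = copy.deepcopy(model)  # promote the inspected M to the new M_old
    return model

if __name__ == "__main__":
    final = train_with_old_overseer()
```

Because oversight happens every few steps, M can only diverge a bounded amount from M_old before it is inspected again, which is the property the quoted passage relies on.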
“Rather, the idea is that by using your transparency tools + overseer to guide your training process, you can prevent your training process from ever entering the regime where your model is trying to trick you. This is especially important in the context of gradient hacking (as I mention in that post) ...”
Indeed. I re-read the post and realized I hadn’t appreciated how much of your reasoning applies directly as a reply to my argument.
“Put another way: once you’re playing the game where I can hand you any model and then you have to figure out whether it’s deceptive or not, you’ve already lost.”
For the record, I think this is probably false (unless the agent is much smarter than you). The analogy I drew with GANs was an example of how this isn’t necessarily true. If you simply let an overseer inspect the model and trained that overseer end-to-end to detect deception, it would be very good at doing so. The main point of this post was not to express general pessimism about detecting deception, but to offer the idea that handing the overseer a “tool” (as I defined it) provides little more than a crutch and decreases the overseer’s effectiveness.
Now it is probably also true that monitoring the model’s path through model space during training would greatly strengthen the safety guarantee. But I wouldn’t go so far as to say that the alternative is doomed.
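For what it's worth, here is a toy illustration of the "train the overseer end-to-end" shape of the GAN analogy above. Everything in it is synthetic: fake_model_features stands in for transparency-derived features of a model, and the labels are invented, since obtaining real examples of deceptive models is exactly the hard part. The point is only that the detector is itself trained, rather than handed a fixed tool.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_model_features(deceptive, n=200, dim=8):
    """Synthetic stand-in for transparency-derived features of a model."""
    shift = 0.5 if deceptive else -0.5
    return rng.normal(shift, 1.0, size=(n, dim))

# A labelled dataset of "models" (all synthetic).
X = np.vstack([fake_model_features(True), fake_model_features(False)])
y = np.concatenate([np.ones(200), np.zeros(200)])

# The overseer itself is trained end-to-end to flag deception
# (logistic regression by gradient descent as the simplest possible detector).
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))     # predicted P(deceptive)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

p = 1 / (1 + np.exp(-(X @ w + b)))
print(f"toy overseer accuracy on synthetic data: {np.mean((p > 0.5) == y):.2f}")
```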
+1. A frame that might be helpful: figure out when adversarial training happens anyway, and why. I think this happens with human management, which might be illustrative.