Rather, the idea is that by using your transparency tools + overseer to guide your training process, you can prevent training from ever entering the regime where your model is trying to trick you. This is especially important in the context of gradient hacking (as I mention in that post).
Indeed. Re-reading the post, I noticed I hadn’t realized how much of your reasoning applies directly as a reply to my argument.
Put another way: once you’re playing the game where I can hand you any model and then you have to figure out whether it’s deceptive or not, you’ve already lost.
For the record, I think this is probably false (unless the agent is much smarter than you). The analogy I drew with GANs was an example of how this isn’t necessarily true. If you simply allowed an overseer to inspect the model, and trained that overseer end-to-end to detect deception, it would be very good at doing so. The main point of this post was not to express general pessimism about detecting deception, but to suggest that giving the overseer a “tool” (as I defined it) to use acts as little more than a crutch, decreasing its effectiveness.
Now it is probably also true that monitoring the model’s path through model-space during training would greatly strengthen the safety guarantee. But I wouldn’t go so far as to say that the alternative is doomed.
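To make the GAN analogy above a bit more concrete, here is a minimal sketch (my own illustration, not something from the post) of what “training an overseer end-to-end to detect deception” could look like: the overseer plays the role of a discriminator over some representation of the inspected model. Everything here is a hypothetical placeholder, including FEATURE_DIM, the synthetic sample_honest / sample_deceptive generators, and the assumption that a model can be summarized as a fixed-size feature vector at all.

```python
import torch
import torch.nn as nn

# Hypothetical: each inspected model is represented as a fixed-size
# "fingerprint" vector (e.g. flattened weights or probe activations).
FEATURE_DIM = 128

def sample_honest(batch_size):
    # Toy stand-in for fingerprints of honest models.
    return torch.randn(batch_size, FEATURE_DIM)

def sample_deceptive(batch_size):
    # Toy stand-in: deceptive models are shifted in feature space.
    return torch.randn(batch_size, FEATURE_DIM) + 0.5

# The overseer acts as a GAN-style discriminator: fingerprint -> logit
# for "this model is deceptive".
overseer = nn.Sequential(
    nn.Linear(FEATURE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

opt = torch.optim.Adam(overseer.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(1000):
    inputs = torch.cat([sample_honest(32), sample_deceptive(32)])
    labels = torch.cat([torch.zeros(32, 1), torch.ones(32, 1)])

    loss = loss_fn(overseer(inputs), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the toy version is only that nothing in this setup hands the overseer a fixed “tool”: the deception-detector itself is what gets trained end-to-end.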