I was one of the people you asked the skeptical question to, and I feel like I have a better reply now than I did at the time. In particular, your objection was:
To generalize my question, what if something goes wrong, we peek inside, and find out that it's one of the 10-15% of times when the model doesn't agree with the known algorithm that is used to generate the penalty term?
I agree this is an issue, but at worst it puts a bound on how well we can inspect the neural network's behavior. In other words, it means something like, "Our model of what this neural network is doing is wrong X% of the time." That sounds bad, but X can also be quite low. Perhaps more importantly, though, we shouldn't expect by default that, in the X% of cases where our guess is wrong, the neural network is adversarially optimizing against us.
The errors we make are plausibly neutral ones, meaning that in those intervals the AI could be doing something either good or bad, but probably nothing purposely catastrophic. We can strengthen this assumption by using adversarial training to deliberately search for interpretations that would prioritize exposing catastrophic planning.
ETA: This is essentially why engineers don't need to invoke quantum mechanics to argue that their designs are safe. The ordinary, less computationally demanding models may be less accurate, but by default engineers don't worry that their bridge is going to adversarially optimize for the (small) X% of cases where the predictions disagree. There is, of course, a lot to be said about when this assumption does not apply to AI designs.
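To make the X% framing a bit more concrete, here's a minimal toy sketch in Python. Everything in it (the `known_algorithm`, the stand-in `model`, and the ~90% agreement rate) is a hypothetical illustration, not anything from the original setup. It estimates the disagreement rate X empirically and then ranks inputs by how strongly the model and the reference diverge, which is roughly the flavor of "deliberately searching" for the cases most likely to hide a bad plan:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a "known algorithm" and a trained model that
# agrees with it on most inputs but diverges on a small fraction.
def known_algorithm(x):
    return np.sign(np.sum(x))

def model(x):
    # Pretend the network matches the known algorithm ~90% of the time
    # and does something else otherwise.
    return known_algorithm(x) if rng.random() < 0.9 else -known_algorithm(x)

def estimate_disagreement_rate(model_fn, reference_fn, inputs):
    """Estimate X: the fraction of inputs where the network's behavior
    diverges from the reference algorithm."""
    return float(np.mean([model_fn(x) != reference_fn(x) for x in inputs]))

def adversarial_probe(model_fn, reference_fn, candidates, top_k=5):
    """Rather than sampling at random, rank inputs by how strongly the
    model and the reference disagree, so inspection effort is spent on
    the cases most likely to conceal something bad."""
    scores = [abs(model_fn(x) - reference_fn(x)) for x in candidates]
    order = np.argsort(scores)[::-1]
    return [candidates[i] for i in order[:top_k]]

inputs = [rng.normal(size=8) for _ in range(1000)]
print("estimated X:", estimate_disagreement_rate(model, known_algorithm, inputs))
print("top divergent inputs found:", len(adversarial_probe(model, known_algorithm, inputs)))
```

The point of the sketch is only that X is measurable and that the "unknown" slice of behavior can be probed preferentially rather than at random; it says nothing about whether that slice is adversarial, which is exactly the question at issue.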
Perhaps more importantly, though, we shouldn't expect by default that, in the X% of cases where our guess is wrong, the neural network is adversarially optimizing against us.
I'm confused because the post you made one day after this comment seems to argue the opposite. Did you change your mind in between, or am I missing something?
I thought about your objection for longer and realized that there are circumstances in which we can expect the model to adversarially optimize against us. I think I've less changed my mind and more clarified when I think these tools are useful. In the process, I also discovered that Chris Olah and Evan Hubinger seem to agree: naively using transparency tools can break down in the deception case.