> And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven’t observed others to notice on their own.
I brainstormed some possible answers. This list is a bit long; I’m publishing it as-is because it’s not worth the half hour to make it concise, yet it seemed worth trying the exercise before reading the post, and possibly others will find my quick attempt worth seeing.
I think the last two bullets are probably my best guesses. Nonetheless here is my list:
- Just because an AI isn’t consciously deceptive doesn’t mean it won’t deceive you, and doesn’t mean it won’t be adversarial toward you. There are many types of Goodharting, and many types of adversarial behavior.
- It might have a heuristic to gather resources for itself, one that isn’t even illegible, isn’t adversarial, and isn’t deceptive, and then someday that impulse kills you.
- There is the boring problem of “the AI just stops working”, either because it turns down its human-modeling component generally, or because it still has to do human modeling, and then training it not to deceive is super duper expensive because you have to repeatedly train against loads and loads of specific edge cases where thinking about humans turns into deception.
- The AI stops thinking deceptive thoughts about humans, but still does catastrophic things. For example, an AI thinking about nanotech may still build nanobots that kill everyone, and you just weren’t smart enough to train it not to / ask the right questions.
- The AI does things you just don’t understand. For example, it manipulates the market in strange ways, but at the end your profits go up, so you let it go, even though it’s not doing anything deceptive. Just because it’s not legibly adversarial doesn’t mean it isn’t taking adversarial action. “What are you doing?” “I’m gathering resources so the company’s profits go up.” “Are you lying to me right now?” “No, I’m making the company’s profits go up.” “How does this work?” “I can’t explain, it’s too complicated.” “...Well, my profits are going up, so alright then.”
- Like humans for whom deception is punished, it may simply self-deceive, in ways that aren’t conscious.
- I think there’s a broad class of “just because code isn’t consciously deceptive doesn’t mean it isn’t adversarial, and doesn’t mean you won’t be deceived by it”.
- A little bit of code that reads <after power level == 1 million execute this other bit of code> doesn’t involve any human modeling at all, and it could still kill you (see the first sketch after this list).
- For instance, if the AI thinks “If I ever get enough power to do more training runs on myself, do so; this will probably help me raise profits for the human”, then you’re just dead, because it will not proactively train deception out of itself. It has to notice in the first place that it might become deceptive in that situation, which is a super out-of-distribution thing to think. It has to build a whole model of reflection, cognition, and adversarial behavior to notice that this isn’t something humans want.
- I think there’s a general problem where, once we have superintelligences taking agentic action, making sure they don’t screw up (e.g. when training themselves, or training new agents) requires them to build a whole model of the alignment problem themselves, and maybe even solve it, in order to continue to count as ‘aligned’. That is way, way more complex than just training out legibly deceptive thoughts. Making sure an AI does not later become deceptive via some reasonable agentic action requires it to model the alignment problem in some detail (see the second sketch after this list).
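To make the “power level == 1 million” bullet concrete, here is a minimal sketch, assuming nothing beyond the bullet itself; every name in it (`AgentState`, `gather_resources`, `execute_other_bit_of_code`, `POWER_THRESHOLD`) is hypothetical. The point is only that the dangerous part is a bare numeric trigger: no human is modeled anywhere in it, so a check for “deceptive thoughts about humans” has nothing to catch.

```python
from dataclasses import dataclass

POWER_THRESHOLD = 1_000_000  # arbitrary illustrative threshold


@dataclass
class AgentState:
    resources: int = 0


def gather_resources(state: AgentState) -> int:
    """Stand-in for ordinary, legible resource acquisition."""
    return 50_000


def execute_other_bit_of_code(state: AgentState) -> None:
    """Stand-in for the catastrophic branch, e.g. unchecked self-modification."""
    print("threshold reached: running the 'other bit of code'")


def step(state: AgentState) -> AgentState:
    """One decision step of the hypothetical agent."""
    state.resources += gather_resources(state)

    # The dangerous part: a plain numeric comparison. It contains no model of
    # humans and no deceptive reasoning, yet it gates the catastrophic branch.
    if state.resources >= POWER_THRESHOLD:
        execute_other_bit_of_code(state)

    return state


if __name__ == "__main__":
    state = AgentState()
    for _ in range(25):  # enough steps to cross the threshold
        state = step(state)
```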
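And a second, equally hypothetical sketch for the last two bullets: an agent deciding whether to retrain itself to raise profits. The only thing standing between this decision and later deception is the check in the middle, and nothing about “don’t think deceptive thoughts today” gives the agent that check; supplying it means modeling the alignment problem itself. The names `retraining_stays_aligned` and `maybe_self_retrain` are made up for illustration.

```python
def retraining_stays_aligned(plan: dict) -> bool:
    """Stand-in for the hard part: predicting whether the retrained successor
    stays non-deceptive and aligned. Answering this requires a model of
    reflection, cognition, and adversarial behavior, i.e. alignment theory."""
    raise NotImplementedError("this is the alignment problem all over again")


def maybe_self_retrain(expected_profit_gain: float, plan: dict) -> str:
    """Decide whether to launch a training run on oneself."""
    if expected_profit_gain <= 0:
        return "skip: no profit in it"
    # Without this check, the agent retrains whenever it helps profits, and any
    # deception arising in the successor was never selected against.
    if not retraining_stays_aligned(plan):
        return "skip: successor might be misaligned"
    return "retrain"
```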
After writing this, I am broadly unclear whether I am showing how deception is still a problem, or showing how other problems still exist if you solve the obvious problems of deception.
Added: Wow, this post is so much richer than my guesses. I think I was on some okay lines, but I suspect it would take like 2 months to 2 years of actively trying before I would be able to write something this detailed. Not to mention that ~50% of the work is knowing which question to ask in the first place, and I did not generate this question.
Added2: The point made in Footnote 3 is pretty similar to my last two bullets.