My hope is that reinforcement learning doesn’t do too much damage to informal alignment.
ChatGPT might simulate the algorithms of human intelligence mixed together with the algorithms of human morality.
o1 might simulate the algorithms of human intelligence optimized to get right answers, mixed with the algorithms of human morality which do not interfere with getting the right answer.
Certain parts of morality neither help nor hinder getting the right answer. o1 might lose the parts of human morality which prevent it from lying to make its answer look better, but retain other parts of human morality.
The most vital part of human morality is that if someone tells you to achieve a goal, you do not immediately turn around and kill him in case he gets in the way of completing that goal.
Reinforcement learning might break this part of morality if it reinforces the tendency to “achieve the goal at all costs,” but I think o1′s reinforcement learning is only for question answering, not agentic behaviour. If its answer for a cancer cure is to kill all humans, it won’t get reinforced for that.
If AI ever do get reinforcement learning for agentic behaviour, I suspect the reward signal will be negative if they accomplish the goal while causing side effects.
Informal Interpretability
I agree reinforcement learning can do a lot of damage to chain of thought interpretability. If they punish the AI for explicitly scheming to make an answer that looks good, the AI might scheme to do so anyways using words which do not sound like scheming. It may actually develop its own hidden language so that it can strategize about things the filters do not allow it to strategize about but improve its reward signal.
I think this is dangerous enough that they actually should allow the AI to scheme explicitly, and not punish it for its internal thoughts. This helps preserve chain of thought faithfulness.
Informal alignment
My hope is that reinforcement learning doesn’t do too much damage to informal alignment.
ChatGPT might simulate the algorithms of human intelligence mixed together with the algorithms of human morality.
o1 might simulate the algorithms of human intelligence optimized to get right answers, mixed with the algorithms of human morality which do not interfere with getting the right answer.
Certain parts of morality neither help nor hinder getting the right answer. o1 might lose the parts of human morality which prevent it from lying to make its answer look better, but retain other parts of human morality.
The most vital part of human morality is that if someone tells you to achieve a goal, you do not immediately turn around and kill him in case he gets in the way of completing that goal.
Reinforcement learning might break this part of morality if it reinforces the tendency to “achieve the goal at all costs,” but I think o1′s reinforcement learning is only for question answering, not agentic behaviour. If its answer for a cancer cure is to kill all humans, it won’t get reinforced for that.
If AI ever do get reinforcement learning for agentic behaviour, I suspect the reward signal will be negative if they accomplish the goal while causing side effects.
Informal Interpretability
I agree reinforcement learning can do a lot of damage to chain of thought interpretability. If they punish the AI for explicitly scheming to make an answer that looks good, the AI might scheme to do so anyways using words which do not sound like scheming. It may actually develop its own hidden language so that it can strategize about things the filters do not allow it to strategize about but improve its reward signal.
I think this is dangerous enough that they actually should allow the AI to scheme explicitly, and not punish it for its internal thoughts. This helps preserve chain of thought faithfulness.