I agree with the analysis of the ideas overall. I think however, AI x-risk does have some issue regarding communications. First of all, I think it’s very unlikely that Yann will respond to the wall of text. Even though he is responding, I imagine him more to be on the level of your college professor. He will not reply to a very detailed post. In general, I think that AI x-risk should aim to explain a bit more, rather than to take the stance that all the “But What if We Just...” has already been addressed. It may have been, but this is not the way to getting them to open up rationally to it.
Regarding Yann’s ideas, I have not looked at them in full. However, they sound like what I imagine an AI capabilities researcher would try to make as their AI alignment “baseline” model:
Hardcoding the reward will obviously not work.
Therefore, the reward function must be learned.
If an AI is trained on reward to generate a policy, whatever the AI learned to optimize can easily go off the rails once it gets out of distribution, or learn to deceive the verifiers.
Therefore, why not have the reward function explicitly in the loop with the world model & action chooser?
ChatGPT/GPT-4 seems to have a good understanding of ethics. It probably will not like it if you told it a plan was to willingly deceive human operators. As a reward model, one might think it might be robust enough.
They may think that this is enough to work. It might be worth explaining in a concise way why this baseline does not work. Surely we must have a resource on this. Even without a link (people don’t always like to follow links from those they disagree with), it might help to have some concise explanation.
Honestly, what are the failure modes? Here is what I think:
The reward model may have pathologies the action chooser could find.
The action chooser may find a way to withhold information from the reward model.
The reward model evaluates what, exactly? Text of plans? Text of plans != the entire activations (& weights) of the model...
ChatGPT/GPT-4 seems to have a good understanding of ethics. It probably will not like it if you told it a plan was to willingly deceive human operators. As a reward model, one might think it might be robust enough.
This understanding has so far proven to be very shallow and does not actually control behavior, and is therefore insufficient. Users regularly get around it by asking the AI to pretend to be evil, or to write a story, and so on. It is demonstrably not robust. It is also demonstrably very easy for minds (current-AI, human, dog, corporate, or otherwise) to know things and not act on them, even when those actions control rewards.
If I try to imagine LeCun not being aware of this already, I find it hard to get my brain out of Upton Sinclair “It is difficult to get a man to understand something, when his salary depends on his not understanding it,” territory.
I agree with the analysis of the ideas overall. I think however, AI x-risk does have some issue regarding communications. First of all, I think it’s very unlikely that Yann will respond to the wall of text. Even though he is responding, I imagine him more to be on the level of your college professor. He will not reply to a very detailed post. In general, I think that AI x-risk should aim to explain a bit more, rather than to take the stance that all the “But What if We Just...” has already been addressed. It may have been, but this is not the way to getting them to open up rationally to it.
Regarding Yann’s ideas, I have not looked at them in full. However, they sound like what I imagine an AI capabilities researcher would try to make as their AI alignment “baseline” model:
Hardcoding the reward will obviously not work.
Therefore, the reward function must be learned.
If an AI is trained on reward to generate a policy, whatever the AI learned to optimize can easily go off the rails once it gets out of distribution, or learn to deceive the verifiers.
Therefore, why not have the reward function explicitly in the loop with the world model & action chooser?
ChatGPT/GPT-4 seems to have a good understanding of ethics. It probably will not like it if you told it a plan was to willingly deceive human operators. As a reward model, one might think it might be robust enough.
They may think that this is enough to work. It might be worth explaining in a concise way why this baseline does not work. Surely we must have a resource on this. Even without a link (people don’t always like to follow links from those they disagree with), it might help to have some concise explanation.
Honestly, what are the failure modes? Here is what I think:
The reward model may have pathologies the action chooser could find.
The action chooser may find a way to withhold information from the reward model.
The reward model evaluates what, exactly? Text of plans? Text of plans != the entire activations (& weights) of the model...
This understanding has so far proven to be very shallow and does not actually control behavior, and is therefore insufficient. Users regularly get around it by asking the AI to pretend to be evil, or to write a story, and so on. It is demonstrably not robust. It is also demonstrably very easy for minds (current-AI, human, dog, corporate, or otherwise) to know things and not act on them, even when those actions control rewards.
If I try to imagine LeCun not being aware of this already, I find it hard to get my brain out of Upton Sinclair “It is difficult to get a man to understand something, when his salary depends on his not understanding it,” territory.