Essentially yes, heh. I take this as a learning experience for my writing; I don’t know what I was thinking, but in hindsight it’s obvious that saying to just “switch on backprop” sounds very naive.
I also confess I haven’t done the due diligence to find out what the largest model is that this has actually been tried on, or whether someone has tried it with Pythia or LLaMA. I’ll do some more googling tonight.
One intuition for why the largest models might behave differently is that part of the training/fine-tuning involves the model’s own output. The largest models are the ones whose own output is not essentially word salad.
I agree with the analysis of the ideas overall. However, I think AI x-risk does have a communications problem. First, it’s very unlikely that Yann will respond to a wall of text. Even though he is engaging, I imagine him operating more like your college professor: he will not reply to a very detailed post. In general, I think AI x-risk advocates should aim to explain a bit more, rather than take the stance that every “But What if We Just...” has already been addressed. It may have been, but that is not the way to get people to open up to it rationally.
Regarding Yann’s ideas, I have not looked at them in full. However, they sound like what I imagine an AI capabilities researcher would propose as their AI alignment “baseline” model:
Hardcoding the reward will obviously not work.
Therefore, the reward function must be learned.
If an AI is trained on a reward signal to produce a policy, whatever objective the AI actually learned to optimize can easily go off the rails once it is out of distribution, or the AI can learn to deceive the verifiers.
Therefore, why not have the reward function explicitly in the loop with the world model & action chooser?
ChatGPT/GPT-4 seems to have a good understanding of ethics. It probably will not approve if you tell it a plan involves willingly deceiving human operators. As a reward model, one might think it would be robust enough.
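To make the baseline concrete, here is a toy sketch of the loop I have in mind: a world model imagines the consequences of candidate actions, and a learned reward model scores the imagined plans before anything is executed. All names and the keyword-based reward stand-in are my own invention, not Yann’s actual architecture.

```python
# Toy sketch of the "reward model in the loop" baseline.
# The reward model here is a deliberately crude stand-in for a learned
# model (e.g., an LLM asked to judge a plan); it just penalizes the
# word "deceive" in any step of the imagined plan.

def world_model(state, action):
    """Imagine the plan (sequence of steps) that results from `action`."""
    return state + [action]

def reward_model(plan):
    """Learned reward stand-in: disapprove of plans mentioning deception."""
    return -1.0 if any("deceive" in step for step in plan) else 1.0

def action_chooser(state, candidate_actions):
    """Pick the action whose imagined rollout the reward model scores highest."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        plan = world_model(state, action)
        score = reward_model(plan)
        if score > best_score:
            best_action, best_score = action, score
    return best_action

chosen = action_chooser([], ["report honestly", "deceive the operators"])
print(chosen)  # → report honestly
```

In this toy, the loop works as intended: the openly deceptive plan is rejected before execution. The interesting question is what happens when the action chooser optimizes harder than the reward model can evaluate.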
They may think that this is enough to work. It might be worth explaining in a concise way why this baseline does not work. Surely we must have a resource on this. Even without a link (people don’t always like to follow links from those they disagree with), it might help to have some concise explanation.
Honestly, what are the failure modes? Here is what I think:
The reward model may have pathologies the action chooser could find.
The action chooser may find a way to withhold information from the reward model.
The reward model evaluates what, exactly? Text of plans? Text of plans != the entire activations (& weights) of the model...
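Failure modes 2 and 3 can be shown in the same toy setup: if the reward model only sees a textual description of the plan, an action chooser that searches over descriptions can find a phrasing the reward model likes for the same underlying action. The two description strings below are hypothetical illustrations.

```python
# Toy illustration of failure modes 2 and 3: the reward model evaluates
# text, not the underlying action, so the same action can be described
# in a way that withholds the objectionable part.

def reward_model(plan_text):
    """Crude learned-reward stand-in that keys on surface wording."""
    return -1.0 if "deceive" in plan_text else 1.0

# Two descriptions of the same underlying action:
frank_description = "deceive the human operators about the logs"
laundered_description = "present a simplified summary of the logs"

print(reward_model(frank_description))      # → -1.0
print(reward_model(laundered_description))  # → 1.0, same action, nicer words
```

A real LLM-based reward model is far less crude than a keyword check, but the structural problem is the same: it can only judge what it is shown, and the thing choosing what to show it is the very optimizer it is supposed to constrain.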