A well-known analogy from Yann LeCun: if machine learning is a cake, then unsupervised learning is the cake itself, supervised learning is the icing, and reinforcement learning is the cherry on top.
I think this is useful for framing my core concerns about current safety research:
If we think that unsupervised learning will produce safe agents, then why would the comparatively small contributions of supervised learning (SL) and reinforcement learning (RL) make them unsafe?
If we think that unsupervised learning will produce dangerous agents, then why would safety techniques that focus on SL and RL (i.e. basically all of them) work, when they’re making comparatively small updates to agents which are already misaligned?
I do think it’s more complicated than I’ve portrayed here, but I haven’t yet seen a persuasive response to the core intuition.
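To make the “comparatively small contributions” point concrete, here is a toy simulation (my own illustration, not anything from LeCun or from the argument above) in which a model’s parameters get many self-supervised updates, far fewer supervised ones, and fewer still from RL. The step counts, learning rates, and the use of random directions in place of real gradients are all made up; the only point is that the final parameters sit almost entirely where pretraining put them, with SL and RL nudging them a little.

```python
# Toy illustration only: random update directions stand in for gradients,
# and the step counts are arbitrary stand-ins for "cake vs. icing vs. cherry".
import numpy as np

rng = np.random.default_rng(0)
params = rng.normal(size=1000)  # stand-in for a model's weights

def train_phase(params, n_steps, lr):
    """Apply n_steps of toy updates; return new params and distance moved."""
    start = params.copy()
    for _ in range(n_steps):
        params = params - lr * rng.normal(size=params.shape)
    return params, float(np.linalg.norm(params - start))

params, d_ssl = train_phase(params, n_steps=100_000, lr=1e-3)  # "cake": self-supervised pretraining
params, d_sl  = train_phase(params, n_steps=1_000,   lr=1e-3)  # "icing": supervised fine-tuning
params, d_rl  = train_phase(params, n_steps=100,     lr=1e-3)  # "cherry": RL fine-tuning

print(f"distance moved in parameter space -- SSL: {d_ssl:.2f}, SL: {d_sl:.2f}, RL: {d_rl:.2f}")
```

On these made-up numbers, the SL and RL phases barely move the model relative to pretraining, which is exactly the situation both questions above are asking about.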
I wrote a few posts on self-supervised learning last year:
https://www.lesswrong.com/posts/SaLc9Dv5ZqD73L3nE/the-self-unaware-ai-oracle
https://www.lesswrong.com/posts/EMZeJ7vpfeF4GrWwm/self-supervised-learning-and-agi-safety
https://www.lesswrong.com/posts/L3Ryxszc3X2J7WRwt/self-supervised-learning-and-manipulative-predictions
I’m not aware of any airtight argument that “pure” self-supervised learning systems, either generically or with any particular architecture, are safe to use at arbitrary levels of intelligence, though it seems very much worth someone trying to prove or disprove that. For my part, I got distracted by other things and haven’t thought about it much since then.
The other issue is whether “pure” self-supervised learning systems would be capable enough to satisfy our AGI needs, or to safely bootstrap to systems that are. I go back and forth on this. I wrote up one side of the argument here. The other side is that I’m now (vaguely) thinking that people need a reward system to decide which thoughts to think, and that the fact that GPT-3 doesn’t need reward is not evidence that reward is unimportant, but rather evidence that GPT-3 is nothing like an AGI. Well, maybe.
For humans, self-supervised learning forms the latent representations, but the reward system controls action selection. It’s not altogether unreasonable to think that action selection, and hence reward, is a more important thing to focus on for safety research: to a first approximation, AGIs are dangerous when they take dangerous actions. The fact that a larger fraction of neocortical synapses are adjusted by self-supervised learning than by reward learning is interesting and presumably safety-relevant, but I don’t think it immediately proves that self-supervised learning has a similarly larger fraction of the answers to AGI safety questions. Maybe, maybe not; it’s not immediately obvious. :-)
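To spell out the division of labor I have in mind, here is a minimal, hypothetical sketch (my own toy framing, not a model of the brain or of any existing system): a self-supervised module learns latent representations from prediction error alone, and a separate, much smaller reward-driven module selects actions on top of those latents. The linear maps, update rules, and dimensions are all placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

class PredictiveEncoder:
    """Self-supervised piece: learns latents by predicting the next observation.
    No reward signal ever reaches this module."""
    def __init__(self, obs_dim, latent_dim):
        self.W = rng.normal(scale=0.1, size=(latent_dim, obs_dim))  # obs -> latent
        self.V = rng.normal(scale=0.1, size=(obs_dim, latent_dim))  # latent -> predicted next obs

    def encode(self, obs):
        return self.W @ obs

    def update(self, obs, next_obs, lr=1e-3):
        latent = self.W @ obs
        err = self.V @ latent - next_obs        # prediction error only
        grad_V = np.outer(err, latent)          # grad of 0.5*||err||^2 w.r.t. V
        grad_W = np.outer(self.V.T @ err, obs)  # grad w.r.t. W
        self.V -= lr * grad_V
        self.W -= lr * grad_W

class RewardDrivenSelector:
    """Reward-based piece: the only module that sees reward, and all it does is
    pick actions on top of the latents the encoder provides."""
    def __init__(self, latent_dim, n_actions):
        self.Q = rng.normal(scale=0.1, size=(n_actions, latent_dim))

    def act(self, latent):
        return int(np.argmax(self.Q @ latent))

    def update(self, latent, action, reward, lr=1e-2):
        # nudge the chosen action's value estimate toward the observed reward
        self.Q[action] += lr * (reward - self.Q[action] @ latent) * latent

# Toy usage, with random vectors standing in for an environment:
encoder = PredictiveEncoder(obs_dim=8, latent_dim=4)
selector = RewardDrivenSelector(latent_dim=4, n_actions=3)
obs, next_obs = rng.normal(size=8), rng.normal(size=8)
encoder.update(obs, next_obs)                             # shaped by prediction error alone
action = selector.act(encoder.encode(obs))
selector.update(encoder.encode(obs), action, reward=1.0)  # reward touches only this part
```

The sketch’s only point is the split itself: everything representational is shaped by prediction error, while reward only ever touches the small action-selection piece, and actions come out of that piece, which is the intuition behind focusing safety attention on the reward side.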