In the context of Roon’s thread: while I agree that basically everyone but Gwern was surprised that scaling worked as well as it has, and I agree that Ethan Caraballo’s statement that takeover has been disproved is wrong, I don’t think this can all be chalked up to mundane harm being lower than expected. I think there’s a deeper reason, one that generalizes fairly far, for why LW mispredicted the mundane harms of GPT-2 through GPT-4:
Specifically, people didn’t realize that constraints on instrumental convergence were necessary for capabilities to work at all, and so they assumed that far more unconstrained, instrumentally convergent AIs could actually work.
In one sense, instrumental convergence is highly incentivized for a lot of tasks, and I do think LW was correct to note that instrumental convergence is quite valuable for AI capabilities.
But where I think people went wrong was in assuming that very unconstrained instrumental convergence, like the strong human drive to take power, was a useful default case. The process by which humans acquired instrumental convergence that is uncontrollable from a chimp’s perspective was so inefficient and took so long that the very sparse RL humans had is unlikely to be replicated: people want capabilities faster, and that requires putting more constraints on instrumental convergence for useful capabilities to emerge from methods like deep RL.
(One reason I hate the chimpanzee/gorilla/orangutan-to-human analogy for AI safety is that those species didn’t design our datasets, or even try to control the first humans in any way, so there’s a huge alignment-relevant disanalogy right there.)
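To make the contrast above concrete, here is a minimal toy sketch (my own illustration, not anything from Roon’s thread or a real training setup; the function names and numbers are made up): an evolution-style signal that scores a whole trajectory with a single bit at the end, versus a dense per-step objective that constrains the learner at every decision point.

```python
import random

def sparse_reward(trajectory: list[int], goal: int) -> float:
    """Evolution-style feedback: one bit of signal for the whole episode."""
    return 1.0 if trajectory[-1] == goal else 0.0

def dense_loss(actions: list[int], targets: list[int]) -> float:
    """Supervised / shaped feedback: every step is scored against a target,
    constraining the learner at each decision point."""
    return sum(abs(a - t) for a, t in zip(actions, targets)) / len(targets)

random.seed(0)
traj = [random.randint(0, 3) for _ in range(8)]
print("sparse signal:", sparse_reward(traj, goal=3))        # one number per episode
print("dense signal: ", dense_loss(traj, targets=[3] * 8))  # graded feedback at every step
```

The point is only about signal density: the sparse case leaves behavior almost entirely unconstrained between the start and end of the episode, while the dense case pins it down at every step, which is the kind of constraint that makes capabilities tractable.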
Another way to state this is that the boundary between capabilities and alignment turned out to be less hard, and more porous, than people thought, because instrumental convergence had to be constrained anyway to make capabilities work.
Of course, predictors like GPT are the extreme end of constraining instrumental convergence: they can’t go very far beyond modeling the next token, and they still produce amazing results, like world models.
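As a concrete, simplified illustration of what “the extreme end of constraining instrumental convergence” means here (a toy sketch of my own, not anyone’s actual training code): the next-token objective scores a model only on the probability it assigns to the token that actually came next, so nothing beyond prediction ever enters the training signal.

```python
import numpy as np

def next_token_loss(logits: np.ndarray, next_token_id: int) -> float:
    """Cross-entropy for a single prediction step.

    logits: unnormalized scores over the vocabulary for the next token.
    next_token_id: index of the token that actually came next.
    """
    shifted = logits - logits.max()                  # stabilize the softmax
    probs = np.exp(shifted) / np.exp(shifted).sum()
    # The loss depends only on the probability assigned to the observed
    # next token; the objective never references anything beyond prediction.
    return float(-np.log(probs[next_token_id]))

# Toy usage: a 5-token vocabulary where the true next token is index 2.
print(next_token_loss(np.array([0.1, 0.3, 2.0, -1.0, 0.5]), 2))
```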
But for most practical purposes, the takeaway is that AIs will likely always be more constrained in their instrumental convergence than humans, at least early in training, for both capabilities and alignment reasons, and that the human case is not the median case for controllability but a far-off outlier.