I don’t see why we need to “perfectly” and “fully” solve “the” core challenges of alignment (as if that’s a thing that anyone knows exists). Uncharitably, it seems like many people (and I’m not mostly thinking of Sam here) have their empirically grounded models of “prosaic” AI, and then there’s the “real” alignment regime where they toss out most of their prosaic models and rely on plausible but untested memes repeated from the early days of LessWrong.
Some potential alternative candidates to e.g. ‘goal-directedness’ and related ‘memes’ (candidates that I’d personally find more likely, though I’m probably overall closer to the ‘prosaic’ and ‘optimistic’ sides):
Systems closer to human-level could be better at situational awareness and other prerequisites for scheming, which could make it much more difficult to obtain evidence about their alignment from purely behavioral evaluations (which do seem to be among the main sources of evidence for safety today, alongside inability arguments). Meanwhile, techniques based on model internals don’t yet seem obviously good enough to provide similar levels of evidence and confidence.
Powerful systems will probably be capable of automating at least some parts of AI R&D, and there will likely be strong incentives to use such automation. This could lead to e.g. weaker human oversight, faster AI R&D progress and takeoff, the breaking of some properties that currently favor safety (e.g. maybe new architectures are discovered that need less CoT and fewer similar intermediate outputs, so the systems are less transparent), etc. It also seems at least plausible that any misalignment could be magnified by this faster pace, though I’m unsure here; e.g. I’m quite optimistic about various arguments about corrigibility, ‘broad basins of alignment’ and the like.
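To make the “purely behavioral evaluations” point above concrete, here is a minimal sketch of what such an evaluation looks like: it only ever sees the model’s visible outputs, never its internals. The `query_model` helper, the probe prompts, and the string markers are hypothetical placeholders (not any real API or eval suite), and the sketch is illustrative rather than a recommended methodology.

```python
# Illustrative sketch only: a "purely behavioral" safety evaluation that relies
# entirely on a black-box model's visible outputs. `query_model`, the probe
# prompts, and UNSAFE_MARKERS are hypothetical placeholders.

def query_model(prompt: str) -> str:
    """Placeholder for calling some black-box model; returns its text response."""
    raise NotImplementedError

UNSAFE_MARKERS = ["step-by-step instructions for", "here is how to bypass"]

def behavioral_eval(probe_prompts: list[str]) -> float:
    """Return the fraction of probe prompts that elicit an apparently unsafe response.

    Note the limitation discussed above: a sufficiently situationally aware model
    could recognize these probes as an evaluation and answer safely regardless of
    how it would behave in deployment, so a low score is only weak evidence.
    """
    flagged = 0
    for prompt in probe_prompts:
        response = query_model(prompt).lower()
        if any(marker in response for marker in UNSAFE_MARKERS):
            flagged += 1
    return flagged / len(probe_prompts)
```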