I believe our disagreement stems from the fact that I am skeptical of the idea that statements made about contemporary language models can be extrapolated to apply to all existentially risky AI systems.
I definitely agree that some version of this is the crux, at least regarding how well the result generalizes. I think it applies more generally than just to contemporary language models, and I suspect it applies to almost all AI that can use Pretraining from Human Feedback, which is offline training. So the crux is really how much we can expect an alignment technique to generalize and scale.