I don’t personally expect that subproblem 1, in its purest form, is relevant to the exact LLM architectures used today—i.e. stacked transformers trained mainly on pure text prediction. On the other hand, I’m not extremely confident that subproblem 1 isn’t relevant; I wouldn’t particularly want to rely on subproblem 1’s irrelevance as a foundational assumption.
Also, I definitely do not expect that it will be really obvious in advance when someone changes the core architecture enough that subproblem 1 becomes relevant. Really obvious that we’re not just training stacked transformers on pure text prediction, yes. Really obvious that we’re doing something dangerous, no. The space of possibilities is large, and predicting how different setups behave in advance is not easy.
All that said, I do generally consider subproblem 2 the more relevant one.