I’m curious, do you actually endorse subproblem 1?
Under the current ML paradigm (transformers), the model becoming dangerous during training seems extremely implausible to me.
I could imagine an ML paradigm where subproblem 1 was real (for example, training an RL agent to hack computers, and it unsandboxes itself). But it seems like it would be really obvious that you were doing something dangerous beforehand.
I don’t personally expect that subproblem 1, in its purest form, is relevant to the exact LLM architectures used today, i.e. stacked transformers trained mainly on pure text prediction. On the other hand, I’m not extremely highly confident that subproblem 1 isn’t relevant; I wouldn’t particularly want to rely on subproblem 1's irrelevance as a foundational assumption.
Also, I definitely do not expect that it will be really obvious in advance when someone changes the core architecture enough that subproblem 1 becomes relevant. Really obvious that we’re not just training stacked transformers on pure text prediction, yes. Really obvious that we’re doing something dangerous, no. The space of possibilities is large, and predicting how different setups behave in advance is not easy.
All that said, I do generally consider subproblem 2 the more relevant one.