a claim I’ve been making irl for a while but have never gotten around to writing up: current LLMs are benign not because of the language modelling objective, but because of the generalization properties of current NNs (or, more precisely, the lack thereof). with better generalization, LLMs become dangerous too. we can also notice that RL policies are benign in the same ways, which should not be the case if the objective were the core reason. one way reasoning about this goes wrong is positing LLMs that are extremely good at generalizing (especially to superhuman capabilities) while simultaneously assuming they retain the same safety properties. afaict something like CPM avoids this failure mode, but lots of arguments don’t
what is the “language models are benign because of the language modeling objective” take?
basically the Simulators-style take, afaict