One of the updates this paper (once again) reinforced for me is that human psychology applies to LLMs, since it was trained (one might almost say distilled) into them during pretraining, and it applies better to larger LLMs because they have more capacity to absorb it. I’m a lot more concerned about RL than I am about SGD fine-tuning or supervised fine-tuning: there, if you’re careful enough about your training set, it’s fairly predictable what effect it should have. My suggestion is that before we start using RL (if we do at all), we should fine-tune our LLMs to manifest behavior and psychology that is pleasant, kindly, honest, and selfless (along the lines I discuss in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?).
That obviously doesn’t completely rule out deceptive alignment, but I think it could make a big difference to the inductive prior of it arising if it’s out of character.