This is an excellent succinct restatement of the problem. I’m going to point people who understand LLMs but not AI safety to this post as an intro.
The point about desperate moves to prevent better AGIs from being developed is a good one, and pretty new to me.
The specific scenario is also useful. I agree (and I think many alignment people do) that this is what people are planning to do, that it might work, and that it’s a terrible idea.
It isn't necessary to use RL to make an LLM agent globally purposeful, but RL might very well be useful, and it's the most obvious route. I think that route is a lot more dangerous than keeping RL entirely out of the picture. A language model agent that literally asks itself "what do I want and how do I get it?" can operate entirely in the language-prediction domain, with zero RL (or more likely, a little RL in RLHF or RLAIF fine-tuning, but none applied directly to actions). My posts "Goals selected from learned knowledge: an alternative to RL alignment" and "Instruction-following AGI is easier and more likely than value aligned AGI" detail how that would work.
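To make that concrete, here is a minimal sketch of the kind of loop I have in mind, in Python. The `complete` and `execute_action` functions are hypothetical placeholders I'm using for illustration, not any particular API; the point is only that the agent's "wanting" and planning are ordinary predicted text, and no RL is applied to action selection.

```python
# Minimal sketch (my own illustration, not the posts' exact design) of a
# language-model agent whose goals and plans are plain text predictions,
# with no RL applied to action selection.

def complete(prompt: str) -> str:
    """Hypothetical placeholder for a text-completion call to a base
    (or lightly RLHF'd) model; wire this to your model of choice."""
    raise NotImplementedError


def execute_action(action_text: str) -> str:
    """Hypothetical placeholder for tool use / environment interaction."""
    return f"(pretend result of: {action_text})"


def run_agent(instruction: str, max_steps: int = 10) -> None:
    memory: list[str] = []  # plain-text episodic memory

    for step in range(max_steps):
        notes = "\n".join(memory)

        # The agent literally asks itself what it wants and how to get it,
        # entirely in the language-prediction domain.
        plan = complete(
            f"Instruction: {instruction}\n"
            f"Notes so far:\n{notes}\n"
            "Question: What do I want right now, and what single next "
            "action gets me closer to it?\nAnswer:"
        )

        # Actions are themselves text (e.g. a written-out tool call),
        # executed by ordinary scaffolding, not by an RL-trained policy.
        result = execute_action(plan)
        memory.append(f"Step {step}: planned {plan!r}, observed {result!r}")

        if "GOAL ACHIEVED" in result:
            break
```

In a setup like this the goal lives in the instruction and the self-prompt, not in any reward signal; whatever RLHF-style tuning the underlying model has shapes its general behavior rather than training these actions directly.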
Of course I expect people to do whatever works, even if it's likely to get us all killed. One hope is that using RL for long-horizon tasks turns out to be less efficient than working out how to do them through abstract linguistic reasoning, since RL training requires actually performing those long tasks, and generalization between useful tasks might be difficult.