Compared to what?
If you want an agentic system (and I think many humans do, because agents can get things done), you’ve got to give it goals somehow. RL is one way to do that. The question of whether that’s less safe isn’t meaningful without comparing it to another method of giving it goals.
The method I think is both safer and implementable is giving goals in natural language, to a system that primarily “thinks” in natural language. I think this is markedly safer than any RL proposal anyone has come up with so far. And there are some other options for specifying goals without using RL, each of which does seem safer to me:
Goals selected from learned knowledge: an alternative to RL alignment
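To make the natural-language-goals idea above concrete, here is a minimal sketch (all names hypothetical; the `llm` callable stands in for any text-in, text-out model backend) of an agent whose goal lives in inspectable text rather than in a reward signal:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class NaturalLanguageAgent:
    llm: Callable[[str], str]  # any text-in, text-out model call
    goal: str                  # the goal, stated in plain language

    def propose_action(self, observation: str) -> str:
        # The goal is part of the prompt the system reasons over in natural
        # language; it is not a scalar reward shaping the weights.
        prompt = (
            f"Goal: {self.goal}\n"
            f"Observation: {observation}\n"
            "Describe the next action and how it serves the goal."
        )
        return self.llm(prompt)

# Stand-in model; the point is only that the goal is readable and editable text.
agent = NaturalLanguageAgent(
    llm=lambda p: f"(model output given prompt starting: {p[:30]!r})",
    goal="Summarize the unread messages without sending anything.",
)
print(agent.propose_action("Inbox has 12 unread messages."))
```

The safety-relevant property being pointed at is that the goal stays human-readable and directly editable, rather than being implicit in whatever a reward signal happened to reinforce.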
I think it’s still valid to ask in the abstract whether RL is a particularly dangerous approach to training an AI system.
Surely asking if anything is safer is only sensible when comparing it to something. Are you comparing it to some implicit expected-if-not RL method of alignment? I don’t think we have a commonly shared concept of what that would be. That’s why I’m pointing to some explicit alternatives in that post.
I’ve heard people suggest that they have arguments related to RL being particularly dangerous, although I have to admit that I’m struggling to find these arguments at the moment. I don’t know, perhaps that helps clarify why I’ve framed the question the way that I’ve framed it?
I agree, I have heard that claim many times, probably including the vague claim that it’s “more dangerous” than a poorly-defined imagined alternative. A bunch of pessimistic stuff in the vein of List of Lethalities focuses on reinforcement learning, analyzing how and why that is likely to go wrong. That’s what started me thinking about true alternatives.
So yes, that does clarify why you’ve framed it that way. And I think it’s a useful question.
In fact, I would’ve been prone to say “RL is unsafe and shouldn’t be used”. Porby’s answer to your question is insightful; it notes that other types of learning aren’t that different in kind. What matters is how the RL or other learning is done.
One reason that non-RL approaches (at least the few I know of) seem safer is that they rely on prediction or other unsupervised learning to create good, reliable representations of the world, including goals for agents. That type of learning is typically better because you can do more of it: you don’t need a limited set of human-labeled data, which is always many orders of magnitude scarcer than data gathered from sensing the world (e.g., language input for LLMs, images for vision, etc.). The alternative to human labels is a reward-labeling algorithm that can attach reward signals to any data, but that seems unreliable, in that we don’t even have good guesses at an algorithm that can identify human values, or even reliable instruction-following.
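As a rough illustration of the data-scale point (hypothetical function names; a sketch, not a claim about any particular training stack): self-supervised prediction mints a training target from every position in raw data, while reward-style training needs a label per trajectory, supplied either by scarce human feedback or by a reward-labeling algorithm we don’t know how to write reliably.

```python
def self_supervised_targets(tokens):
    # Next-token prediction: every position in raw data yields a (context, target)
    # pair, so the training signal scales with however much data you can sense.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def reward_labeled_targets(trajectories, reward_labeler):
    # RL-style training: each trajectory needs a reward attached, supplied either
    # by humans (scarce) or by a labeling algorithm we don't know how to write
    # reliably for "human values" or even instruction-following.
    return [(traj, reward_labeler(traj)) for traj in trajectories]

# One short sentence already yields several prediction targets "for free".
tokens = "the cat sat on the mat".split()
print(len(self_supervised_targets(tokens)))  # -> 5
```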
I sometimes say things kinda like that, e.g. here.