I agree, I have heard that claim many times, probably including the vague claim that it’s “more dangerous” than a poorly-defined imagined alternative. A bunch of pessimistic stuff in the vein of List of Lethalities focuses on reinforcement learning, analyzing how and why that is likely to go wrong. That’s what started me thinking about true alternatives.
So yes, that does clarify why you’ve framed it that way. And I think it’s a useful question.
In fact, I would’ve been prone to say “RL is unsafe and shouldn’t be used”. Porby’s answer to your question is insightful; it notes that other types of learning aren’t that different in kind. It depends on how the RL or other learning is done.
One reason that non-RL approaches (at least the few I know of) seem safer is that they rely on prediction or other unsupervised learning to create good, reliable representations of the world, including goals for agents. That type of learning is typically better because you can do more of it. You don’t need a limited set of human-labeled data, which is always many orders of magnitude scarcer than data gathered from sensing the world (e.g., language input for LLMs, images for vision, etc.). The alternative to human labels is a reward-labeling algorithm that can attach reward signals to any data, but that seems unreliable, in that we don’t have even good guesses at an algorithm that can identify human values or even reliable instruction-following.