What do you think is important about pure RL agents vs. RL-finetuned language models? I expect the first powerful systems to include significant pretraining, so I don’t really think much about agents that are trained only with RL (if that’s what you were referring to).
How were you thinking this would measure Goodharting in particular?
I agree that seems like a reasonable benchmark to have for getting ML researchers/academics to work on imitation learning/value learning. I don’t think I’m likely to prioritize it—I don’t think ‘inability to learn human values’ is going to be a problem for advanced AI, so I’m less excited about value learning as a key thing to work on.
By pure RL, I mean systems whose output channel is directly optimized only to maximize some value function, even though it might be possible to create other kinds of algorithms capable of getting good scores on the benchmark.
I don’t think that the lack of pretraining is a good thing in itself, but I do think you lose a lot when you move from playing video games to completing textual tasks.
If someone is told to get a high score in a video game, we have access to the exact value function they are trying to maximize. So when the AI is either playing the game in the human’s place or trying to help them, we can directly evaluate its performance without having to worry about deception: if it learns some proxy values and starts optimizing them to the point of Goodharting, it will simply get a lower score. On most textual tasks that aren’t purely about information manipulation, on the other hand, the AI could be making up plausible-sounding nonsense about the consequences of its actions, and we would have no way of knowing.
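To make that concrete, here is a minimal, purely illustrative sketch (the coin-collecting setup, the proxy, and the hill-climbing agent are all my own invented assumptions, not part of any actual benchmark) of why an observable ground-truth score makes Goodharting measurable: let the agent optimize a misspecified proxy, then compare the proxy reward it achieves against the true game score it actually earns.

```python
# Toy illustration (all assumptions mine): an agent optimizes a misspecified
# proxy reward, and because the true game score is directly observable we can
# measure the Goodhart gap instead of relying on the agent's own reports.

def true_score(actions):
    # Ground-truth score: collecting "coins" (action == 1) is good,
    # but collecting more than 5 is penalized.
    coins = sum(1 for a in actions if a == 1)
    return coins if coins <= 5 else 5 - (coins - 5)

def proxy_reward(actions):
    # Misspecified proxy the agent actually optimizes: "more coins is always better".
    return sum(1 for a in actions if a == 1)

def greedy_agent(horizon, reward_fn):
    # Trivial hill climber that maximizes whatever reward function it is handed.
    actions = [0] * horizon
    for i in range(horizon):
        candidate = actions.copy()
        candidate[i] = 1
        if reward_fn(candidate) >= reward_fn(actions):
            actions = candidate
    return actions

behaviour = greedy_agent(horizon=10, reward_fn=proxy_reward)
print("proxy reward:", proxy_reward(behaviour))                          # 10
print("true score:  ", true_score(behaviour))                            # 0
print("Goodhart gap:", proxy_reward(behaviour) - true_score(behaviour))  # 10
```

The point is only that the gap is measurable because the true score comes from the environment itself rather than from anything the agent tells us; on open-ended textual tasks there is no analogous ground truth to compare against.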
From the AI’s point of view, being able to see the state of the thing we care about also seems very useful; preferences are about reality, after all. It’s not at all obvious that internet text contains enough information to learn a model of human values that is useful in the real world. Training it on other sources of information that more closely represent reality, like online videos, might be enough, but that seems closer to my idea than to yours, since video can’t be used to perform language-model-like imitation learning.
Additionally, if by “inability to learn human values” you mean isolating them well enough that they can in principle be optimized to reach superhuman performance, as opposed to leaving them buried in the model’s world model, I don’t agree that this will happen by default. Right now we don’t have any implementations of proper value learning algorithms, nor do I think that any known theoretical algorithm (like PreDCA) would work even with limitless computing power. If you can show that I’m wrong, that would surprise me a lot, and I think it could change many people’s research directions and the chances they give to alignment being solvable.