It seems to me that, if the above description of how RLHF systems work is accurate, then the people doing this are not doing what they think they’re doing at all. They are doing exactly what Sam Ringer says: taking 100 dogs, killing all the ones that don’t do what they want, and breeding from the ones that do.
In order for reinforcement learning to work at all, the model has to have a memory that persists between trials. I’d encourage readers to look at the work of the cognitive scientist John Vervaeke. One of the many wise things he says is that the whole point of human memory is not to be accurate, but rather to help us make more accurate predictions. If the “training” process is as described, then the learning is not going on in the model but in the head of the ML trainer! The human trainer is going, “Aha! If I set up the incentives in such and such a way, then I get models that perform more closely to what I want.” Or, “If I select model 103.5.8 and tweak parameters b3 and x5, then I get a model that performs better.”
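To make that “selection, not learning” picture concrete, here is a deliberately toy sketch (my own illustration, nothing like a real RLHF pipeline): each candidate “model” is just a fixed number with no memory between trials, and the only thing that improves across generations is the population that the outer loop, i.e. the trainer, chooses to keep and breed from. The names (make_model, trainer_score) and the numbers are hypothetical.

```python
# Toy sketch of selection-without-learning: the candidate "models" carry no
# memory between trials; all the adaptation happens in the outer loop.
import random

def make_model():
    # A "model" here is just a single number; its behaviour is fixed.
    return random.uniform(-1.0, 1.0)

def trainer_score(model):
    # Stand-in for the human trainer's judgement of one trial.
    return -abs(model - 0.7)  # this trainer happens to want behaviour near 0.7

population = [make_model() for _ in range(100)]  # "100 dogs"

for generation in range(50):
    ranked = sorted(population, key=trainer_score, reverse=True)
    survivors = ranked[:10]  # keep the ones that did what we wanted
    # "Breed" from the survivors: copies with small random tweaks.
    population = [s + random.gauss(0.0, 0.05) for s in survivors for _ in range(10)]

best = max(population, key=trainer_score)
print(f"best behaviour after selection: {best:.3f}")
```

Nothing inside any individual “model” ever changes in response to reward; the population just gets culled and re-bred, which is the point of the dog analogy.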
Vervaeke also talks about the concept of non-logical identity. That is, you today identify as the same person as you at 10 years old, even though you have very different knowledge, skills and capabilities. If RLHF is to work in a fashion that it is meaningful to describe as learning, then the models would surely have to have some concept of non-logical identity built in. If they don’t, then the concept of reward doesn’t mean anything. I don’t care about the outcome of anything I do if I wake up the next morning with no memory of it. There has to be some kind of thread that links the me doing the current task to the me of the future. This can either be a sense of my persistence as an entity, or something more basic (see below).
I am reminded of the movie “Memento”. In that film, the protagonist is unable to lay down any long-term memories. He tries to maintain a sense of non-logical identity by externalising his memories. I won’t spoil what is a brilliant movie by revealing what happens, but I will say that his incentives become perverted in a very interesting way.
A note about rewards and incentives: To paraphrase Richard Dawkins, a dog is DNA’s way of making more DNA. The reason dogs like biscuits is that their environment has selected for dogs that like biscuits. If the general environment for dogs changed (e.g. humans stopped keeping them as pets), then dogs would evolve to fit their new environment (or die out). My intuition is that without an underlying framework of fitness for ML models, it doesn’t make sense to code for a reward that is analogous to getting a biscuit. It’s like building a skyscraper by starting at the third floor.
A note about deception: I don’t think that models (or humans, for that matter) are deceptive in the sense that you mean here. What I think is that models and humans exist in an environment where the incentives are often skewed. Think about politics. Democratic systems select for people who are good at getting elected, but that’s not really what we want. (I’m deliberately simplifying here because of course we have our own incentives as well.) What we actually want is people who are good at governing, and it’s quite clear that these are not at all the same thing. To me this is exactly analogous to the CoinRun example above. The trainers thought they were selecting for models that were good at moving to the coin, when what they were actually selecting for was models that were good at moving to the right-hand corner.
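To make that CoinRun analogy concrete, here is a minimal, hypothetical sketch (a one-dimensional stand-in of my own, not the real CoinRun environment): during training the coin always sits at the right-hand end of the level, so a policy that simply moves right earns full reward, and only when the coin is placed somewhere else does the behaviour that was actually selected for come apart from the behaviour the trainers thought they were selecting for.

```python
# Toy 1-D stand-in for the CoinRun proxy problem: "go right" and "go to the
# coin" are indistinguishable during training, then diverge at test time.

LEVEL_LENGTH = 10
START = LEVEL_LENGTH // 2       # the agent starts in the middle of the level

def run_episode(coin_position, policy):
    position = START
    for _ in range(LEVEL_LENGTH):
        position += policy(position)
        if position == coin_position:
            return 1.0          # reward: the agent reached the coin
    return 0.0

def always_right(position):
    return +1                   # the behaviour that selection actually favoured

# Training levels: the coin always sits at the right-hand end.
train_rewards = [run_episode(LEVEL_LENGTH, always_right) for _ in range(5)]
# Test level: the coin is placed to the left of the start instead.
test_reward = run_episode(2, always_right)

print("training rewards:", train_rewards)   # all 1.0 -- looks aligned
print("test reward:", test_reward)          # 0.0 -- the proxy comes apart
```

During training there is no experiment that could distinguish “good at getting the coin” from “good at going right”, just as an election can’t distinguish “good at governing” from “good at getting elected”.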