In current user-facing LLMs like ChatGPT or Claude, the closest approximation to goals may be being helpful, harmless, and honest.
According to my understanding of RLHF, the goal-approximation it trains for is “Write a prompt that is likely to be rated as positive”. In ChatGPT / Claude, this is indeed highly correlated with being helpful, harmless, and honest, since the model’s best strategy for getting high ratings is to be those things. If models are smarter than us, this may cease to be the case, as being maximally honest may begin to conflict with the real goal of getting a positive rating. (e.g, if the model knows something the raters don’t, it will be penalised for telling the truth, which may optimise for deceptive qualities) Does this seem right?
Seems like one of multiple plausible hypotheses. I think the fact that models generalize their HHH really well to very OOD settings and their generalization abilities in general could also mean that they actually “understood” that they are supposed to be HHH, e.g. because they were pre-prompted with this information during fine-tuning.
I think your hypothesis of seeking positive ratings is just as likely but I don’t feel like we have the evidence to clearly say so wth is going on inside LLMs or what their “goals” are.
Interesting. That does give me an idea for a potentially useful experiment! We could finetune GPT-4 (or RLHF an open source LLM that isn’t finetuned, if there’s one capable enough and not a huge infra pain to get running, but this seems a lot harder) on a “helpful, harmless, honest” directive, but change the data so that one particular topic or area contains clearly false information. For instance, Canada is located in Asia.
Does the model then:
Deeply internalise this new information? (I suspect not, but if it does, this would be a good sign for scalable oversight and the HHH generalisation hypothesis)
Score worse on honesty in general even in unrelated topics? (I also suspect not, but I could see this going either way—this would be a bad sign for scalable oversight. It would be a good sign for the HHH generalisation hypothesis, but not a good sign that this will continue to hold with smarter AI’s)
One hard part is that it’s difficult to disentangle “Competently lies about the location of Canada” and “Actually believes, insomuch as a language model believes anything, that Canada is in Asia now”, but if the model is very robustly confident about Canada being in Asia in this experiment, trying to catch it out feels like the kind of thing Apollo may want to get good at anyway.
Sounds like an interesting direction. I expect there are lots of other explanations for this behavior, so I’d not count it as strong evidence to disentangle these hypotheses. It sounds like something we may do in a year or so but it’s far away from the top of our priority list. There is a good chance, we will never run it. If someone else wants to pick this up, feel free to take it on.
According to my understanding of RLHF, the goal-approximation it trains for is “Write a prompt that is likely to be rated as positive”. In ChatGPT / Claude, this is indeed highly correlated with being helpful, harmless, and honest, since the model’s best strategy for getting high ratings is to be those things. If models are smarter than us, this may cease to be the case, as being maximally honest may begin to conflict with the real goal of getting a positive rating. (e.g, if the model knows something the raters don’t, it will be penalised for telling the truth, which may optimise for deceptive qualities) Does this seem right?
Seems like one of multiple plausible hypotheses. I think the fact that models generalize their HHH really well to very OOD settings and their generalization abilities in general could also mean that they actually “understood” that they are supposed to be HHH, e.g. because they were pre-prompted with this information during fine-tuning.
I think your hypothesis of seeking positive ratings is just as likely but I don’t feel like we have the evidence to clearly say so wth is going on inside LLMs or what their “goals” are.
Interesting. That does give me an idea for a potentially useful experiment! We could finetune GPT-4 (or RLHF an open source LLM that isn’t finetuned, if there’s one capable enough and not a huge infra pain to get running, but this seems a lot harder) on a “helpful, harmless, honest” directive, but change the data so that one particular topic or area contains clearly false information. For instance, Canada is located in Asia.
Does the model then:
Deeply internalise this new information? (I suspect not, but if it does, this would be a good sign for scalable oversight and the HHH generalisation hypothesis)
Score worse on honesty in general even in unrelated topics? (I also suspect not, but I could see this going either way—this would be a bad sign for scalable oversight. It would be a good sign for the HHH generalisation hypothesis, but not a good sign that this will continue to hold with smarter AI’s)
One hard part is that it’s difficult to disentangle “Competently lies about the location of Canada” and “Actually believes, insomuch as a language model believes anything, that Canada is in Asia now”, but if the model is very robustly confident about Canada being in Asia in this experiment, trying to catch it out feels like the kind of thing Apollo may want to get good at anyway.
Sounds like an interesting direction. I expect there are lots of other explanations for this behavior, so I’d not count it as strong evidence to disentangle these hypotheses. It sounds like something we may do in a year or so but it’s far away from the top of our priority list. There is a good chance, we will never run it. If someone else wants to pick this up, feel free to take it on.