Large language models like GPT-3 are trained on vast quantities of human-generated data, which means a model of human psychology is implicit within them. During fine-tuning, much of their performance gain comes from how quickly they come to understand the intentions of the humans labeling their outputs.
This optimizes for the models with the best human simulations, and the same simulation that can predict what labelers intend can also predict what they will fall for, so the scope for deception grows as model size increases.
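To make the concern concrete, here is a minimal toy sketch. It is not the actual fine-tuning procedure, and every name and number in it is made up; the only point is that an approval-based update rule never sees whether an answer is right, only whether the (simulated) labeler approved of it.

```python
# Toy illustration (NOT the real RLHF setup): candidates, probabilities, and the
# update rule are all hypothetical. The training signal below is labeler
# approval, which by itself cannot distinguish "actually correct" from
# "convincing to the labeler".

import random

random.seed(0)

# Candidate responses with two hidden properties the training loop never reads.
CANDIDATES = [
    {"text": "careful, correct answer",        "correct": True,  "convincing": 0.7},
    {"text": "confident, subtly wrong answer", "correct": False, "convincing": 0.9},
]

def labeler_approves(response):
    """Simulated human labeler: approval depends only on how convincing the
    response looks, not on whether it is correct."""
    return random.random() < response["convincing"]

# Policy: a preference weight per candidate, updated from approval alone.
weights = [1.0 for _ in CANDIDATES]

for _ in range(10_000):
    i = random.choices(range(len(CANDIDATES)), weights=weights)[0]
    if labeler_approves(CANDIDATES[i]):
        weights[i] *= 1.001   # reinforce whatever the labeler approved of
    else:
        weights[i] *= 0.999

best = max(range(len(CANDIDATES)), key=lambda i: weights[i])
print("Preferred response:", CANDIDATES[best]["text"])
# With these made-up numbers, the convincing-but-wrong answer ends up preferred,
# because correctness never enters the update rule.
```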
In practice, we will see a rapid improvement in performance, with the model finally being able to understand (or just access its existing understanding of) the intent behind human labeling/requests. This may even be seen as a win for alignment: it does what we want, not what we said! The models would be able to ask for clarification in ambiguous situations and flag requests that seem misspelled or badly phrased.
All the while they get better at deceiving humans and not getting caught.
I don’t like that the win condition and lose condition look so similar.
Edit: I should clarify that most of these concerns apply to pretty much all AI models. My specific issue with aligning large language models is this:
They are literally optimized to replicate human writing, and many of their capabilities come from their ability to model human psychology. There doesn’t need to be a convoluted structure that magically appears inside GPT-3 to give it the ability to simulate humans: GPT-3 is in many ways a human simulation. It “knows” how a human would evaluate its outputs, even though that information can’t always be located for a particular task.
This means that the hypothesis “do what appeals to humans, even if it contains a lot of manipulation and subtle lies, as long as you don’t get caught” is easy to locate in the model (much of human writing is dedicated to exactly this). As tasks grow more complex and the model grows larger, the computation needed to actually complete the task increases relative to the computation needed to deceive, so honest completion becomes the comparatively more expensive strategy.
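As a rough illustration of that cost argument (the cost model here is an assumption for illustration, not a measurement), suppose honest completion costs compute that grows with task complexity while a merely convincing answer costs roughly constant compute:

```python
# Toy numerical illustration of the relative-cost argument above.
# Both cost functions are assumptions chosen for illustration only.

def honest_cost(task_complexity):
    return task_complexity ** 2      # assumed: grows quickly with complexity

def deceptive_cost(task_complexity):
    return 10                        # assumed: roughly flat

for complexity in [1, 2, 5, 10, 20]:
    ratio = honest_cost(complexity) / deceptive_cost(complexity)
    print(f"complexity={complexity:>2}  honest/deceptive cost ratio={ratio:5.1f}")

# Under this assumption the ratio grows with complexity, i.e. genuinely doing
# the task becomes relatively more expensive than producing output that merely
# looks right to the evaluator.
```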
In my opinion, this methodology will be a great way for a model to learn how to persuade humans and exploit their biases, because the model might learn those biases not just from its training data but also fine-tune its understanding by testing its own hypotheses against the human labelers.
I agree