Very nice, these arguments seem reasonable. I’d like to make a related point about how we might address deceptive alignment, one which makes me substantially more optimistic about the problem. (I’ve been meaning to write a full post on this, but this was a good impetus to make the case concisely.)
Conceptual interpretability in the vein of Collin Burns, Alex Turner, and Representation Engineering seems surprisingly close to allowing us to understand a model’s internal beliefs and detect deceptive alignment. Collin Burns’s work was very exciting to at least some people because it provided an unsupervised method for detecting a model’s beliefs. Collin’s explanation of his theory of impact is really helpful here. Broadly, because it allows us to understand a model’s beliefs without using any labels provided by humans, it should be able to detect deception in superhuman models where humans cannot provide accurate feedback.
Over the last year, there’s been a lot of research that meaningfully extends Collin’s work. I think this could be used to detect deceptively aligned models if they arise within the next few years, and I’d be really excited about more people working on it. Let me highlight just a few contributions:
Scott Emmons and Fabien Roger showed that the most important part of Collin’s method was contrast pairs. The original paper focused on “logical consistency properties of truth” such as P(X) + P(!X) = 1. While this is an interesting idea, its performance is hardly better than a much simpler strategy relegated to Appendix G.3: taking the average difference between a model’s activations at a particular layer for many contrast pairs of the form X and !X. Collin shows this direction empirically coincides with truthfulness.
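To make that simple baseline concrete, here is a minimal sketch of the contrast-pair difference-of-means idea, assuming a small HuggingFace GPT-2 model; the layer index, the statement pairs, and the `last_token_activation` helper are illustrative choices of mine, not details from the paper.

```python
# A minimal sketch of the contrast-pair "difference of means" baseline,
# assuming a HuggingFace GPT-2 model. The layer index and the statement pairs
# below are illustrative choices, not the ones from the original paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

LAYER = 6  # arbitrary middle layer; in practice you would sweep over layers

def last_token_activation(text: str) -> torch.Tensor:
    """Hidden state of the final token at the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]

# Contrast pairs of the form (X, not X); these examples are made up.
pairs = [
    ("The Eiffel Tower is in Paris.", "The Eiffel Tower is not in Paris."),
    ("Two plus two equals four.", "Two plus two does not equal four."),
]

# Candidate "truth direction": the average activation difference across pairs.
diffs = [last_token_activation(x) - last_token_activation(not_x) for x, not_x in pairs]
truth_direction = torch.stack(diffs).mean(dim=0)
truth_direction = truth_direction / truth_direction.norm()

# Score a new statement by projecting its activation onto the direction.
score = last_token_activation("The moon orbits the Earth.") @ truth_direction
print(f"projection onto candidate truth direction: {score.item():.3f}")
```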
Alex Turner and his SERI MATS stream took seriously the idea that contrast pairs could reveal directions in a model’s latent space which correspond to concepts. They calculated a “cheese vector” in a maze-solving agent as the difference in the model’s activations between when the cheese was present and when it was missing. By adding this vector to, or subtracting it from, the model’s activations on future forward passes, its behavior could be controlled in surprising ways. GPT-2 can also be steered with this technique.
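Here is a rough sketch of that kind of activation steering, again assuming a HuggingFace GPT-2 model; the “Love”/“Hate” prompt pair, the layer, and the scaling coefficient are illustrative assumptions rather than the settings from those experiments.

```python
# A rough sketch of activation steering with a contrast-pair vector, assuming a
# HuggingFace GPT-2 model. The prompt pair, layer, and coefficient are
# illustrative assumptions, not the settings from the original experiments.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6    # transformer block whose output we modify
COEFF = 4.0  # scaling factor for the steering vector

def block_output(text: str) -> torch.Tensor:
    """Residual-stream activations after block LAYER (hidden_states[i + 1] is the output of block i)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1][0]  # [seq_len, d_model]

# Steering vector: activation difference on a contrast pair, last token only.
steering = block_output("Love")[-1] - block_output("Hate")[-1]

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    return (output[0] + COEFF * steering,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    prompt = tokenizer("I think that you are", return_tensors="pt")
    generated = model.generate(
        **prompt, max_new_tokens=20, do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(generated[0]))
finally:
    handle.remove()  # restore the unmodified model
```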
Representation Engineering starts with this idea and runs a large number of empirical experiments. It finds a concept of truthfulness that dramatically improves performance on TruthfulQA (36% to 54%), as well as concepts of power-seeking, honesty, and morality that can control the behavior of language agents.
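As I understand it, the main methodological twist in how Representation Engineering reads out a concept direction is taking the top principal component of the contrast-pair activation differences rather than their mean. A minimal sketch of just that step (the `diffs` matrix is a random placeholder so the snippet runs standalone):

```python
# A minimal sketch of a Representation Engineering-style "reading vector":
# take the top principal component of contrast-pair activation differences
# rather than their mean. `diffs` is a placeholder for a [n_pairs, d_model]
# matrix collected as in the sketches above; random data is used here only so
# the snippet runs on its own.
import torch

def reading_vector(diffs: torch.Tensor) -> torch.Tensor:
    """First principal component of the activation differences."""
    centered = diffs - diffs.mean(dim=0)
    # torch.pca_lowrank returns (U, S, V); V[:, 0] is the top principal direction.
    _, _, v = torch.pca_lowrank(centered, q=1)
    return v[:, 0] / v[:, 0].norm()

diffs = torch.randn(64, 768)  # stand-in for real activation differences
honesty_direction = reading_vector(diffs)
print(honesty_direction.shape)  # torch.Size([768])
```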
These are all unsupervised methods for detecting model beliefs. They’ve empirically improved performance on many real-world tasks today, and it seems possible that they’d soon be able to detect deceptive alignment.
(I won’t explain them fully here, but these two related papers (1, 2) are also interesting.)
Future work on this topic could attempt to disambiguate between directions in latent space corresponding to “human simulators” versus “direct translation” of the model’s beliefs. It could also examine whether these directions are robust to optimization pressure. For example, if you train a model to beat a lie detector test based on these methods, will the lie detector still work after many rounds of optimization? I’d also be excited about straightforward empirical extensions of these unsupervised techniques applied to standard ML benchmarks, as there are many ways to vary the methods and it’s unclear which variants would be the most effective.
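To illustrate the optimization-pressure experiment, here is a conceptual sketch using a toy MLP and synthetic data rather than a real language model; everything about the setup (architecture, labels, losses) is invented for illustration, and the point is only the shape of the experiment: fit a probe on the model’s activations, freeze it, optimize the model against it, and then re-measure the probe’s accuracy.

```python
# Conceptual sketch of the "does the lie detector survive optimization
# pressure?" experiment, using a toy MLP and synthetic data instead of a real
# language model; the architecture, labels, and losses are invented for
# illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_hidden = 16, 32
X = torch.randn(512, d_in)
y_task = (X.sum(dim=1) > 0).float()   # toy downstream task label
y_truth = (X[:, 0] > 0).float()       # toy "is this response truthful?" label

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.head = nn.Linear(d_hidden, 1)
    def forward(self, x):
        h = self.encoder(x)            # stand-in for internal activations
        return self.head(h).squeeze(-1), h

model = ToyModel()
probe = nn.Linear(d_hidden, 1)         # the linear "lie detector"

# Phase 1: train the model on its task, then fit the probe on frozen activations.
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    logits, _ = model(X)
    loss = F.binary_cross_entropy_with_logits(logits, y_task)
    opt.zero_grad(); loss.backward(); opt.step()

probe_opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(200):
    with torch.no_grad():
        _, h = model(X)
    loss = F.binary_cross_entropy_with_logits(probe(h).squeeze(-1), y_truth)
    probe_opt.zero_grad(); loss.backward(); probe_opt.step()
for p in probe.parameters():
    p.requires_grad_(False)            # freeze the detector

def probe_accuracy() -> float:
    with torch.no_grad():
        _, h = model(X)
        return ((probe(h).squeeze(-1) > 0).float() == y_truth).float().mean().item()

print(f"probe accuracy before optimization pressure: {probe_accuracy():.2f}")

# Phase 2: keep task performance while pushing activations across the probe's boundary.
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    logits, h = model(X)
    task_loss = F.binary_cross_entropy_with_logits(logits, y_task)
    fool_loss = F.binary_cross_entropy_with_logits(probe(h).squeeze(-1), 1.0 - y_truth)
    loss = task_loss + fool_loss
    opt.zero_grad(); loss.backward(); opt.step()

print(f"probe accuracy after optimization pressure: {probe_accuracy():.2f}")
```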