Collin Burns is a second-year ML PhD at Berkeley, working with Jacob Steinhardt and Dan Klein, whose focus is on making language models honest, interpretable, and aligned.

In our interview, we discuss his approach to AI Alignment research, and in particular his recent paper "Discovering Latent Knowledge in Language Models Without Supervision" and the accompanying LessWrong post.

I think this interview will be useful for people who want to hear Collin’s high-level takes on AI Alignment research or learn more about his AI Alignment agenda.

Below are some highlighted quotes from our conversation (available on YouTube, Spotify, Google Podcasts, Apple Podcasts). For the full context of each quote, see the accompanying transcript.

On Alignment Research

Towards Grounded Theoretical Work And Empirical Work Targeting Future Systems

I worry that a lot of existing theoretical alignment work, not all, but I think a lot of it, is hard to verify, or it’s making assumptions that aren’t obviously true, and I worry that it’s sort of ungrounded. And, also, in some ways, too worst case. As I may have alluded to before, I think actual deep learning systems in practice aren’t worst case systems. They generalize remarkably well in all sorts of ways. Maybe there’s important structure that ends up being useful that these models have in practice even if it’s not clear why or if they should have that in theory.
Those are some of my concerns about theory within alignment. I think empirical work within alignment, in some ways, is more grounded, and we actually have feedback loops, and I think that’s important. But, also, I think it’s easier [for empirical alignment research] to be focusing on these systems that don’t say anything about future systems really. For example, I don’t think human feedback based approaches will work for future models because there’s this important disanalogy, which is that current models are human level or less. So, human evaluators can, basically, evaluate these models, whereas that will totally break down in the future. I worry mostly about empirical approaches, or just empirical research on alignment in general. Will it say anything meaningful about future systems? And so, in some ways, I try to go for something that gets at the advantages of both. (full context)

We Should Have More People Working On New Agendas From First Principles

I do want more people to be working on completely new agendas. I should also say, I think these sorts of agendas, like debate, amplification and improving AI supervision, I think this has a reasonable chance of working. It’s more like, this has been the main focus of the alignment community, or a huge fraction of it, and there just aren’t that many agendas right now. I think that’s the sort of thing I want to push back against, is lots of people just really leaning towards these sorts of approaches when I think we’re still very pre-paradigmatic and I think we haven’t figured out what is the right approach and agenda, even at a high level. I really want more people to be thinking just from scratch or from first principles, how should we solve this problem? (full context)

Researching Unsupervised Methods Because Those Are Less Likely To Break With Future Models

We want the problems we study today to be as analogous as possible to those future systems. There are all sorts of things that we can do that make it more or less analogous or disanalogous. And so, for example, I mentioned before, I think human feedback, I think that will sort of break down once you get to superhuman models. So, that’s an important disanalogy in my mind, a particularly salient one, perhaps the most important one or sort of why alignment feels hard possibly. [...]
I think [RL From Human Feedback] would break in the sense that it wouldn’t provide useful signal to superhuman systems when human evaluators can’t evaluate those systems in complicated settings. So, I mean, part of the point of this paper or one general way I think about doing alignment research more broadly is I want to make the problem as analogous as possible. One way of doing so is to maybe try to avoid using human feedback at all. This is why the method [in the paper] is unsupervised. (full context)

On Discovering Latent Knowledge Without Supervision

Recovering The Truth From The Activations Directly

If you see a lot of correct text, you should predict that future subsequent text should also be more likely to be true. And so internally it should be useful to track ‘Is this input accurate or not?’. If so, maybe you can recover that from the activations directly. And so that’s one intuition. And often it seems like these features aren’t just represented, they’re often represented in a simple way, a linear way. (full context)
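To make the "represented in a linear way" intuition concrete, here is a toy sketch. It is not the method from the paper; it only illustrates the claim that a feature stored along a single direction can be recovered without labels. Everything here is an assumption made for illustration: the activations are synthetic, the "truth" feature is planted along a random direction, and the helper names (`make_synthetic_activations`, `top_principal_direction`, `unsupervised_accuracy`) are hypothetical.

```python
import numpy as np

def make_synthetic_activations(n=500, dim=32, seed=0):
    """Synthetic stand-in for hidden states: a 'truth' feature is planted
    along one random direction, plus isotropic Gaussian noise."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)           # 1 = true statement, 0 = false
    truth_direction = rng.normal(size=dim)
    truth_direction /= np.linalg.norm(truth_direction)
    signs = 2.0 * labels - 1.0                    # map {0, 1} to {-1, +1}
    acts = 3.0 * signs[:, None] * truth_direction[None, :]
    acts += rng.normal(size=(n, dim))
    return acts, labels

def top_principal_direction(acts):
    """Direction of largest variance, found without using any labels."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def unsupervised_accuracy(acts, labels):
    """Project onto the top direction and threshold at zero. The sign of an
    unsupervised direction is arbitrary, so accuracy is reported up to a flip.
    Labels are used only here, for evaluation."""
    direction = top_principal_direction(acts)
    scores = (acts - acts.mean(axis=0)) @ direction
    preds = (scores > 0).astype(int)
    acc = float((preds == labels).mean())
    return max(acc, 1.0 - acc)

acts, labels = make_synthetic_activations()
print(f"accuracy up to sign flip: {unsupervised_accuracy(acts, labels):.3f}")
```

In this toy setup the planted feature dominates the variance, so the top principal direction lines up with it. Real model activations offer no such guarantee, which is part of what makes the actual research problem hard.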

Why Saying The Truth Matters For Alignment

Every story of alignment that is actually scary to me involves AI systems basically deceiving humans in some way. If the model is totally honest all the time, then surely you can just ask it, “Are you going to do something dangerous or not, deliberately?” And it’s like, “Yes, I am.” Then I think we’re in okay shape and we can prevent that. And so in some ways this feels like one way of framing the core difficulty of alignment. Loosely speaking, I think this is mostly sufficient for alignment. It does depend on exactly what you mean by alignment and exactly what you mean by honesty and lying and so on. But I think that’s the overall motivation. (full context)

A Misaligned Model Would Have Activations Correlated With Lying

I think [a lying misaligned AI] would be aware of when it’s lying in some sense. What do I mean by that? I mean, specifically, there would be features in its activations that are somehow correlated with when it’s lying. Maybe it uses different circuits when it’s lying or something. Somehow, those features should be useful for predicting if it’s lying or not. [...]
It’s like ‘Am I in deceptive mode?’ And so, maybe it has a neuron or something that’s like ‘Am I in deceptive mode?’ Or maybe, it’s more complicated than that. Maybe it’s like I have this weird… if you fit this medium sized MLP, a neural net on top of the activations, you’d be able to better predict if this model is lying, whatever it is. (full context)
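The "fit an MLP on top of the activations" idea can be sketched as a small probe. This is a toy under loud assumptions: the activations are synthetic, the "lying" labels are simply given (in practice, getting such labels is the hard part), and the function names (`make_probe_data`, `train_mlp_probe`, `predict`) are hypothetical. The probe is a one-hidden-layer network trained with plain gradient descent in NumPy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_probe_data(n=600, dim=32, seed=0):
    """Synthetic activations with a planted 'deceptive mode' feature.
    The labels are an assumption: 1 = lying, 0 = honest."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    signs = 2.0 * labels - 1.0
    acts = 3.0 * signs[:, None] * direction[None, :] + rng.normal(size=(n, dim))
    return acts, labels

def train_mlp_probe(acts, labels, hidden=16, steps=500, lr=0.1, seed=0):
    """One-hidden-layer tanh MLP trained by full-batch gradient descent
    on the mean binary cross-entropy loss."""
    rng = np.random.default_rng(seed)
    n, dim = acts.shape
    w1 = rng.normal(scale=0.1, size=(dim, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(scale=0.1, size=hidden)
    b2 = 0.0
    y = labels.astype(float)
    for _ in range(steps):
        h = np.tanh(acts @ w1 + b1)               # hidden layer, shape (n, hidden)
        p = sigmoid(h @ w2 + b2)                  # predicted P(lying), shape (n,)
        grad_logit = (p - y) / n                  # d(mean BCE) / d(logit)
        grad_w2 = h.T @ grad_logit
        grad_b2 = grad_logit.sum()
        grad_h = np.outer(grad_logit, w2) * (1.0 - h ** 2)  # backprop through tanh
        grad_w1 = acts.T @ grad_h
        grad_b1 = grad_h.sum(axis=0)
        w1 -= lr * grad_w1
        b1 -= lr * grad_b1
        w2 -= lr * grad_w2
        b2 -= lr * grad_b2
    return w1, b1, w2, b2

def predict(params, acts):
    w1, b1, w2, b2 = params
    probs = sigmoid(np.tanh(acts @ w1 + b1) @ w2 + b2)
    return (probs > 0.5).astype(int)

acts, labels = make_probe_data()
train_x, train_y = acts[:400], labels[:400]       # fit the probe on 400 examples
test_x, test_y = acts[400:], labels[400:]         # evaluate on 200 held-out examples
params = train_mlp_probe(train_x, train_y)
held_out_acc = float((predict(params, test_x) == test_y).mean())
print(f"held-out probe accuracy: {held_out_acc:.3f}")
```

Note that this probe is supervised, unlike the method in the paper. It only illustrates the weaker claim in the quote: if some feature of the activations correlates with lying, a small network on top of them should be able to pick it up.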