Quick prediction so I can say “I told you so” as we all die later: I think all current attempts at mechanistic interpretability do far more for capabilities than alignment, and I am not persuaded by arguments of the form “there are far more capabilities researchers than mechanistic interpretability researchers, so we should expect MI people to have ~0 impact on the field”. Ditto for modern scalable oversight projects, and anything having to do with chain of thought.
Look at that! People have used interpretability to make a mesa layer! https://arxiv.org/pdf/2309.05858.pdf
This might do more for alignment. Better that we understand mesa-optimization and can engineer it than have it mysteriously emerge.
Good point! Overall I don’t anticipate these layers will give you much control over what the network ends up optimizing for, but I don’t fully understand them yet either, so maybe you’re right.
Do you have a specific reason to think modifying the layers will easily let you control the high-level behavior, or is it just a justified hunch?
Not in isolation, but that’s just because characterizing the ultimate goal / optimization target of a system is way too difficult for the field right now. I think the important question is whether interp brings us closer such that in conjunction with more theory and/or the ability to iterate, we can get some alignment and/or corrigibility properties.
I haven’t read the paper, and I’m not claiming that this will be counterfactual for some huge breakthrough, but understanding in-context learning algorithms definitely seems like a piece of the puzzle. To give a fanciful story from my skim: the paper says that the model constructs an internal training set. Say we have a technique to excise power-seeking behavior from models by removing the influence of certain training examples. If the model’s mesa-optimization algorithms operate differently, our technique might not work until we understand this and adapt the technique. Or we could edit the internal training set directly rather than trying to indirectly influence it.
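To gesture at what “editing the internal training set directly” could even look like, here is a purely hypothetical sketch. Nothing in it comes from the paper: the layer index, the positions, and the assumption that zeroing residual-stream activations at selected positions corresponds to deleting an internal example are all made up for illustration.

```python
# Hypothetical sketch: suppress the contribution of selected in-context
# examples by zeroing their activations at one layer, via a PyTorch forward
# hook. Whether this actually "deletes" anything from the model's internal
# training set is exactly the open question above.
import torch

def make_position_mask_hook(bad_positions):
    def hook(module, inputs, output):
        # GPT-2-style blocks return a tuple whose first element is the
        # (batch, seq_len, d_model) hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, bad_positions, :] = 0.0  # crude "remove these examples"
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Usage (hypothetical model/layer names):
# handle = model.transformer.h[6].register_forward_hook(
#     make_position_mask_hook([12, 13, 14]))
# ... run the forward pass, compare behavior, then handle.remove()
```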
Evan Hubinger: In my paper, I theorized about the mesa optimizer as a cautionary tale
Capabilities researchers: At long last, we have created the Mesa Layer from classic alignment paper Risks From Learned Optimization (Hubinger, 2019).
@TurnTrout @cfoster0 you two were skeptical. What do you make of this? They explicitly build upon the copying heads work Anthropic’s interp team has been doing.
As garrett says—not clear that this work is net negative. Skeptical that it’s strongly net negative. Haven’t read deeply, though.
Very strong upvote. This also deeply concerns me.
Would you mind chatting about why you predict this? (Perhaps over Discord DMs)
Not at all. Preferably tomorrow, though. The basic sketch, if you want to derive this yourself: mechanistic interpretability seems unlikely to mature enough as a field that I could point at particular alignment-relevant high-level structures in models which I wasn’t initially looking for. I anticipate it will only get to the point of providing some insight into why your model isn’t working correctly (not knowing why your perfectly reasonable setup isn’t working seems like a bottleneck to RL progress) so that you can fix it, but not enough insight for you to know the reflective equilibrium of values in your agent, which seems required for it to be alignment relevant. Part of this is that current MI folk don’t even seem to track this as the end-goal of what they should be working on, so (I anticipate) they’ll just be following local gradients of impressiveness, which mostly lead towards doing capabilities-relevant work.
Aren’t RL tuning problems usually caused by algorithmic mis-implementation, rather than by models learning incorrect things?
Required to be alignment relevant? Wouldn’t the insight be alignment relevant if you “just” knew what the formed values are to begin with?
I’m imagining a thing where you have little idea what’s wrong with your code, so you do MI on your model and can differentiate between these worlds (a toy sketch follows the list):
1. You’re doing literally nothing. Something’s wrong with the gradient updates.
2. You’re doing something, but not the right thing. Something’s wrong with code-section x. (With more specific knowledge about what model internals look like, this should be possible.)
3. You’re doing something, and it causes your agent to be suboptimal because of learned representation y.
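To make the debugging story concrete, here is a minimal sketch under assumed PyTorch training-loop conventions. It only separates world 1 from the other two; distinguishing worlds 2 and 3 is where the actual interpretability work would come in.

```python
# Toy sketch: if neither gradients nor parameters are moving, the bug is in
# the gradient/update wiring (world 1). Otherwise training is doing
# *something*, and you need to look inside the model to tell world 2 from 3.
import torch

def snapshot(model):
    return {name: p.detach().clone() for name, p in model.named_parameters()}

def diagnose(model, params_before, grad_eps=1e-12, delta_eps=1e-9):
    grad_total = sum(p.grad.abs().sum().item()
                     for p in model.parameters() if p.grad is not None)
    param_delta = sum((p.detach() - params_before[name]).abs().sum().item()
                      for name, p in model.named_parameters())
    if grad_total < grad_eps and param_delta < delta_eps:
        return "world 1: nothing is updating; check the loss/optimizer wiring"
    # Telling world 2 from world 3 requires inspecting the learned
    # representations themselves (probes, circuit-level analysis, etc.).
    return "worlds 2/3: updates are flowing; look at what was actually learned"

# Usage: before = snapshot(model); run a few training steps; diagnose(model, before)
```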
I don’t think this route is especially likely; the point is that I can imagine concrete & plausible ways this research can improve capabilities. There are a lot more in the wild, and many will be caught, given that capabilities are easier than alignment and there are more capabilities workers than alignment workers.
Not quite. In the ontology of shard theory, we also need to understand how our agent will do reflection, and what the activated shard distribution will be like when it starts to do reflection. Knowing the value distribution is helpful insofar as the value distribution stays constant.
More general heuristic: If you (or a loved one) are not even tracking whether your current work will solve a particular very specific & necessary alignment milestone, by default you will end up doing capabilities instead (note this is different from ‘it is sufficient to track the alignment milestone’).
Paper that uses major mechanistic interpretability work to improve capabilities of models: https://arxiv.org/pdf/2212.14052.pdf

I know of no paper which uses mechanistic interpretability work to improve the safety of models, and I expect anything people link me to will be something I don’t think will generalize to a worrying AGI.
I think a bunch of alignment value will/should come from understanding how models work internally—adjudicating between theories like “unitary mesa objectives” and “shards” and “simulators” or whatever—which lets us understand cognition better, which lets us understand both capabilities and alignment better, which indeed helps with capabilities as well as with alignment.
But we’re just going to die in alignment-hard worlds if we don’t do anything, and it seems implausible that we can solve alignment in alignment-hard worlds without understanding internals or inductive biases, relying instead on shallowly observable input/output behavior. E.g. I don’t think loss-function gymnastics will help you in those worlds. Credence: 75% that you have to know something real about how loss provides cognitive updates.
So in those worlds, it comes down to questions of “are you getting the most relevant understanding per unit time”, and not “are you possibly advancing capabilities.” And, yes, often motivated-reasoning will whisper the former when you’re really doing the latter. That doesn’t change the truth of the first sentence.
I agree with this. I think people are bad at running that calculation, and at consciously turning down status in general, so I advocate for this position because I think it’s basically true for many.
Most mechanistic interpretability is not in fact focused on the specific sub-problem you identify; it’s wandering around in a billion-parameter maze, taking note of things that look easy & interesting to understand, and telling people to work on understanding those things. I expect this to produce far more capabilities-relevant insights than alignment-relevant insights, especially when compared to worlds where Neel et al. went in with the sole goal of separating out theories of value formation, and then did nothing else.
There’s a case to be made for exploration, but the rules of the game get wonky when you’re trying to do differential technological development: there ends up being strategically relevant information that you want to not know.
I assume here you mean something like: given how most MI projects seem to be done, the most likely output of all these projects will be concrete interventions to make it easier for a model to become more capable, and these concrete interventions will have little to no effect on making it easier for us to direct a model towards having the ‘values’ we want it to have.
I agree with this claim: capabilities generalize very easily, while it seems extremely unlikely, by default, that ‘alignment generalization’ will happen in a way we intend. So the most likely outcome of more MI research does seem to be interventions that remove obstacles standing in the way of achieving AGI, while not actually making progress on ‘alignment generalization’.
Indeed, this is what I mean.