Would you mind chatting about why you predict this? (Perhaps over Discord DMs)
Not at all. Preferably tomorrow though. The basic sketch, if you want to derive this yourself, is that mechanistic interpretability seems unlikely to mature as a field to the point where I can point at particular alignment-relevant high-level structures in models that I wasn’t initially looking for. I anticipate it will only get far enough to give you some insight into why your model isn’t working correctly, enough for you to fix it (not knowing why your perfectly reasonable setup isn’t working seems like a bottleneck to RL progress), but not enough insight for you to know the reflective equilibrium of values in your agent, which seems required for the work to be alignment-relevant. Part of this is that current MI folk don’t even seem to track this as the end goal of what they should be working on, so (I anticipate) they’ll just follow local gradients of impressiveness, which mostly lead towards capabilities-relevant work.
Aren’t RL tuning problems usually caused by algorithmic mis-implementation, rather than by models learning incorrect things?
Required to be alignment relevant? Wouldn’t the insight be alignment relevant if you “just” knew what the formed values are to begin with?
I’m imagining a situation where you have little idea what’s wrong with your code, so you do MI on your model and can differentiate between the following worlds:
You’re doing literally nothing. Something’s wrong with the gradient updates.
You’re doing something, but not the right thing. Something’s wrong with code-section x. (With more specific knowledge about what model internals look like, this should be possible.)
You’re doing something, but it makes your agent suboptimal because of learned representation y.
I don’t think this route is especially likely; the point is that I can imagine concrete & plausible ways this research can improve capabilities. There are many more such routes in the wild, and many will be found, given that capabilities work is easier than alignment work and there are more capabilities workers than alignment workers.
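To make the three-way triage above a bit more concrete, here is a minimal Python sketch of what such a diagnostic could look like. Everything in it is an assumption made for illustration: the function name `triage`, the idea of comparing two flattened weight snapshots, the two probe accuracies standing in for "what the model internals look like", and both thresholds. Real MI tooling would be far richer than this; the sketch only shows the decision logic.

```python
from math import sqrt

# Hypothetical labels for the three worlds, plus a fallback.
NO_LEARNING = "world 1: nothing is learned; suspect the gradient updates"
WRONG_THING = "world 2: something is learned, but not the right thing; suspect the code"
BAD_REPRESENTATION = "world 3: a learned representation is making the agent suboptimal"
INCONCLUSIVE = "inconclusive: none of the three checks fired"


def l2_distance(a, b):
    """Euclidean distance between two equal-length parameter vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triage(params_before, params_after, task_probe_acc, suspect_probe_acc,
           update_tol=1e-8, probe_threshold=0.8):
    """Guess which world a failing training run is in.

    All inputs are assumed measurements, not outputs of any real library:
      params_before, params_after: flattened weights from two checkpoints.
      task_probe_acc: accuracy of a probe for a feature the task needs.
      suspect_probe_acc: accuracy of a probe for a representation "y"
        that you suspect is hurting the agent.
    The thresholds are arbitrary placeholders.
    """
    # World 1: the weights did not move, so updates are not being applied.
    if l2_distance(params_before, params_after) <= update_tol:
        return NO_LEARNING
    # World 2: the weights move, but the internals lack what the task needs.
    if task_probe_acc < probe_threshold:
        return WRONG_THING
    # World 3: the needed feature is present, and so is the suspect one.
    if suspect_probe_acc >= probe_threshold:
        return BAD_REPRESENTATION
    return INCONCLUSIVE


if __name__ == "__main__":
    # Weights changed, but the task-relevant feature is not decodable.
    print(triage([0.0, 1.0], [0.1, 0.9], task_probe_acc=0.5, suspect_probe_acc=0.5))
```

The ordering of the checks is a design choice: the cheapest and least ambiguous test (did the weights move at all?) comes first, and the representation-level claim, which needs the most interpretability insight, comes last.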
Not quite. In the ontology of shard theory, we also need to understand how our agent will do reflection, and what the activated shard distribution will be like when it starts to do reflection. Knowing the value distribution is helpful insofar as the value distribution stays constant.
More general heuristic: If you (or a loved one) are not even tracking whether your current work will solve a particular very specific & necessary alignment milestone, by default you will end up doing capabilities instead (note this is different from ‘it is sufficient to track the alignment milestone’).