Aren’t RL tuning problems usually caused by algorithmic mis-implementation, rather than by models learning incorrect things?
I’m imagining a situation where you have little idea what’s wrong with your code, so you do MI on your model to differentiate between the following worlds (a minimal diagnostic sketch follows the list):
1. You’re doing literally nothing: something’s wrong with the gradient updates.
2. You’re doing something, but not the right thing: something’s wrong with code section x. (With more specific knowledge of what model internals should look like, this distinction should be possible.)
3. You’re doing something, but it makes your agent suboptimal because of learned representation y.
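To make the first distinction concrete, here is a minimal sketch of the kind of crude check that separates world (1) from worlds (2) and (3). It assumes a PyTorch setup; `model`, `train_step`, and `batch` are hypothetical names, and distinguishing (2) from (3) would still require real interpretability work on what the updated weights encode.

```python
import torch

def check_world_one(model, train_step, batch, atol=1e-8):
    """Crude check for world (1): did a single training step move any parameters?"""
    # Snapshot parameters before one update.
    before = {name: p.detach().clone() for name, p in model.named_parameters()}

    # Hypothetical helper: runs forward pass, backward pass, and optimizer step.
    train_step(model, batch)

    # Any parameter tensor that changed beyond numerical noise counts as "doing something".
    moved = [name for name, p in model.named_parameters()
             if not torch.allclose(before[name], p.detach(), atol=atol)]

    if not moved:
        print("World (1): no parameters changed -- check the loss/optimizer wiring.")
    else:
        print(f"{len(moved)} parameter tensors changed; separating worlds (2) and (3)")
        print("needs interpretability on what the learned representations actually encode.")
    return moved
```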
I don’t think this route is especially likely; the point is that I can imagine concrete & plausible ways this research could improve capabilities. There are a lot more such routes in the wild, and many will be caught, given that capabilities work is easier than alignment work and there are more capabilities workers than alignment workers.
Wouldn’t the insight be alignment-relevant if you “just” knew what the formed values are to begin with?
Not quite. In the ontology of shard theory, we also need to understand how our agent will do reflection, and what the activated shard distribution will look like when it starts reflecting. Knowing the value distribution is helpful only insofar as that distribution stays constant.