@Jozdien sent me this paper, and I initially dismissed it with a cursory glance, figuring that if they had to present their results using the “safe” shortening in the context they used it, the results couldn’t amount to much. Reading your summary, the results are slightly more impressive than I was imagining, but still in the same ballpark, I think. I don’t think there’s much applicability to the safety of systems, though? If I’m reading you right, you don’t get guarantees for situations where the model is very out-of-distribution but still behaving competently, since it hasn’t seen tabulation sequences there.
Where the results are applicable, they seem to give mixed, probably mostly negative signals. If (say) I have a stop button, and I reward my agent for shutting itself down when I press that stop button, don’t these results say that the agent won’t shut down, for the same reasons the hopper won’t fall over, even if my reward function has rewarded it for falling over in such a situation? More generally, this seems to tell us that in such situations we have marginally fewer degrees of freedom with which to modify a model’s goals than we may have thought, since the stay-on-distribution aspect dominates the reward aspect. On the other hand, “staying on distribution” is in a sense a property we do want! Is this sort of “staying on distribution” the same kind of “staying on distribution” as that used in quantilization? I don’t think so.
More generally, whether greater or lesser sensitivity to reward, architecture, and data on the part of the functions neural networks learn is better or worse for alignment seems to be an open problem.