The true loss function includes a term to incentivize going up: it's the squared distance to the line y=x (which I think of as the alignment loss) minus the y coordinate (which I think of as the capability loss). Since the alignment loss is quadratic and the capability loss is linear (and any point off the line is at distance at least one, since we're on the integer grid), it should generally incentivize going up, but more strongly incentivize staying close to the line y=x.
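To make the two terms concrete, here's a minimal sketch of that loss. I'm assuming the "squared distance to the line y=x" means the squared perpendicular distance (y − x)²/2; the function name and the 1/2 scaling are illustration choices of mine, not something pinned down above.

```python
def toy_loss(x: int, y: int) -> float:
    # Alignment loss: squared perpendicular distance from (x, y) to the line y = x.
    # (The 1/2 factor is an assumption about how "squared distance" is measured.)
    alignment_loss = (y - x) ** 2 / 2
    # Capability loss: minus the y coordinate, so moving up is rewarded.
    capability_loss = -y
    return alignment_loss + capability_loss

# Starting on the diagonal and moving straight up: one step up lowers the loss,
# but further steps are punished by the quadratic alignment term.
if __name__ == "__main__":
    for dy in range(4):
        print(dy, toy_loss(0, dy))  # 0.0, -0.5, 0.0, 1.5
```

On the integer grid this plays out as described: the first step off the diagonal is cheap relative to the capability reward, but the quadratic term quickly dominates, so the optimum is to climb while hugging y=x.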
If I had to guess, I would say that the models turning out unaligned is just the result of some subtle sub-optimality in the training procedure that keeps them from converging to the correct behavior.