In case it is not clear: My expectation is that sufficiently large advances in capabilities/intelligence/affordances inherently break our desired alignment properties under all known techniques.
Nearly every piece of empirical evidence I’ve seen contradicts this: more capable systems are generally easier to work with in almost every way, and the techniques that worked on less capable versions apply straightforwardly and in fact usually work better than they did on less intelligent systems.
Presumably you agree this would become false if the system were deceptively aligned or otherwise scheming against us? Perhaps the implicit claim is that we should generalize from current evidence toward thinking deceptive alignment is very unlikely?
I also think it’s straightforward to construct cases where Goodharting means that applying the technique you used for a less capable model to a more capable model results in worse performance from the more capable model. The scaling laws for reward model overoptimization should make it easy to construct such a case.
(That said, I think if you vary the point of early stopping as models get more capable, then you likely get strict performance improvements on most tasks. But regardless, there is a pretty reasonable technique, “train for duration X”, which clearly gets worse performance in realistic cases as you move toward more capable systems; see the toy sketch below.)
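To make the fixed-duration Goodharting point concrete, here is a toy sketch (not anyone’s actual experiment). It assumes the RL functional form reported in the reward-model overoptimization scaling laws, where gold reward behaves roughly like R_gold(d) = d·(α − β·log d) with d the square root of the KL divergence from the initial policy, plus the additional assumption that a more capable policy moves further in KL for the same training duration X. The coefficients and KL values are made up purely for illustration.

```python
import numpy as np

# Toy sketch of Goodharting under a fixed training duration.
# Assumes the RL functional form from the reward-model overoptimization
# scaling laws, R_gold(d) = d * (alpha - beta * log d), where d = sqrt(KL)
# from the initial policy. alpha, beta, and the d values below are made up.
alpha, beta = 1.0, 0.3

def gold_reward(d):
    """Gold (true) reward as a function of optimization distance d."""
    return d * (alpha - beta * np.log(d))

# Assumption: for the same training duration X, a more capable policy
# optimizes the proxy reward faster, so it ends up at a larger d.
d_after_X_weak_model = 2.0     # hypothetical: weaker model after duration X
d_after_X_strong_model = 30.0  # hypothetical: stronger model after duration X

print(gold_reward(d_after_X_weak_model))    # ~1.58: still before the peak
print(gold_reward(d_after_X_strong_model))  # ~-0.61: past the peak, worse than the weaker model
```

The same curve also illustrates the first half of the parenthetical above: if you instead vary the stopping point with capability (stopping near the peak of the gold-reward curve for each model), the degradation goes away.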