ML models in the current paradigm do not seem to behave coherently OOD, but I’d bet that for nearly any paired metric of “overall capability” and alignment, the capability metric decays faster than the alignment metric as we go further OOD.
See https://arxiv.org/abs/2310.00873 for an example of the kind of thing you’d expect to see when taking a neural network OOD. The model doesn’t do some insane path-dependent thing; it collapses to entropy. You end up seeing a max-entropy distribution over outputs, not goals. This is a good example of the kind of thing that’s obvious to people who’ve done real work with ML, but very counter to classic LessWrong intuitions, and it isn’t learnable by implementing mingpt.
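To make the claim concrete, here’s a toy sketch of how you’d probe it yourself: train a small classifier on synthetic in-distribution data, then compare the mean predictive entropy of its outputs on held-out in-distribution inputs versus inputs pushed progressively further OOD. The architecture, synthetic data, and noise-scale OOD shifts below are my own illustrative assumptions (not from the linked paper), and whether the entropy actually climbs toward the maximum depends a lot on the architecture and the kind of shift; the point is just what “collapse toward a max-entropy distribution over outputs” would look like as a measurement.

```python
# Toy probe: does predictive entropy drift toward the maximum as inputs go OOD?
# Everything here (data, architecture, OOD scales) is an illustrative assumption.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
NUM_CLASSES, DIM = 4, 16
CENTERS = 3.0 * torch.randn(NUM_CLASSES, DIM)  # one fixed cluster center per class

def make_data(n_per_class):
    """In-distribution data: unit-variance Gaussian cluster around each center."""
    xs = torch.cat([CENTERS[c] + torch.randn(n_per_class, DIM) for c in range(NUM_CLASSES)])
    ys = torch.repeat_interleave(torch.arange(NUM_CLASSES), n_per_class)
    return xs, ys

model = nn.Sequential(nn.Linear(DIM, 64), nn.ReLU(), nn.Linear(64, NUM_CLASSES))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

train_x, train_y = make_data(500)
for _ in range(200):  # short training run; enough to fit the toy clusters
    opt.zero_grad()
    F.cross_entropy(model(train_x), train_y).backward()
    opt.step()

def mean_predictive_entropy(x):
    """Mean Shannon entropy (nats) of the model's softmax outputs on x."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)
    return float(-(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean())

test_x, _ = make_data(200)
print(f"max possible entropy: {math.log(NUM_CLASSES):.3f} nats")
print(f"in-distribution:      {mean_predictive_entropy(test_x):.3f} nats")
for scale in (5.0, 20.0, 100.0):  # progressively further OOD: isotropic noise at growing scale
    ood_x = scale * torch.randn(200, DIM)
    print(f"OOD (noise x{scale:>5}): {mean_predictive_entropy(ood_x):.3f} nats")
```

The interesting comparison is the OOD rows against the log(num_classes) ceiling: an output distribution drifting toward that ceiling is “collapse to entropy,” not coherent pursuit of some goal.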
<snark> Your models of intelligent systems collapse to entropy at OOD intelligence levels. </snark>