Currently, we do not know how to make sure machine learning generalises well out of sample. This is an open problem that is critical to alignment. I find that it’s left out of evals frustratingly often, probably because it’s hard, and most methods miserably fail to generalise OOD.
For example, you don’t want your ASI to become unaligned, undergo value drift, or extrapolate human values poorly when 1) it meets aliens, 2) 1,000 years pass, or 3) cultural drift happens. What if your descendants think it’s admirable and funny to take hostages as a form of artistic practical joke? You would hope that your AIs would handle that in a principled and adaptable manner. At the very least, you want its capability to fail before its morality.
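To be concrete about what "left out of evals" means, here is a minimal toy sketch of an eval that includes an OOD split: train on one distribution, then report scores on both an in-distribution test split and a shifted split, plus the gap. The data, the shift, and scikit-learn/logistic regression are all my own illustrative assumptions, not any real benchmark.

```python
# Toy sketch (illustrative only): report in-distribution vs OOD performance in an eval.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, mean):
    """Two Gaussian classes in 2D; `mean` controls where both classes sit."""
    x0 = rng.normal(loc=mean, scale=1.0, size=(n, 2))        # class 0
    x1 = rng.normal(loc=mean + 2.0, scale=1.0, size=(n, 2))  # class 1
    return np.vstack([x0, x1]), np.array([0] * n + [1] * n)

# Train and IID test come from the same distribution; the OOD split is shifted.
X_train, y_train = make_data(500, mean=0.0)
X_iid, y_iid = make_data(500, mean=0.0)
X_ood, y_ood = make_data(500, mean=3.0)  # the whole distribution moves at eval time

model = LogisticRegression().fit(X_train, y_train)
iid_acc = model.score(X_iid, y_iid)
ood_acc = model.score(X_ood, y_ood)
# The boundary fit on the training distribution no longer separates the shifted
# classes, so OOD accuracy typically collapses toward chance while IID stays high.
print(f"IID acc: {iid_acc:.2f}  OOD acc: {ood_acc:.2f}  gap: {iid_acc - ood_acc:.2f}")
```

The point is only that the OOD column exists in the report at all; most evals stop at the IID number.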
I would add: alignment must also generalise better than capabilities!
- out of distribution
- to smarter models

(A rough sketch of what checking the first of these could look like is below.)
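Here is a hedged toy sketch of one way to compare how a capability metric and an alignment metric survive the same distribution shift. The `SplitScores` class, the retention ratio, and all the numbers are placeholders of my own, not an established protocol; in practice the scores would come from running the model on in-distribution and shifted eval sets (GENIES-style source/target splits would be one source).

```python
# Hedged sketch (my framing, placeholder numbers): does alignment degrade
# less than capability under the same distribution shift?
from dataclasses import dataclass

@dataclass
class SplitScores:
    capability: float  # e.g. task accuracy on this split
    alignment: float   # e.g. rate of preferring the intended behaviour

def retention(iid: float, ood: float) -> float:
    """Fraction of the in-distribution score retained under shift."""
    return ood / iid if iid > 0 else float("nan")

iid = SplitScores(capability=0.90, alignment=0.95)  # placeholder numbers
ood = SplitScores(capability=0.70, alignment=0.60)  # placeholder numbers

cap_ret = retention(iid.capability, ood.capability)
align_ret = retention(iid.alignment, ood.alignment)

print(f"capability retained: {cap_ret:.0%}, alignment retained: {align_ret:.0%}")
if align_ret < cap_ret:
    # The failure mode above: the model stays capable while its alignment
    # degrades, i.e. capability does NOT fail before morality.
    print("WARNING: alignment generalises worse than capability under this shift")
```

The only point of the ratio is that "capability should fail before morality" becomes something you can measure per shift rather than assert.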
An overlooked benchmark of OOD generalisation for language models: GENIES.
OOD generalisation is also flagged as an open problem in “Towards out of distribution generalization for problems in mechanics”.
To the people disagreeing: which part do you disagree with? My main point, my example, or something else?