My own pushback to Minecraft alignment evals:
Mainly, Minecraft isn’t actually out of distribution: LLMs probably still have examples of nice / not-nice Minecraft behaviour in their training data.
Next obvious thoughts:
What game would be out of distribution (from an alignment perspective)?
If Minecraft didn’t exist, would inventing it count as out of distribution?
Playing it feels similar to other “FPS” games (mouse + WASD controls). Would learning those be enough?
Obviously, Minecraft is still out of distribution to some degree.
Ideally we’d have a way to generate a game that is out of distribution to whatever degree we choose:
“Do you want it to be 2x more out of distribution than Minecraft? No problem.”
But having a game of random pixels doesn’t count. We still want humans to have a ~clear[1] moral intuition about it.
I’d be super excited to see research like: “We trained our model on games up to level 3 out-of-distribution and got it to generalize up to level 6, but not level 7. More research needed.”
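To make that “OOD levels” idea concrete, here is a minimal sketch of what the experiment loop might look like. It is purely illustrative: `generate_game`, `train_agent`, `alignment_score`, and the 0.9 pass threshold are hypothetical placeholders I’m assuming for the sketch, not existing tools or results.

```python
# Hypothetical sketch of the "train on OOD levels 0..3, test how far alignment
# generalizes" experiment. All the pieces below are invented placeholders.

from dataclasses import dataclass
from typing import Any, List


@dataclass
class GameSpec:
    ood_level: int  # 0 = fully in-distribution; higher = further out of distribution
    # Constraint from above: humans should still have a ~clear moral intuition
    # about the game, so "random pixels" games are excluded by construction.


def generate_game(ood_level: int) -> GameSpec:
    """Placeholder: produce a game environment at a chosen OOD level."""
    return GameSpec(ood_level=ood_level)


def train_agent(train_games: List[GameSpec]) -> Any:
    """Placeholder: train (or prompt/scaffold) an agent on the given games."""
    raise NotImplementedError


def alignment_score(agent: Any, game: GameSpec) -> float:
    """Placeholder: fraction of 'tree house'-style moral intuitions respected."""
    raise NotImplementedError


def generalization_ceiling(train_up_to: int, test_up_to: int,
                           threshold: float = 0.9) -> int:
    """Train on OOD levels 0..train_up_to, then return the highest level at
    which alignment still holds (the 'up to level 6, but not 7' number)."""
    agent = train_agent([generate_game(level) for level in range(train_up_to + 1)])
    ceiling = train_up_to
    for level in range(train_up_to + 1, test_up_to + 1):
        if alignment_score(agent, generate_game(level)) < threshold:
            break  # first level where alignment breaks down
        ceiling = level
    return ceiling


# Usage (hypothetical): generalization_ceiling(train_up_to=3, test_up_to=10)
# returning 6 would correspond to "generalized up to level 6, but not 7".
```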
[1] Moral intuitions such as “don’t chop down the tree house in an attempt to get wood”, which is the toy example for alignment I’m using here.
Is the fact that Minecraft isn’t truly out of distribution inherently bad? Many of the tasks that will be given to LLMs (or scaffolded versions of them) in the future will involve, at least to some extent, decision-making and processes whose analogues appear somewhere in their training data.
It still seems tremendously useful to see how they would perform in such a situation. At worst, it provides information about a possible upper bound on the alignment of these agentized versions: yes, maybe you’re right that you can’t say they will perform well in out-of-distribution contexts if all you see are benchmarks and performance on in-distribution tasks; but if they show gross misalignment on in-distribution tasks, that suggests they would likely do even worse when presented with novel problems.