I think this is an interesting line of thinking. My main concern is whether the alignment tax might be too high for some use cases. I think a test case that it might do well for would be the NetHack challenge https://nethackchallenge.com/
I think that that’s an interesting challenge because current SoTA is far from ceiling effects on it, and that imitating human games as a starting point seems intuitively like a good approach. To study the question of: how likely is the model to go problematic directions once off-distribution, you could modify the NetHack challenge environment to add some possible exploits which aren’t in the real game, and see how likely it is for the model to find and use those exploits.
I think this is an interesting line of thinking. My main concern is whether the alignment tax might be too high for some use cases. I think a test case that it might do well for would be the NetHack challenge https://nethackchallenge.com/
I think that that’s an interesting challenge because current SoTA is far from ceiling effects on it, and that imitating human games as a starting point seems intuitively like a good approach. To study the question of: how likely is the model to go problematic directions once off-distribution, you could modify the NetHack challenge environment to add some possible exploits which aren’t in the real game, and see how likely it is for the model to find and use those exploits.