What I mean is more like generalization to new activities for humans to do in Minecraft that humans would find fun, which would be a different kind of ‘better at Minecraft.’
Oh, I hope not to go there. I’d count that as cheating. For example, if the agent designed a role-playing game with riddles and adventures, that would show something different from what I’m trying to test. [Maybe I can try to formalize this better. Or maybe I’m wrong here]
I mean it in a way where the preferences are modeled a little better than just “the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.”
Absolutely. That’s something that I hope we’ll have some alignment technique to solve, and maybe this environment could test.