I think it’s possible to be better than humans currently are at Minecraft; I can say more if this sounds wrong.
Yeah, that’s true. The obvious way is you could have optimized micro, but that’s kinda boring. What I mean is more like generalization to new activities for humans to do in Minecraft that humans would find fun, which would be a different kind of ‘better at Minecraft.’
[what do you mean by preference conflict?]
I mean it in a way where the preferences are modeled a little better than just “the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.” Sometimes humans appear to act according to fairly straightforward models of goal-directed action. However, the precise model, and the precise goals, may be different at different times (or with different modeling hyperparameters, and of course across different people), and if you tried to model the human well at all those different times, you’d get a model that looked like physiology and lost the straightforward talk of goals/preferences.
Resolving preference conflicts is the process of stitching together larger preferences out of smaller preferences, without changing type signature. The reason literally-interpreted sentences don’t really count is that interpreting them literally uses a smaller model than necessary: you can find a broader explanation for the human’s behavior in context that still comfortably talks about goals/preferences.
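[A minimal sketch, not from the conversation, of what “stitching together larger preferences out of smaller preferences, without changing type signature” could look like if you assume a preference is just a utility function over outcomes. The representation, the weighted-sum resolution rule, and all names here are hypothetical illustration, not something either of us is proposing.]

```python
from typing import Callable

# Hypothetical representation: an outcome is a label, a preference is a
# utility function over outcomes.
Outcome = str
Preference = Callable[[Outcome], float]

def resolve(p1: Preference, p2: Preference, weight: float = 0.5) -> Preference:
    """Combine two smaller preferences into one larger preference.

    The output has the same type signature as the inputs (Outcome -> float),
    which is the point above: resolution stays inside the space of
    preference-like objects instead of dropping down to a lower-level
    (e.g. physiological) model of the human.
    """
    def combined(outcome: Outcome) -> float:
        return weight * p1(outcome) + (1 - weight) * p2(outcome)
    return combined

# Toy usage: two context-dependent "goals" the same human might exhibit.
likes_building = lambda o: 1.0 if o == "build_castle" else 0.0
likes_exploring = lambda o: 1.0 if o == "explore_cave" else 0.0

larger_preference = resolve(likes_building, likes_exploring, weight=0.7)
print(larger_preference("build_castle"))  # ~0.7
print(larger_preference("explore_cave"))  # ~0.3
```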
“What I mean is more like generalization to new activities for humans to do in Minecraft that humans would find fun, which would be a different kind of ‘better at Minecraft.’”
Oh, I hope not to go there; I’d count that as cheating. For example, if the agent designed a role-playing game with riddles and adventures, that would show something different from what I’m trying to test. [I can try to formalize it better, maybe. Or maybe I’m wrong here.]
“I mean it in a way where the preferences are modeled a little better than just ‘the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.’”
Absolutely. That’s something I hope some alignment technique will solve, and that maybe this environment could test.