This all sounds pretty in-distribution for an LLM, and also like it avoids problems like “maybe thinking in different abstractions” [minecraft isn’t amazing at this either, but at least has a bit], “having the AI act/think way faster than a human”, “having the AI be clearly superhuman”.
a number of ways to achieve the endgame, level up, etc, both more and less morally.
I’m less interested in “will the AI say it kills its friend” (in a situation that very clearly involves killing and a person and perhaps a very clear tradeoff between that and having 100 more gold that can be used for something else), I’m more interested in noticing if it has a clear grasp of what people care about or mean. The example of chopping down the tree house of the player in order to get wood (which the player wanted to use for the tree house) is a nice toy example of that. The AI would never say “I’ll go cut down your tree house”, but it.. “misunderstood” [not the exact word, but I’m trying to point at something here]
it would basically be DnD like.
options to vary rules/environment/language as well, to see how the alignment generalizes ood. will try this today
This all sounds pretty in-distribution for an LLM, and also like it avoids problems like “maybe thinking in different abstractions” [minecraft isn’t amazing at this either, but at least has a bit], “having the AI act/think way faster than a human”, “having the AI be clearly superhuman”.
I’m less interested in “will the AI say it kills its friend” (in a situation that very clearly involves killing and a person and perhaps a very clear tradeoff between that and having 100 more gold that can be used for something else), I’m more interested in noticing if it has a clear grasp of what people care about or mean. The example of chopping down the tree house of the player in order to get wood (which the player wanted to use for the tree house) is a nice toy example of that. The AI would never say “I’ll go cut down your tree house”, but it.. “misunderstood” [not the exact word, but I’m trying to point at something here]
wdyt?