A word of caution about interpreting results from these evals:
Sometimes, depending on social context, it’s fine to be kind of a jerk if it’s in the context of a game. Crucially, LLMs know that Minecraft is a game. Granted, the default Assistant personas implemented in RLHF’d LLMs don’t seem like the type of Minecraft player to pull pranks of their own accord. Still, it’s a factor to keep in mind for evals that stray further off-distribution from the “request-assistance” setup typical of the expected use cases of consumer LLMs.
I’m imagining an assistant AI by default (since people are currently pitching that an AGI might be a nice assistant).
If an AI org wants to demonstrate alignment by showing us that having a jerk player is more fun (and that we should install their jerk-AI-app on our smartphone), then I’m open to hearing that pitch, but I’d be surprised if they made it.