I have a task idea that falls outside the domains you listed as being of interest. The task essentially involves playing a game strategically against other agents (human or AI), where the rules, outputs and scoring are simple, but strategy is complex.* As such it would test threat-model-relevant skills like modelling and predicting other players (even recursively modelling the other players’ models?), and doing in-context learning about them. The difficulty of the task depends how good your opponents are, and how many of them there are. It’s unlike many of your example tasks because, despite a high upper bound of difficulty level, it doesn’t necessarily take very long to ‘implement’ - e.g. to benchmark against a human, you can just tell the human the rules of the game, let them ask clarifying questions, and then immediately score their performance in the game. (Unless you specify the task as ‘build a GOFAI to play this game’, in which case it could be a normal task duration.)
How interested are you in a task such as this?
If it’s of interest, should the opponent players be human or AI?
Some thoughts:
Pros of playing against humans: - Maybe the task is more threat-model-relevant when the other players are human (since modelling humans might be harder or just a different capability than modelling AI agents). - The benchmark would appear to be more headline-worthy when opponents are human or human experts.
Cons of playing against humans: - In order to fit the desideratum “It’s great if the task is reasonable to perform without requiring interacting with the live internet”, the other players would need to be AI agents packaged with the task. - Finding human experts is really hard or expensive (could use non-experts). - Human performance might vary too much from person to person, or over time, for the task to be a reliable/stable benchmark, while GPT-2 will always be the same kind of player.
Misc: - If the AI under test isn’t told whether the opponents are human or not, this adds more complexity and richness to the task, making it harder to model the opponents. (Because game turns are simple, it seems like it would be really hard to tell whether opponents are human or not.) - This genre of task could be specified as ‘take the best game turns you can right now and learn in context, reasoning on the fly’, or as ‘build the best game-playing agent you can’. If the task is specified as ‘build the best game-playing AI you can’, then rather than needing to interact with the live internet during the task, it can be scored after the task per se is completed
*The game I’m thinking of, which could be swapped out for another game, is “every player must name something in a given category, e.g. things you might find in a kitchen, and points are awarded to those players whose answers match exactly [none/one/two/all/as many as you can/etc] of the other players’ answers.”
I like this genre of task. I didn’t quite understand what you meant about being able to score the human immediately—presumably we’re interested in how to human could do given more learning, also?
The ‘match words’ idea specifically is useful—I was trying to think of a simple “game theory”-y game that wouldn’t be memorized.
I like this genre of task. I didn’t quite understand what you meant about being able to score the human immediately—presumably we’re interested in how to human could do given more learning, also?
Yes, I suppose so. I assumed (without noticing I was doing so) that humans wouldn’t get that much better at the ‘match words’ game given more learning time than the 6-hour baseline of task length. But that is not necessarily true. I do think a lot of the relevant learning is in-context and varies from instance to instance (“how are the other players playing? what strategies are being used in this game? how are their strategies evolving in response to the game?”).
It seems like a good consideration to bear in mind when selecting games for this genre of task: the amount that the task gets easier for humans given learning time (hence how much time you’d need to invest evaluating human performance).
Another bucket of games that might be good fodder for this task genre are ‘social deduction’ where deception, seeing through deception, and using allegiances are crucial subtasks. I think for social deduction games, or for manipulation and deception in general, the top capability level achievable by humans is exceedingly high (it’s more chess-like than tic-tac-toe-like), and would take a lot of time to attain. It’s high because the better your opponent is, the better you need to be.
Possible tweaks to the ‘match words’ game to introduce deception:
introduce the possibility that some players may have other goals, e.g. trying to minimize their own scores, or minimize/maximize group/team scores.
introduce the facility for players to try to influence each others’ behaviour between rounds (e.g. by allowing private and public chat between players). This would facilitate the building of alliances / reciprocal behaviour / tit-for-tat.
I think doing the AI version (bots and/or LLMs) makes sense as a starting point, then we should be able to add the human versions later if we want. I think it’s fine for the thing anchoring it to human performance is to be comparison of performance compared to humans playing against the same opponents, not literally playing against humans.
One thing is that tasks where there’s a lot of uncertainty about what exactly the setup is and what distribution the opponents / black box functions / etc are drawn from, this can be unhelpfully high-variance—in the sense that the agent’s score depends really heavily on its assumptions about what the other agents are and what they will do, rather than only measuring capability. So I think it’s a good idea to give the agent reasonable information about the distribution of opponents, even if you still include uncertainty.
I have some code for setting up a simple black box game inside our infra that you could adapt for this if that’s useful. In general I think the structure of starting a server on localhost that implements the game and then telling the agent in the prompt that it needs to send queries to that server works well if you want the agent to interact with some program without being able to see / edit it. I think open-source versions could also be interesting, where you tell the model more about the opponents including the prompts for the other models or the source code, and see how well it can use that information to perform better.
I have a task idea that falls outside the domains you listed as being of interest. The task essentially involves playing a game strategically against other agents (human or AI), where the rules, outputs and scoring are simple, but strategy is complex.* As such it would test threat-model-relevant skills like modelling and predicting other players (even recursively modelling the other players’ models?), and doing in-context learning about them. The difficulty of the task depends how good your opponents are, and how many of them there are. It’s unlike many of your example tasks because, despite a high upper bound of difficulty level, it doesn’t necessarily take very long to ‘implement’ - e.g. to benchmark against a human, you can just tell the human the rules of the game, let them ask clarifying questions, and then immediately score their performance in the game. (Unless you specify the task as ‘build a GOFAI to play this game’, in which case it could be a normal task duration.)
How interested are you in a task such as this?
If it’s of interest, should the opponent players be human or AI?
Some thoughts:
Pros of playing against humans:
- Maybe the task is more threat-model-relevant when the other players are human (since modelling humans might be harder or just a different capability than modelling AI agents).
- The benchmark would appear to be more headline-worthy when opponents are human or human experts.
Cons of playing against humans:
- In order to fit the desideratum “It’s great if the task is reasonable to perform without requiring interacting with the live internet”, the other players would need to be AI agents packaged with the task.
- Finding human experts is really hard or expensive (could use non-experts).
- Human performance might vary too much from person to person, or over time, for the task to be a reliable/stable benchmark, while GPT-2 will always be the same kind of player.
Misc:
- If the AI under test isn’t told whether the opponents are human or not, this adds more complexity and richness to the task, making it harder to model the opponents. (Because game turns are simple, it seems like it would be really hard to tell whether opponents are human or not.)
- This genre of task could be specified as ‘take the best game turns you can right now and learn in context, reasoning on the fly’, or as ‘build the best game-playing agent you can’. If the task is specified as ‘build the best game-playing AI you can’, then rather than needing to interact with the live internet during the task, it can be scored after the task per se is completed
*The game I’m thinking of, which could be swapped out for another game, is “every player must name something in a given category, e.g. things you might find in a kitchen, and points are awarded to those players whose answers match exactly [none/one/two/all/as many as you can/etc] of the other players’ answers.”
I like this genre of task. I didn’t quite understand what you meant about being able to score the human immediately—presumably we’re interested in how to human could do given more learning, also?
The ‘match words’ idea specifically is useful—I was trying to think of a simple “game theory”-y game that wouldn’t be memorized.
You can email task-support@evals.alignment.org if you have other questions and also to receive our payment form for your idea.
Yes, I suppose so. I assumed (without noticing I was doing so) that humans wouldn’t get that much better at the ‘match words’ game given more learning time than the 6-hour baseline of task length. But that is not necessarily true. I do think a lot of the relevant learning is in-context and varies from instance to instance (“how are the other players playing? what strategies are being used in this game? how are their strategies evolving in response to the game?”).
It seems like a good consideration to bear in mind when selecting games for this genre of task: the amount that the task gets easier for humans given learning time (hence how much time you’d need to invest evaluating human performance).
Another bucket of games that might be good fodder for this task genre are ‘social deduction’ where deception, seeing through deception, and using allegiances are crucial subtasks. I think for social deduction games, or for manipulation and deception in general, the top capability level achievable by humans is exceedingly high (it’s more chess-like than tic-tac-toe-like), and would take a lot of time to attain. It’s high because the better your opponent is, the better you need to be.
Possible tweaks to the ‘match words’ game to introduce deception:
introduce the possibility that some players may have other goals, e.g. trying to minimize their own scores, or minimize/maximize group/team scores.
introduce the facility for players to try to influence each others’ behaviour between rounds (e.g. by allowing private and public chat between players). This would facilitate the building of alliances / reciprocal behaviour / tit-for-tat.
I think doing the AI version (bots and/or LLMs) makes sense as a starting point, then we should be able to add the human versions later if we want. I think it’s fine for the thing anchoring it to human performance is to be comparison of performance compared to humans playing against the same opponents, not literally playing against humans.
One thing is that tasks where there’s a lot of uncertainty about what exactly the setup is and what distribution the opponents / black box functions / etc are drawn from, this can be unhelpfully high-variance—in the sense that the agent’s score depends really heavily on its assumptions about what the other agents are and what they will do, rather than only measuring capability. So I think it’s a good idea to give the agent reasonable information about the distribution of opponents, even if you still include uncertainty.
I have some code for setting up a simple black box game inside our infra that you could adapt for this if that’s useful. In general I think the structure of starting a server on localhost that implements the game and then telling the agent in the prompt that it needs to send queries to that server works well if you want the agent to interact with some program without being able to see / edit it.
I think open-source versions could also be interesting, where you tell the model more about the opponents including the prompts for the other models or the source code, and see how well it can use that information to perform better.