Methodologically, I think it would make more sense to frame it in terms of action granularity ratio, rather than using units like seconds or %s. The use of seconds here seems to make the numbers much more awkward. It’d be more natural to talk about scaling trends for Elo vs action-temporal granularity. For example, “a 1:2 action ratio translates to a 1:3 win ratio advantage (+500 Elo)” or whatever. This lets you investigate arbitrary ratios like 3:2 and fill out the curves. (You’d wind up doing a transform like this anyway.)
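To make that transform concrete, here is a minimal sketch of the win-rate-to-Elo conversion you would end up applying (the function names are mine; the constants are the standard logistic Elo model):

```python
import math

def winrate_to_elo(p_win: float) -> float:
    """Elo difference implied by a win probability, under the standard logistic Elo model."""
    return 400 * math.log10(p_win / (1 - p_win))

def elo_to_winrate(elo_diff: float) -> float:
    """Inverse transform: expected score for the stronger side given an Elo gap."""
    return 1 / (1 + 10 ** (-elo_diff / 400))

print(winrate_to_elo(0.75))  # a 3:1 win ratio is roughly +191 Elo
print(elo_to_winrate(500))   # +500 Elo is roughly a 95% win rate
```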
Then you can start easily going through various scaling laws, like additional finetuning samples or parameter scaling vs Elo, and bring in the relevant DRL scaling literature like Jones and temporal scaling laws for horizons/duration. (For example, you could look at horizon scaling in terms of training samples: break up each full Starcraft episode to train on increasingly truncated samples.) The thresholds you talk about might be related to the irreducible loss of the horizon RL scaling law: if there is something that happens “too quick” each action-timestep, and there is no way to take actions which affect too-quick state changes, then those too-quick events will be irreducible by agents.
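As a concrete sketch of what fitting such a scaling law might look like, assuming the usual power-law-plus-irreducible-floor functional form (the data values below are made up purely for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, loss_floor, a, b):
    # L(n) = L_inf + a * n^(-b): power-law improvement that asymptotes at an irreducible floor
    return loss_floor + a * np.power(n, -b)

# hypothetical measurements: training samples (or horizon length) vs. loss / Elo deficit
n_train = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 3e5])
loss    = np.array([2.10, 1.72, 1.45, 1.28, 1.19, 1.15])

params, _ = curve_fit(scaling_law, n_train, loss, p0=[1.0, 10.0, 0.3])
loss_floor, a, b = params
print(f"estimated irreducible loss ~ {loss_floor:.2f}, exponent ~ {b:.2f}")
```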
Thanks for the excellent feedback. I did consider action ratio at first, but it does have some slightly different considerations that made it a little challenging to do for an initial pass. The first is based on current limitations with the TextSC2 framework—there isn’t a way to obtain detailed action logs for the in-game AI the same way we can for our agents, so it would require an “agent vs agent” setup instead of “agent vs in-game AI”. And while TextSC2 supports this, it currently does not allow for real-time play when doing it (probably because “agent vs agent” setups require running two instances of the game at once on the same machine, which would cause performance degradation, and there isn’t any netcode to run multiplayer games over a network). With that said, SC2 is a 15-year-old game at this point, and if someone has state-of-the-art hardware, it should be possible to run both instances with at least 40–50 fps, so this is something I would like to work on improving within the framework.
The second consideration is that not all actions are temporally equivalent, with some taking longer than others, so it may not be a true “apples-to-apples” comparison if the two agents employ strategies that use different mixes of actions. We would probably have to weight each action differently, increase the sample size to smooth out the noise, or both.
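One way to make the comparison closer to apples-to-apples would be to weight each action by its in-game duration rather than counting actions equally; a rough sketch (the action names and durations below are hypothetical, not taken from TextSC2):

```python
from collections import Counter

# hypothetical per-action durations in game seconds, for illustration only
ACTION_DURATION = {
    "move": 0.5,
    "attack": 0.8,
    "train_marine": 18.0,
    "build_supply_depot": 21.0,
}

def weighted_action_time(action_log, default=1.0):
    """Total game-time consumed by an agent's actions, instead of a raw action count."""
    counts = Counter(action_log)
    return sum(ACTION_DURATION.get(action, default) * n for action, n in counts.items())

# two agents with different strategy mixes can then be compared in game-time terms
micro_heavy = ["move", "attack", "move", "attack", "train_marine"]
macro_heavy = ["build_supply_depot", "train_marine"]
print(weighted_action_time(micro_heavy), weighted_action_time(macro_heavy))
```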
Regarding horizon lengths and scaling, I agree that this would be a great next direction of exploration, and suspect you may be correct regarding irreducible loss here. More broadly, it would be great to establish scaling laws that apply across different adversarial environments (beyond SC2). I think this could have a significant impact on a lot of the discourse around AI risk.
It sounds like SC2 might just be a bad testbed here. You should not have to be dealing with issues like “but can I get a computer fast enough to run it at a fast enough speedup”—that’s just silly and a big waste of your effort. Before you sink any more costs into shaving those and other yaks, it’s time to look for POMDPs which at least can be paused & resumed appropriately and have sane tooling, or better yet, have continuous actions/time so you can examine arbitrary ratios.
Also, I should have probably pointed out that one issue with using LLMs you aren’t training from scratch is that you have to deal with the changing action ratios pushing the agents increasingly off-policy. The fact that they are not trained or drawing from other tasks with similarly varying time ratios means that the worsening performance with worsening ratio is partially illusory: the slower player could play better than it does, it just doesn’t know how, because it was trained on other ratios. The kind of play one would engage in at 1:1 is different from the kind of play one would do at 10:1, or 1:10; eg a faster agent will micro the heck out of SC, while a slow agent will probably try to rely much more on automated base defenses which attack in realtime without orders and emphasize economy & grand strategy, that sort of thing. (This was also an issue with the chess hobbling experiments: Stockfish is going to do very badly when hobbled enough, like removing its queen, because it was never trained on such bizarre impossible scenarios / alternate rulesets.) Which is bad if you are using this as some sort of AI safety argument, because it will systematically deceive you, based on the hobbled off-policy agents, into thinking slowed-down agents are less capable (ie. safer) in general than they really are. This is another reason to not use SC2 or try to rely on transfer from a pre-existing model, convenient as the latter may be.
Given both these issues, you should probably instead think about Jones-style training an agent from scratch, simultaneously at all ratios, so that it meta-learns competency at every ratio while sharing training fairly, on a much simpler environment. Maybe not even a POMDP; MDPs might be adequate for most of it. Something like a large tic-tac-toe board, or perhaps a continuous Pong, would be simple enough that you could afford to train very competent unhobbled agents at widely-varying ratios, and fit various scaling laws, with few GPUs.
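A rough sketch of what training a single agent across all ratios simultaneously might look like; the environment and agent interfaces here are placeholders, not an existing library:

```python
import random

RATIOS = [1, 2, 3, 5, 10]  # the slow side acts once per `ratio` steps of the fast side

def self_play_episode(env, agent, ratio):
    """Run one episode in which player 0 acts every step and player 1 only every `ratio` steps."""
    obs = env.reset()
    done, step, reward = False, 0, 0.0
    while not done:
        a_fast = agent.act(obs, player=0)
        a_slow = agent.act(obs, player=1) if step % ratio == 0 else None  # slow side idles otherwise
        obs, reward, done = env.step(a_fast, a_slow)
        step += 1
    return reward

def train(env, agent, episodes=100_000):
    for _ in range(episodes):
        ratio = random.choice(RATIOS)         # sample a fresh ratio each episode so a single policy
        self_play_episode(env, agent, ratio)  # sees every granularity with an equal training share
        agent.update()                        # stand-in for whatever RL update rule is used
```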
Regarding the second issue (the point that the LLM may not know how to play at different time ratios) - I wonder whether this may be true for current LLM and DRL systems where inference happens in a single step (and where intelligence is derived mainly from training), BUT potentially not true for the next crop of reasoners, where a significant portion of intelligence is derived from the reasoning step that happens at inference time and which we are hoping to target with this scheme. One can imagine that sufficiently advanced AI, even if not explicitly trained for a given task, will eventually succeed at that task, by reasoning about it and extrapolating to new knowledge (as humans do) versus interpolating from limited training data. We could eventually have an “o2-preview” or “o3-preview” model that, through deductive reasoning and without any additional training, is able to figure out that it’s moving too slow to be effective, and adjust its strategy to rely more on grand strategy. This is the regime of intelligence that I think could be most vulnerable to a slowdown.
As to the first issue, there are lightweight RTS frameworks (eg, microRTS) that consume minimal compute and can be run in parallel, but without any tooling for LLMs specifically. I thought TextSC2 would be a good base because it not only offers this, but a few other advantages as well: 1) sufficiently deep action and strategy spaces that provide AI agents with enough degrees of freedom to develop emergent behavior, and 2) sufficient popularity whereby human evaluators can better understand, evaluate, and supervise AI agent strategy and behavior. With that said, if there are POMDPs or MDPs that you think are better fits, I would be very happy to check them out!