It sounds like SC2 might just be a bad testbed here. You should not have to be dealing with issues like “but can I get a computer fast enough to run it at a fast enough speedup”—that’s just silly and a big waste of your effort. Before you sink any more costs into shaving those and other yaks, it’s time to look for POMDPs which at least can be paused & resumed appropriately and have sane tooling, or better yet, have continuous actions/time so you can examine arbitrary ratios.
Also, I should probably have pointed out that one issue with using LLMs you aren’t training from scratch is that the changing action ratios push the agents increasingly off-policy. Because they were not trained on, and cannot draw from, other tasks with similarly varying time ratios, the worsening performance with worsening ratio is partially illusory: the slower player could play better than it does, it just doesn’t know how, because it was trained on other ratios. The kind of play one would engage in at 1:1 is different from the kind of play one would do at 10:1, or 1:10; eg a faster agent will micro the heck out of SC, while a slow agent will probably emphasize economy & grand strategy and rely much more on automated base defenses which attack in realtime without orders, that sort of thing. (This was also an issue with the chess hobbling experiments: Stockfish is going to do very badly when hobbled enough, like removing its queen, because it was never trained on such bizarre impossible scenarios / alternate rulesets.) This is bad if you are using this as some sort of AI safety argument, because the hobbled off-policy agents will systematically deceive you into thinking slowed-down agents are less capable (ie. safer) in general than they really are. This is another reason to not use SC2 or try to rely on transfer from a pre-existing model, convenient as the latter may be.
Given both these issues, you should probably think instead about something more Jones-like: training an agent from scratch, simultaneously at all ratios, to meta-learn competency at all ratios while sharing training in a fair fashion, on a much simpler environment. Maybe not even a POMDP; MDPs might be adequate for most of it. Something like a large tic-tac-toe board, or perhaps a continuous Pong, would be simple enough that you could afford to train very competent unhobbled agents at widely-varying ratios, and fit various scaling laws, with only a few GPUs.
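To make "simultaneously at all ratios" a bit more concrete, here is a minimal sketch of how one might sample a ratio per episode and turn it into a per-tick schedule of who is allowed to act. Everything in it is an assumption for illustration, not something from Jones or from any existing framework: the function names (`sample_ratio`, `act_ticks`, `episode_plan`) are hypothetical, the log-uniform sampling is just one plausible way to weight 10:1 and 1:10 equally, and feeding the log-ratio to the policy is one guess at how an agent could learn ratio-appropriate play.

```python
import math
import random


def sample_ratio(rng, max_ratio=10.0):
    """Draw a speed ratio log-uniformly from [1/max_ratio, max_ratio].

    Log-uniform sampling (an assumption of this sketch) gives 10:1 and 1:10
    equal weight, so neither the fast nor the slow side dominates training.
    """
    log_max = math.log(max_ratio)
    return math.exp(rng.uniform(-log_max, log_max))


def act_ticks(ratio, n_ticks):
    """Return a list of (a_acts, b_acts) booleans, one pair per tick.

    ratio is the number of actions player A gets per action of player B.
    The faster player acts on every tick; the slower one acts whenever its
    accumulated action budget crosses the next whole number.
    """
    rate_a = min(1.0, ratio)
    rate_b = min(1.0, 1.0 / ratio)
    schedule = []
    for t in range(n_ticks):
        a_acts = math.floor((t + 1) * rate_a) > math.floor(t * rate_a)
        b_acts = math.floor((t + 1) * rate_b) > math.floor(t * rate_b)
        schedule.append((a_acts, b_acts))
    return schedule


def episode_plan(rng, n_ticks=100, max_ratio=10.0):
    """Plan one training episode: a sampled ratio plus its tick schedule.

    The log-ratio is returned so it can be appended to each observation,
    letting one policy condition its style of play on the ratio it faces.
    """
    ratio = sample_ratio(rng, max_ratio)
    return math.log(ratio), act_ticks(ratio, n_ticks)


if __name__ == "__main__":
    rng = random.Random(0)
    log_ratio, schedule = episode_plan(rng, n_ticks=20)
    a_moves = sum(a for a, _ in schedule)
    b_moves = sum(b for _, b in schedule)
    print(f"log ratio {log_ratio:+.2f}: A acts {a_moves}x, B acts {b_moves}x")
```

On ticks where a player's flag is false, the environment would simply repeat that player's last action (or a no-op). Since the ratio is just a float, a continuous-time environment could use arbitrary values rather than a fixed grid of ratios.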
Regarding the second issue (the point that the LLM may not know how to play at different time ratios): I wonder whether this may be true for current LLM and DRL systems, where inference happens in a single step (and where intelligence is derived mainly from training), but potentially not true for the next crop of reasoners, where a significant portion of intelligence is derived from the reasoning step that happens at inference time, which is what we are hoping to target with this scheme. One can imagine that a sufficiently advanced AI, even if not explicitly trained for a given task, will eventually succeed at that task by reasoning about it and extrapolating to new knowledge (as humans do), rather than interpolating from limited training data. We could eventually have an “o2-preview” or “o3-preview” model that, through deductive reasoning and without any additional training, is able to figure out that it’s moving too slowly to be effective, and adjust its strategy to rely more on grand strategy. This is the regime of intelligence that I think could be most vulnerable to a slowdown.
As to the first issue, there are lightweight RTS frameworks (eg, microRTS) that consume minimal compute and can be run in parallel, but they lack any tooling for LLMs specifically. I thought TextSC2 would be a good base because it not only offers such tooling, but has a few other advantages as well: 1) sufficiently deep action and strategy spaces to give AI agents enough degrees of freedom to develop emergent behavior, and 2) sufficient popularity that human evaluators can better understand, evaluate, and supervise AI agent strategy and behavior. With that said, if there are POMDPs or MDPs that you think are better fits, I would be very happy to check them out!