Regarding the second issue (the point that the LLM may not know how to play at different time ratios) - I wonder whether this may be true for current LLM and DRL systems, where inference happens in a single step and intelligence is derived mainly from training, BUT potentially not true for the next crop of reasoners, where a significant portion of intelligence comes from the reasoning step that happens at inference time - the step we are hoping to target with this scheme. One can imagine that a sufficiently advanced AI, even if not explicitly trained for a given task, will eventually succeed at it by reasoning about it and extrapolating to new knowledge (as humans do), rather than interpolating from limited training data. We could eventually have an “o2-preview” or “o3-preview” model that, through deductive reasoning and without any additional training, figures out that it’s moving too slowly to be effective and adjusts by leaning more on grand strategy. This is the regime of intelligence that I think could be most vulnerable to a slowdown.
As to the first issue, there are lightweight RTS frameworks (e.g., microRTS) that consume minimal compute and can be run in parallel, but they lack any tooling for LLMs specifically. I thought TextSC2 would be a good base because it not only offers this but also a few other advantages: 1) action and strategy spaces deep enough to give AI agents the degrees of freedom to develop emergent behavior, and 2) enough popularity that human evaluators can readily understand, evaluate, and supervise AI agent strategy and behavior (see the sketch below for how the time-ratio mechanic might be wired into such an environment). With that said, if there are POMDPs or MDPs that you think are better fits, I would be very happy to check them out!
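To make the time-ratio mechanic concrete, here is a minimal sketch of the accounting I have in mind, assuming a hypothetical TextSC2-style text interface. `TextRTSEnv`, `play`, and the lambda agent are illustrative names, not the real TextSC2 API; only the 22.4 frames-per-second game speed is taken from actual SC2. The idea is simply that every wall-clock second the agent spends on inference is charged as in-game time, scaled by the ratio, so a slower reasoner forfeits more of the game to its opponent:

```python
import time

# Hypothetical stand-in for a TextSC2-style text interface; the real
# wrapper's API will differ. Only the time-ratio accounting matters here.
class TextRTSEnv:
    FRAMES_PER_SECOND = 22.4  # SC2's simulation rate at "faster" speed

    def __init__(self):
        self.frame = 0

    def observe(self) -> str:
        return f"frame={self.frame}: <text observation>"

    def step(self, action: str, frames_elapsed: int) -> None:
        # Advance the simulation; the opponent keeps acting during every
        # frame the agent spent "thinking".
        self.frame += frames_elapsed


def play(env: TextRTSEnv, agent, time_ratio: float, num_turns: int) -> None:
    """Charge the agent in-game time for its wall-clock inference time.

    time_ratio scales how expensive thinking is: at 1.0, one real second
    of inference costs one in-game second; larger ratios model the
    slowdown regime discussed above.
    """
    for _ in range(num_turns):
        obs = env.observe()
        start = time.monotonic()
        action = agent(obs)  # the LLM/reasoner call would go here
        thinking_seconds = time.monotonic() - start
        frames = int(thinking_seconds * time_ratio * env.FRAMES_PER_SECOND)
        env.step(action, max(frames, 1))  # always advance at least 1 frame


if __name__ == "__main__":
    env = TextRTSEnv()
    play(env, agent=lambda obs: "noop", time_ratio=10.0, num_turns=3)
    print(f"game advanced to frame {env.frame}")
```

Under this accounting, an agent that reasons for longer per decision sees the game state drift further between its actions, which is exactly the pressure that should push a capable reasoner toward the grand-strategy play described above.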