Here’s a simpler way to turn a generative model into a policy which doesn’t rely on actions being encoded into the state (which won’t be true in most settings and can’t be true in some—there are no ‘actions’ for a human moving around) or on reversing the image generator back to a prompt etc: assume your agent harness at least has a list of actions A. (In the case of Minecraft, I guess it’d be the full keyboard + mouse?) Treat Sora as a Decision Transformer, and prompt it with a goal like “A very skilled player creating a diamond, in the grass biome.”, initialized at the current actual agent state. Sample the next displayed state. Now, loop over each action A and add it to the prompt: “The player moves A”, and sample the next displayed state. Take whichever action A yielded a sample closest to the sample from the original action-free prompt (closest embedding, smallest pixel distance, most similar likelihood, etc.). This recovers, by blackbox generation alone, what action the internal imitated agent is taking. If it’s unclear (eg. due to perceptual aliasing, where the right action & the wrong action both lead to the same immediate displayed state), sample deeper and unroll until the consequences do become clear.
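A minimal sketch of that loop in Python, assuming a hypothetical `VideoModel.sample(prompt, init_frames, n_frames)` interface (Sora exposes no such API) and using pixel L2 distance as a crude stand-in for an embedding or likelihood comparison:

```python
# Sketch of extracting an action from a video generator treated as a Decision Transformer.
# Everything about the model interface here is assumed, not an actual Sora API.
from typing import Protocol, Sequence
import numpy as np


class VideoModel(Protocol):
    def sample(self, prompt: str, init_frames: Sequence[np.ndarray],
               n_frames: int) -> list[np.ndarray]:
        """Generate `n_frames` continuation frames conditioned on a text prompt
        and the frames observed so far (hypothetical interface)."""
        ...


def frame_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Pixel-space L2; swap in an embedding distance (e.g. CLIP) for robustness.
    return float(np.linalg.norm(a.astype(np.float32) - b.astype(np.float32)))


def choose_action(model: VideoModel,
                  goal_prompt: str,
                  history: list[np.ndarray],
                  actions: list[str],
                  horizon: int = 1,
                  max_horizon: int = 8,
                  ambiguity_margin: float = 1e-3) -> str:
    """Pick the action whose conditioned rollout best matches the action-free rollout.

    If the top two actions are indistinguishable (perceptual aliasing), unroll
    deeper until their consequences diverge or `max_horizon` is reached.
    """
    # 1. Action-free rollout: what the imitated "skilled player" does next.
    target = model.sample(goal_prompt, history, n_frames=horizon)

    # 2. Score each explicit action by how closely its rollout tracks the target.
    scores = []
    for a in actions:
        prompt = f"{goal_prompt} The player moves {a}."
        rollout = model.sample(prompt, history, n_frames=horizon)
        dist = sum(frame_distance(x, y) for x, y in zip(rollout, target))
        scores.append((dist, a))
    scores.sort()

    # 3. If the best and runner-up are too close to call, look further ahead.
    if (len(scores) > 1
            and scores[1][0] - scores[0][0] < ambiguity_margin
            and horizon < max_horizon):
        return choose_action(model, goal_prompt, history, actions,
                             horizon=horizon * 2, max_horizon=max_horizon,
                             ambiguity_margin=ambiguity_margin)
    return scores[0][1]
```

The names (`VideoModel`, `choose_action`, the distance threshold) are illustrative only; the point is just that the whole procedure needs nothing from the model beyond conditional sampling.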
This is not an efficient approach at all, but it is a minimal proof of concept of how to extract the implicit agency it has learned from imitation-learning modeling of humans & other agents. (I say ‘other agents’ to be clear that agency can be learned from anywhere; like, it seems obvious that they are using game engines, and if you are using a game engine, you will probably want it populated by AI agents inside the game, for scalability compared to using only human players.)