Voyager is a scaffolded LLM agent that plays Minecraft decently well (by pulling in a textual description of the game state, and writing code interfacing with an API). It is based on some very detailed prompting (see the appendix), but obviously could not function without the higher-level control flow and several distinct components that the scaffolding implements.
That’s a good example, thank you! I actually now remembered looking at this a few weeks ago and thinking about it as an interesting example of scaffolding. Thanks for reminding me.
I agree that an end-to-end trained agent could be trained to be better. But such training is expensive, and it seems like for many tasks, before we see an end-to-end trained model doing well at it, someone will hack together some scaffold monstrosity that does it passably well. In general, the training/inference compute asymmetry means that using even relatively large amounts of inference to replicate the performance of a larger / more-trained system on a task may be surprisingly competitive.
I do wonder how much of this is just the result of an access gap. Getting one of these scaffolded systems to work seems also a lot of hassle and very fiddly, and my best guess is that if OpenAI wanted to solve this problem, they would probably just reinforcement learn a bunch, and then maybe they would do a bit of scaffolding, but the scaffolding would be a lot less detailed and not really be that important to the overall performance of the system.
That’s a good example, thank you! I actually now remembered looking at this a few weeks ago and thinking about it as an interesting example of scaffolding. Thanks for reminding me.
I do wonder how much of this is just the result of an access gap. Getting one of these scaffolded systems to work seems also a lot of hassle and very fiddly, and my best guess is that if OpenAI wanted to solve this problem, they would probably just reinforcement learn a bunch, and then maybe they would do a bit of scaffolding, but the scaffolding would be a lot less detailed and not really be that important to the overall performance of the system.