VPT learns to play minecraft as well as trained/expert humans
Um, what? This seems wildly false.
Do you think the MineRL BASALT Blue Sky award will get claimed this year? Seems like you should believe it’s almost a sure thing, since it involves finetuning VPT. (I’d offer to bet you on it but I’m one of the organizers of MineRL BASALT and so am not going to bet on its outcomes.)
Ok, after reading a bit more about the MineRL competition, I largely agree that “play minecraft as well as trained/expert humans” was false (and also largely contradicted by the model itself, as VPT doesn’t have anywhere near human-level training compute), and I’ve changed that to “diamond crafting ability”, which is more specifically accurate.
Your task is to create an agent which can obtain diamond shovel, starting from a random, fresh world . . . Sounds daunting? This used to be a difficult task, but thanks to OpenAI’s VPT models, obtaining diamonds is relatively easy. Building off from this model, your task is to add the part where it uses the diamonds to craft a diamond shovel instead of diamond pickaxe. You can find a baseline solution using the VPT model here. Find the barebone submission template here.
This does suggest—to me—that VPT was an impressive major advance.
After initial reading of the competition rules, it seems there is some compute/training limitation:
Validation: Organizers will inspect the source code of Top 10 submissions to ensure compliance with rules. The submissions will also be retrained to ensure no rules were broken during training (mainly: limited compute and training time).
But then that isn’t defined (or I can’t find it on the page)?
Given the unknown compute/training time limitations combined with the limitation on learning methods (no reward learning?), I’m pretty uncertain but would probably only put about 20% chance of the Blue Sky award being claimed this year.
Conditional on no compute/training or method limitations, and instead compute on the scale of the VPT foundation training itself (> 1e22 FLOPs), plus another year of research … I would give about a 60% chance of the Blue Sky award being claimed.
Submissions are limited to four days of compute on prespecified computing hardware to train models for all of the tasks. Hardware specifications will be shared later on the competition’s AICrowd page. In the previous year’s competition, this machine contained 6 CPU cores, 56GB of RAM and a single K80 GPU (12GB vRAM).
Notably they can use the pretrained VPT to start with. A model that actually played Minecraft as well as humans would have the capabilities to do any of the BASALT tasks so it would then just be a matter of finetuning the model to get it to exhibit those capabilities.
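For a sense of scale, the four-day budget quoted from the rules can be set against the "> 1e22 flops" figure mentioned upthread for VPT foundation training. This is only a rough back-of-the-envelope sketch: the peak throughput value below is an assumed placeholder for the "single K80 GPU", not a number from the rules, and real training throughput is usually well below peak.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# ASSUMPTION: rough peak single-precision throughput (FLOP/s) for the
# "single K80 GPU" in the rules. Placeholder value, not from the rules page.
ASSUMED_GPU_PEAK_FLOPS = 4e12

# From the thread: VPT foundation training is put at "> 1e22 flops".
VPT_FOUNDATION_FLOPS = 1e22


def budget_flops(days: float, flops_per_second: float) -> float:
    """Upper bound on total FLOPs available in `days` at a given peak rate."""
    return days * SECONDS_PER_DAY * flops_per_second


def foundation_to_budget_ratio(days: float, flops_per_second: float) -> float:
    """How many times larger the foundation training figure is than the budget."""
    return VPT_FOUNDATION_FLOPS / budget_flops(days, flops_per_second)


if __name__ == "__main__":
    budget = budget_flops(4, ASSUMED_GPU_PEAK_FLOPS)
    ratio = foundation_to_budget_ratio(4, ASSUMED_GPU_PEAK_FLOPS)
    print(f"finetuning budget (upper bound): {budget:.2e} FLOPs")
    print(f"foundation training is roughly {ratio:,.0f}x larger")
```

Under that assumption the finetuning budget comes out three to four orders of magnitude below the foundation training figure, so nearly all of the capability would have to come from the pretrained model.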
combined with the limitation on learning methods (no reward learning?)
You can use reward learning, what gives you the impression that you can’t? (The retraining involves human contractors who will provide the human feedback for solutions that require this.)
This does suggest—to me—that VPT was an impressive major advance.
I agree that VPT was a clear advance / jump in Minecraft-playing ability. I was just objecting to the “performs as well as humans”. (Similarly I would rate it well below “cat-level”, though I suspect there I have broader disagreements with you on how to relate ANNs and BNNs.)
Similarly I would rate it well below “cat-level”, though I suspect there I have broader disagreements with you on how to relate ANNs and BNNs.
I’m curious what you suspect those broader disagreements are.
So imagine if we had a detailed cat-sim open world game, combined with the equivalent behavioral cloning data: extensive video data from cat eyes (or head cams), inferred skeleton poses, etc. Do you think that the VPT approach could be trained to effectiveness at that game in a comparable budget? The cat-sim game doesn’t seem intrinsically harder than Minecraft to me, as it’s more about navigation, ambush, and hunting rather than tool/puzzle/planning challenges. Cats don’t seem to have great zero-shot puzzle-solving and tool-using abilities the way larger-brained ravens and primates do. Cat skills seem to me more about hand-paw coordination, as in action games like Atari, which tend to be easier.
Directly controlling a full cat skeleton may be difficult for a VPT-like system, but the cat cortex doesn’t actually do that either: the cat brain relies much more heavily on innate brainstem pattern generators which the cortex controls indirectly (unlike in larger-brained primates/humans). The equivalent for VPT would be a SOTA game animation system (e.g. inverse kinematics + keyframes) which is then indirectly controlled from just keyboard/mouse.
The VPT input video resolution is aggressively downsampled and low-res compared to cat retina, but that also seems mostly fixable with fairly simple known techniques, and perhaps also borrowing from biology like the logarithmic retinotopic projection, retinal tracking, etc. (and in the worst case we could employ bigger guns: there are known techniques from graphics for compressing/approximating sparse/irregular fields such as the outputs from retinal/wavelet transforms using distorted but fully regular dense meshes more suitable for input into the dense matmul-based transformer vision pipeline).
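To make the logarithmic projection idea a bit more concrete: a log-polar sampling grid places samples densely around a fixation point and geometrically sparser toward the periphery, while still giving a regular rings-by-wedges array. The sketch below is purely illustrative; the function name and parameters are hypothetical and this is not anything from the VPT pipeline.

```python
import math


def log_polar_grid(cx, cy, r_min, r_max, n_rings, n_wedges):
    """Return (x, y) sample points on a log-polar grid centered at (cx, cy).

    Ring radii are geometrically spaced from r_min to r_max, so sample
    density falls off with distance from the fixation point. Points are
    ordered ring by ring, which keeps the output a regular
    n_rings x n_wedges array even though image-space spacing is irregular.
    """
    points = []
    for i in range(n_rings):
        t = i / (n_rings - 1) if n_rings > 1 else 0.0
        radius = r_min * (r_max / r_min) ** t
        for j in range(n_wedges):
            theta = 2 * math.pi * j / n_wedges
            points.append((cx + radius * math.cos(theta),
                           cy + radius * math.sin(theta)))
    return points


if __name__ == "__main__":
    grid = log_polar_grid(cx=64, cy=64, r_min=1, r_max=64, n_rings=7, n_wedges=16)
    print(len(grid), "samples instead of", 128 * 128, "pixels")
```

Sampling an image at these points (with some interpolation) would give a small dense tensor with high resolution only near the center, which is the kind of input a dense vision model can consume directly.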
So imagine if we had a detailed cat-sim open world game, combined with the equivalent behavioral cloning data: extensive video data from cat eyes (or head cams), inferred skeleton poses, etc. Do you think that the VPT approach could be trained to effectiveness at that game in a comparable budget?
Most sims are way way less diverse than the real world, which makes them a lot easier. If we somehow imagine that the sim is reflective of real-world diversity, then I don’t expect the VPT approach (with that compute budget) to get to the cat’s level of effectiveness.
Another part of where I’m coming from is that it’s not clear to me that VPT is particularly good at tool / puzzle / planning challenges, as opposed to memorizing the most common strategies that humans use in Minecraft.
You seem to be distinguishing the cat cortex in particular, and think that the cat cortex has a relatively easy time because other subsystems deal with a bunch of complexity. I wasn’t doing that; I was just imagining “impressiveness of a cat” vs “impressiveness of VPT”. I don’t know enough about cats to evaluate whether the thing you’re doing makes sense but I agree that if the cat brain “has an easier time” because of other non-learned systems that you aren’t including in your flops calculation, then your approach (and categorization of VPT as cat-level) makes more sense.
Most sims are way way less diverse than the real world, which makes them a lot easier
Sure but cats don’t really experience/explore much of the world’s diversity. Many housecats don’t see much more than the inside of a single house (and occasionally a vet).
Another part of where I’m coming from is that it’s not clear to me that VPT is particularly good at tool / puzzle / planning challenges, as opposed to memorizing the most common strategies that humans use in Minecraft.
Yeah, clearly VPT isn’t learning strategies on its own, but the cat isn’t great at that either, and even humans learn much of Minecraft from YouTube. Cats obviously do have some amount of intrinsic learning, but it seems largely guided by simple instincts like “self-improve at ability to chase/capture smallish objects” (and easily fooled by novel distractors like lasers). So clearly we are comparing different learning algorithms, and the cat’s learning mechanisms are arguably more on the path to human/AGI, even though VPT learns more complex skills (via cloning), and arguably behavioral cloning is close to imitation learning, which is a key human ability.
The cortex is more than half of the synapses and thus flops—the brainstem’s flop contribution is a rounding error. But yeah the cortex “has an easier time” learning when the brainstem/oldbrain provides useful innate behaviors (various walking/jumping/etc animations) and proxy self-learning subsystems (like the chasing thing).
Thanks for catching that. I’m just editing that section right now adding VPT as we speak, so I’m glad I caught this comment, as now I’m going to read the paper (and competition link) in more detail. I predict I’ll update close to your position concerning current expert human-level play; my knowledge/prior around Minecraft is probably wildly out of date and based on my own limited experiences.
How far is that from your estimates?
That all seems reasonable to me.
From the rules: