Similarly I would rate it well below “cat-level”, though I suspect there I have broader disagreements with you on how to relate ANNs and BNNs.
I’m curious what you suspect those broader disagreements are.
So imagine if we had a detailed cat-sim open world game, combined with the equivalent behavioral cloning data: extensive video data from cat eyes (or head cams), inferred skeleton poses, etc. Do you think that the VPT approach could be trained to effectiveness at that game in a comparable budget? The cat-sim game doesn’t seem intrinsically harder than Minecraft to me, as it’s more about navigation, ambush, and hunting than about tool/puzzle/planning challenges. Cats don’t seem to have great zero-shot puzzle-solving and tool-using abilities the way larger-brained ravens and primates do. Cat skills seem to me more about hand-paw coordination, as in action games like Atari, which tend to be easier.
Directly controlling a full cat skeleton may be difficult for a VPT-like system, but the cat cortex doesn’t actually do that either: the cat brain relies much more heavily on innate brainstem pattern generators, which the cortex controls indirectly (unlike in larger-brained primates/humans). The equivalent for VPT would be a SOTA game animation system (e.g. inverse kinematics + keyframes), which is then indirectly controlled from just keyboard/mouse.
The VPT input video resolution is aggressively downsampled and low-res compared to the cat retina, but that also seems mostly fixable with fairly simple known techniques, perhaps also borrowing from biology: the logarithmic retinotopic projection, retinal tracking, etc. (In the worst case we could employ bigger guns: there are known techniques from graphics for compressing/approximating sparse/irregular fields, such as the outputs of retinal/wavelet transforms, using distorted but fully regular dense meshes more suitable for input into the dense-matmul-based transformer vision pipeline.)
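To make the logarithmic projection idea a bit more concrete, here is a minimal sketch of log-polar resampling in plain Python. It is only an illustration under stated assumptions, not anything VPT actually does: the function name `log_polar_sample`, its parameters, and the nearest-neighbor lookup are all hypothetical choices. Ring radii grow geometrically, so samples are dense near the fixation point and sparse in the periphery, and the output is a small dense grid that a regular vision pipeline could consume.

```python
import math


def log_polar_sample(image, center, n_rings=8, n_wedges=16, r_min=1.0):
    """Resample a 2D grid (list of rows) onto a log-polar grid around `center`.

    Hypothetical helper for illustration only. Returns a dense
    n_rings x n_wedges list of lists using nearest-neighbor lookup.
    """
    if n_rings < 2:
        raise ValueError("need at least two rings")
    height, width = len(image), len(image[0])
    cy, cx = center
    # Largest radius that keeps every sample inside the image.
    r_max = min(cy, cx, height - 1 - cy, width - 1 - cx)
    if r_max <= r_min:
        raise ValueError("center is too close to the image border")

    samples = []
    for i in range(n_rings):
        # Geometric (logarithmic) spacing of ring radii from r_min to r_max.
        radius = r_min * (r_max / r_min) ** (i / (n_rings - 1))
        ring = []
        for j in range(n_wedges):
            theta = 2.0 * math.pi * j / n_wedges
            # Nearest-neighbor lookup, clamped to guard against float rounding.
            y = min(max(int(round(cy + radius * math.sin(theta))), 0), height - 1)
            x = min(max(int(round(cx + radius * math.cos(theta))), 0), width - 1)
            ring.append(image[y][x])
        samples.append(ring)
    return samples


# Usage: a 65x65 frame is reduced to an 8x16 grid centered on the "fovea".
frame = [[y * 65 + x for x in range(65)] for y in range(65)]
foveated = log_polar_sample(frame, center=(32, 32))
```

A real system would presumably interpolate and low-pass filter the periphery rather than pick single pixels, and would move `center` with gaze; both are left out here to keep the sketch short.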
So imagine if we had a detailed cat-sim open world game, combined with the equivalent behavioral cloning data: extensive video data from cat eyes (or head cams), inferred skeleton poses, etc. Do you think that the VPT approach could be trained to effectiveness at that game in a comparable budget?
Most sims are way way less diverse than the real world, which makes them a lot easier. If we somehow imagine that the sim is reflective of real-world diversity, then I don’t expect the VPT approach (with that compute budget) to get to the cat’s level of effectiveness.
Another part of where I’m coming from is that it’s not clear to me that VPT is particularly good at tool / puzzle / planning challenges, as opposed to memorizing the most common strategies that humans use in Minecraft.
You seem to be distinguishing the cat cortex in particular, and think that the cat cortex has a relatively easy time because other subsystems deal with a bunch of complexity. I wasn’t doing that; I was just imagining “impressiveness of a cat” vs “impressiveness of VPT”. I don’t know enough about cats to evaluate whether the thing you’re doing makes sense but I agree that if the cat brain “has an easier time” because of other non-learned systems that you aren’t including in your flops calculation, then your approach (and categorization of VPT as cat-level) makes more sense.
Most sims are way way less diverse than the real world, which makes them a lot easier
Sure but cats don’t really experience/explore much of the world’s diversity. Many housecats don’t see much more than the inside of a single house (and occasionally a vet).
Another part of where I’m coming from is that it’s not clear to me that VPT is particularly good at tool / puzzle / planning challenges, as opposed to memorizing the most common strategies that humans use in Minecraft.
Yeah clearly VPT isn’t learning strategies on its own, but the cat isn’t great at that either, and even humans learn much of Minecraft from YouTube. Cats obviously do have some amount of intrinsic learning, but it seems largely guided by simple instincts like “self-improve at ability to chase/capture smallish objects” (and easily fooled by novel distractors like lasers). So clearly we are comparing different learning algorithms, and the cat’s learning mechanisms are arguably more on the path to human/AGI, even though VPT learns more complex skills (via cloning), and arguably behavioral cloning is close to imitation learning, which is a key human ability.
The cortex is more than half of the synapses and thus flops—the brainstem’s flop contribution is a rounding error. But yeah the cortex “has an easier time” learning when the brainstem/oldbrain provides useful innate behaviors (various walking/jumping/etc animations) and proxy self-learning subsystems (like the chasing thing).