As a path to AGI, I think token prediction is too high-level and unwieldy, and it bakes in a number of human biases. You need to go right down to the fundamental level and optimize prediction over raw binary streams.
The source generating the binary stream can (and should, if you want AGI) be multimodal. At the extreme, this is simply a binary stream from a camera and microphone pointed at the world.
Learning to predict a sequence like this is going to lead to knowledge that humans don’t currently have, because the predictor would need to model fundamental physics and everything it entails.
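To make "prediction over raw binary streams" a bit more concrete, here is a toy sketch in Python. It is only an illustration under my own assumptions, not a claim about what a real system would look like: the names `bytes_to_bits` and `online_bit_loss` are made up for this example, and the predictor is a simple count-based context model with Laplace smoothing rather than anything learned at scale. The point is just the setup: no tokenizer, only bits in and a per-bit log loss out.

```python
import math
from collections import defaultdict


def bytes_to_bits(data: bytes) -> list[int]:
    """Flatten bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def online_bit_loss(bits: list[int], order: int = 8) -> float:
    """Predict each bit from the previous `order` bits, then learn from it.

    Returns the average log loss in bits per bit (1.0 is chance level).
    """
    # counts[context] = [times a 0 followed, times a 1 followed]
    counts = defaultdict(lambda: [0, 0])
    context = ()
    total_loss = 0.0
    for bit in bits:
        zeros, ones = counts[context]
        # Laplace-smoothed estimate of P(next bit = 1 | context)
        p_one = (ones + 1) / (zeros + ones + 2)
        p_actual = p_one if bit == 1 else 1.0 - p_one
        total_loss -= math.log2(p_actual)
        # Update only after predicting, so the loss is a true online loss
        counts[context][bit] += 1
        # Slide the context window; order 0 means "no context at all"
        context = (context + (bit,))[-order:] if order > 0 else ()
    return total_loss / len(bits)


if __name__ == "__main__":
    # Hypothetical stand-in for a sensor stream: any bytes will do
    stream = b"ab" * 200
    print(online_bit_loss(bytes_to_bits(stream), order=8))
```

The same loop works on any byte source (a file, a camera frame buffer, audio samples), which is the appeal: the only interface is the bit stream itself. Anything that drives the average loss below 1.0 bit per bit has picked up some structure in the source.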