There are many easy ways to incorporate vision. Vision+text models are a dime a dozen these days—as I said, this currently looks like ‘DALL-E 1 but bigger’ (VQVAE tokens → token sequence → autoregressive modeling of text/image tokens). What we have seen so far doesn’t look like 3 years of progress by the best DL researchers.
OpenAI has transitioned from being a purely research company to an engineering one. GPT-3 was still research after all, and it was trained with a relatively small amount of compute. After that, they had to build infrastructure to serve the models via API, plus new supercomputing infrastructure to train new models efficiently with 100x the compute of GPT-3.
The fact that we are openly hearing rumours of GPT-5 being trained, and nobody is denying them, suggests that they will likely ship a new version every year or so from now on.