I’ve seen a lot about GPT4o being kinda bad, and I’ve experienced that myself. This surprises me.
Now I will say something that feels like a silly idea. Is it possible that having the audio/visual part of the network cut off results in 4o’s poor reasoning?
As in, the whole model is doing some sort of audio/visual reasoning. But we don’t have the whole model, so it can’t reason in the way it was trained to.
If that is the case, I’d expect that when those parts are publicly released, scores on benchmarks will shoot up?
Do people smarter and more informed than me have predictions about this?
Without a detailed Model Card for 4o it is impossible to know “for sure” why models drift in performance over time, but drift they do.
It is entirely possible that OpenAI started with a version of GPT-4 Turbo, parallelized processing, and performed an extensive “fine-tune” to improve the multi-modal capabilities.
Essentially, the model could “forget” how to complete prompts from just a week ago, because some of its “memory” was overwritten with instructions to complete requests for multi-modal replies.
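To make the “forgetting” idea concrete, here is a minimal toy sketch (plain PyTorch, a tiny regression model, and made-up tasks; it has nothing to do with GPT-4o’s actual architecture or training data): fine-tuning on a new objective without replaying the old data typically degrades performance on the old one.

```python
# Toy illustration of catastrophic forgetting (hypothetical setup, not GPT-4o).
# Assumes PyTorch is installed.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two made-up "tasks": regression targets from two different fixed linear maps.
x = torch.randn(512, 16)
task_a_targets = x @ torch.randn(16, 1)
task_b_targets = x @ torch.randn(16, 1)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

def train(targets, steps=500):
    # Plain supervised training on one task only.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), targets)
        loss.backward()
        opt.step()

@torch.no_grad()
def eval_loss(targets):
    return loss_fn(model(x), targets).item()

train(task_a_targets)                 # "pre-training" on task A
loss_a_before = eval_loss(task_a_targets)

train(task_b_targets)                 # "fine-tune" on task B, no task-A data replayed
loss_a_after = eval_loss(task_a_targets)

print(f"task A loss before fine-tune: {loss_a_before:.4f}")
print(f"task A loss after fine-tune:  {loss_a_after:.4f}")  # typically much higher
```

Running it, the task-A loss usually jumps after the task-B fine-tune; mixing some of the original data back into the fine-tuning set usually blunts the effect. That is roughly the kind of overwriting I am speculating about above.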
I’m confused by what you mean when you say GPT-4o is bad. In my experience it has been stronger than plain GPT-4, especially at more complex stuff. I do physics research and it’s the first model that can actually improve the computational efficiency of parts of my code that implement physical models. It has also become more useful for discussing my research, in the sense that it dives deeper into specialized topics, while the previous GPT-4 would just respond in a very handwavy way.
Man, I wish that were my experience. I feel like I’m constantly asking GPT4o a question, getting a weird or bad response, and then switching to 4 to finish the job.
Benchmarks are consistent with GPT-4o having different strengths than GPT-4 Turbo, though at a similar overall level: EQ-Bench is lower, MAGI-Hard is higher, and it is the best tested model for Creative Writing according to Claude Opus, but notably worse at judging writing (though still good for its price point).
In my experience, different strengths also mean different prompt strategies are necessary; a small, highly instruction-focused model might benefit from few-shot repetition and emphasis that would just distract a more powerful OpenAI model, for example. That might make universal custom instructions more annoying.
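As a rough illustration of what I mean (the model names, few-shot examples, and routing below are placeholders, and it assumes the openai Python package with an API key configured; a sketch, not a recommendation):

```python
# Sketch: model-specific prompt scaffolding. Model names and examples are
# placeholders; assumes openai>=1.0 and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Heavier scaffolding that can help a small, instruction-focused model:
# repeated emphasis plus a worked few-shot example.
FEW_SHOT = [
    {"role": "system", "content": "Answer with ONLY a single word. One word. Nothing else."},
    {"role": "user", "content": "Capital of France?"},
    {"role": "assistant", "content": "Paris"},
]

# A stronger model usually just needs the plain instruction.
DIRECT = [
    {"role": "system", "content": "Answer with a single word."},
]

def ask(model: str, question: str, few_shot: bool) -> str:
    messages = (FEW_SHOT if few_shot else DIRECT) + [{"role": "user", "content": question}]
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

# Hypothetical routing: extra repetition only for the smaller model.
print(ask("gpt-4o-mini", "Capital of Japan?", few_shot=True))
print(ask("gpt-4o", "Capital of Japan?", few_shot=False))
```

The point is just that scaffolding which helps one model can be dead weight, or actively distracting, for another, so a single set of universal custom instructions ends up being a compromise.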