While this worked well, the model failed on even slightly more complicated problems. One Twitter user suggested inserting a random 'iPhone 15' into the book text and then asking the model whether anything in the book seemed out of place. The model failed to locate it.
The same happened when the model was asked to summarize a 30-minute Mr. Beast video (over 300k tokens). It generated a summary, but many people who had watched the video pointed out that it was mostly incorrect.
So while on paper this looked like a huge leap forward for Google, in practice it doesn't seem to be performing as well as they might have hoped.
But is this due to limitations of RLHF training, or something else?