The closest we can get is a small benchmark where the models are supposed to retrieve "a needle out of a haystack": something like a story of 1 million tokens from which they must retrieve a single fact.
This isn’t “the closest we can get”. Needle-in-a-haystack tests seem like a sensible starting point, but testing long-context utilization in general involves synthesis of information, e.g. looking at a novel or series of novels and answering reading comprehension questions. There are several benchmarks of this sort, e.g.:
https://epoch.ai/benchmarks/fictionlivebench
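
For concreteness, here's a minimal sketch of how a needle-in-a-haystack test of the kind described above is typically constructed: embed one distinctive fact at a random position in a long stretch of filler text, then ask the model to retrieve it. `query_model` is a hypothetical stand-in for whatever model/API call you'd actually use.

```python
import random

# Hypothetical stand-in for an actual LLM call; assumed to take a
# prompt string and return the model's text response.
def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model or API client here")

def build_haystack(needle: str, filler_sentence: str, n_filler: int, seed: int = 0) -> str:
    """Embed one 'needle' fact at a random position inside repeated filler text."""
    rng = random.Random(seed)
    sentences = [filler_sentence] * n_filler
    sentences.insert(rng.randrange(len(sentences) + 1), needle)
    return " ".join(sentences)

def run_needle_test() -> bool:
    needle = "The secret number Dmitri chose was 4817."
    haystack = build_haystack(
        needle,
        filler_sentence="The rain fell steadily on the quiet town.",
        n_filler=50_000,  # scale this up toward the model's context limit
    )
    prompt = (
        f"{haystack}\n\n"
        "Question: What was the secret number Dmitri chose? "
        "Answer with the number only."
    )
    answer = query_model(prompt)
    return "4817" in answer  # crude pass/fail check on retrieval

if __name__ == "__main__":
    print("retrieved needle:", run_needle_test())
```

Note that a single pass like this only probes retrieval of one isolated fact; the synthesis-style benchmarks linked above go further by requiring the model to combine information spread across the whole context.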
Yes, I think what I’ve described here shares a lot with Bengio’s program.