The sensitivity of benchmarks to prompt variations introduces inconsistencies in evaluation
When evaluating human intelligence, random variation is also something evaluators must deal with. Psychometricians have more or less solved this problem by designing intelligence tests to include a sufficiently large battery of correlated test questions. By administering a large battery of questions, one can average out item-level noise, in the same way that repeated samples from a distribution allow one to estimate the population mean.
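As a toy illustration of that averaging argument (my own sketch, with made-up numbers, not anything from the original post): if each item score is a noisy read on a fixed latent ability, the spread of the battery-average shrinks roughly as one over the square root of the number of items.

```python
# Minimal simulation: each item score = latent ability + noise;
# averaging a larger battery tightens the estimate of the ability.
import numpy as np

rng = np.random.default_rng(0)
true_ability = 100.0   # hypothetical latent score
item_noise_sd = 15.0   # hypothetical per-item noise

for n_items in (5, 20, 100):
    # Simulate many administrations of an n-item battery and keep the mean score of each.
    estimates = rng.normal(true_ability, item_noise_sd, size=(10_000, n_items)).mean(axis=1)
    print(f"{n_items:>3} items: estimate sd = {estimates.std():.2f} "
          f"(theory: {item_noise_sd / np.sqrt(n_items):.2f})")
```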
I suppose the difference between AI models and humans is that, through experience, we know the frontier of human intelligence can be more or less mapped by such batteries of tests. In contrast, you never know when an AI model (an “alien mind,” as you’ve written before) will reveal an advanced set of capabilities only under a particular kind of prompt.
The best way I can imagine to solve this problem is to try to understand the distribution under which AIs produce interesting intelligence. With the LLM Ethology approach this seems to cash out to: perhaps there are predictable ways that high-intelligence results can be elicited. We have already discovered a lot about what capabilities current LLMs have and how best to elicit the frontier of those capabilities.
I think this underscores the question: how much can we infer about capabilities elicitation in the next generation of LLMs from the current generation? Given their widespread use, the current generation is implicitly “crowdsourced,” and we get a good sense of their frontier. But we don’t have the opportunity to fully understand how best to elicit capabilities from an LLM before it is thoroughly tested. Any one test might fail to discover the full capabilities of a model, because no test can anticipate the full distribution of prompts. But if the principles for eliciting full capabilities are constant from one generation to the next, perhaps we can apply what we learned about the last generation to the next one.
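A hedged toy sketch of that last point (all numbers invented, not drawn from any real model): if each prompt variant elicits a different success rate, a benchmark with one fixed prompt reports a single draw from that distribution, while the frontier looks more like the best draw over many variants.

```python
# Toy model of the elicitation problem: imagine an unknown distribution of
# per-prompt success rates for a task. One fixed benchmark prompt is a single
# draw; trying many prompt variants reveals a much higher apparent ceiling.
import numpy as np

rng = np.random.default_rng(1)
per_prompt_success = rng.beta(2, 5, size=10_000)  # imagined distribution of elicited ability

print(f"typical single-prompt score: {per_prompt_success.mean():.2f}")
for k in (10, 100, 1000):
    best_of_k = per_prompt_success[:k].max()
    print(f"frontier seen with {k:>4} prompt variants: {best_of_k:.2f}")
```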